Worker Nodes Usage

This guide provides an overview of, and a practical introduction to, the fundamental commands and usage patterns of the GRAVITON cluster's job scheduler. In high-performance computing, efficient and effective job scheduling is crucial. GRAVITON employs HTCondor, a well-established, open-source distributed job management system, to orchestrate computational tasks across its network. HTCondor is engineered to optimize the utilization of computing resources, enabling users to queue, manage, and monitor jobs across a distributed computing infrastructure. It is flexible, powerful, and capable of handling a wide array of tasks, making it well suited to environments where resource management and task scheduling are of paramount importance.

Basic Commands

Below is a basic guide for users, covering essential commands and an example submission file named hello_world.sub.

  1. Submitting a Job: To submit a job to HTCondor, use the condor_submit command followed by the name of your submission file.

    graviton_user@gr01:~$ condor_submit hello_world.sub
  2. Monitoring Your Jobs: The condor_q command displays the status of your submitted jobs.

    graviton_user@gr01:~$ condor_q
  3. Removing a Job: To remove a job from the queue, use the condor_rm command with your job ID, as in the example after this list.

    graviton_user@gr01:~$ condor_rm [Job ID]
  4. Pool Status: You can check the status of all machines (worker node slots) in the pool using condor_status.

    graviton_user@gr01:~$ condor_status
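For example, if condor_q reports your job under the ID 1234 (a hypothetical value; the actual ID is shown in the condor_q output), the removal command would be:

    graviton_user@gr01:~$ condor_rm 1234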

HTCondor Vanilla Universe

The vanilla universe in HTCondor is intended for most programs; shell scripts are another common case where it is useful.

To execute a specific command or program through HTCondor using the vanilla universe, the most convenient approach is to wrap it in a bash executable; this wrapper script is what is sent to the job scheduler. Let's imagine we want to execute the Python 3 code hello_world.py through HTCondor.
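For reference, hello_world.py can be as simple as the following (a minimal placeholder; any Python script would behave the same way):

#!/usr/bin/env python3
#Minimal example payload: prints a greeting to standard output,
#which HTCondor captures in the file given by 'output' in the submission file.
print("Hello, World!")

First, we will create the bash wrapper script (hello_world.sh):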

#!/bin/bash

##Uncomment this part if you use a conda environment
##-------------------------------
#EXERCISE_ENVIRONMENT="environment_name"
#eval "$(conda shell.bash hook)"
#conda activate $EXERCISE_ENVIRONMENT
##--------------------------------


##script
##----------
##executable arg_1 arg_2

python3 hello_world.py

Once the bash script is created, we must give it execution permissions by running the following command in the CLI:

graviton_user@gr01:~$ chmod +x hello_world.sh

Now that the file has execution permissions, we can use it in HTCondor. To do this, we need to create the submission file (hello_world.sub). Here is an example for the vanilla universe:

#########################################
#Do not modify this part
universe = vanilla
#########################################

#name of the bash script
executable              = hello_world.sh
arguments               = $(Process)

#Path to the log files (before your first run, create the condor_logs directory)
log                     = condor_logs/log.log
output                  = condor_logs/outfile.$(Cluster).$(Process).out
error                   = condor_logs/errors.$(Cluster).$(Process).err

##Uncomment this line if you use conda environments
#getenv = True

#number of CPUs requested
request_cpus = 3


#For cases where the required files are not in the /home directory
##################################
#should_transfer_files = yes
#when_to_transfer_output = on_exit
#transfer_input_files = file_1, file_2

#Requirements
#####################
#To exclude specific machines
#requirements = ( Machine != "gr01" && Machine != "gr02" )
#To run in a specific machine
#requirements = ( Machine == "gr03")

#send to the queue
queue
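Before the first run, create the log directory referenced above; then submit the job with condor_submit:

    graviton_user@gr01:~$ mkdir -p condor_logs
    graviton_user@gr01:~$ condor_submit hello_world.sub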

Detailed Explanation of an HTCondor Vanilla Universe Submission File

  • universe = vanilla: Specifies the HTCondor “universe” for the job. The vanilla universe is a basic execution environment suitable for most jobs.

  • executable = hello_world.sh: This line defines the script or executable that HTCondor will run, which in this case is hello_world.sh.

  • arguments = $(Process): Sets the arguments to be passed to the executable. $(Process) is a built-in HTCondor variable indicating the process number in a batch of jobs.

  • log = condor_logs/log.log: Path to the file where HTCondor will write the job’s log.

  • output = condor_logs/outfile.$(Cluster).$(Process).out: Specifies the path and file name for standard output. $(Cluster) and $(Process) are variables representing the cluster ID and process number, respectively.

  • error = condor_logs/errors.$(Cluster).$(Process).err: Path and name for the file where standard error output will be written.

  • getenv = True: Uncommenting this line allows the HTCondor job to inherit environment variables from the submitting environment, useful for Conda environments or specific environment variables.

  • request_cpus = 3: Indicates the number of CPUs requested for the job.

  • should_transfer_files = yes: Specifies whether input and output files should be transferred between the submit machine and the execution node.

  • when_to_transfer_output = on_exit: Determines when to transfer output files, in this case, upon job completion.

  • transfer_input_files = file_1, file_2: Lists the files to be transferred to the execution node.

  • requirements = (Machine != "gr01" && Machine != "gr02"): Sets specific requirements for the execution machine, excluding certain machines.

  • requirements = (Machine == "gr03"): Restricts job execution to a specific machine.

  • queue: This command places the job into the HTCondor queue for execution. Without this line, the job would not be submitted. As shown in the sketch after this list, queue also accepts an optional count for submitting several instances at once.
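A common variation (not used in the example above) is to pass a count to queue. Each instance then receives its own $(Process) value, which in our submission file is forwarded to hello_world.sh as its first argument:

#Submit five independent instances of the same job;
#$(Process) takes the values 0 through 4, one per instance.
queue 5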

Each line in this submission file configures how HTCondor will handle and execute your job, from setting the execution environment to specifying system resources and requirements.

HTCondor Parallel Universe

The parallel universe allows parallel programs, such as MPI jobs, to be run within HTCondor, making it possible to use several of GRAVITON's worker nodes for a single parallel computation.

Imagine we want to launch our C++ program hello_world_mpi.cpp through HTCondor using several worker nodes. First, we need to compile the program on the User Interface, exactly as if we were going to launch it directly from the CLI.
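For reference, a minimal hello_world_mpi.cpp consistent with the output shown later might look like this (a sketch; the actual program may differ):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    //Start the MPI runtime and query this process's identity.
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); //rank of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size); //total number of processes

    if (rank == 0)
        std::printf("Hello World from the main process (rank 0) of %d processes.\n", size);
    else
        std::printf("Hello World from process %d of %d.\n", rank, size);

    MPI_Finalize();
    return 0;
}

We compile it with the MPI compiler wrapper: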

graviton_user@gr01:~$ mpicxx -o hello_world_mpi hello_world_mpi.cpp

Although GRAVITON allows for compilation on worker nodes, it’s usually more convenient to compile on the User Interface. Next, we will have to send a submission file (hello_world_mpi.sub) with the following structure to the scheduler:

#########################################
#Do not modify this part
universe = parallel
executable = /scripts_condor/parallel/openmpiscript
#########################################

# mpi executable and arguments:
#arguments = executable arg1 arg2 arg3
arguments = hello_world_mpi

# Number of machines requested
machine_count = 2
# CPUs per machine
request_cpus = 45

#Path to the log files (In your first run, make the directory condor_logs)
log                     = condor_logs/logs.log
output                  = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).out
error                   = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).err

+ParallelShutdownPolicy = "WAIT_FOR_ALL"

#For cases where the required files are not in the /home directory
##################################
#should_transfer_files = yes
#when_to_transfer_output = on_exit
#transfer_input_files = file_1, file_2

#Requirements
#####################
#To exclude specific machines
#requirements = ( Machine != "gr01" && Machine != "gr02" )
#To run in a specific machine
#requirements = ( Machine == "gr03")

#send to the queue
queue

In this way, in our example, the total number of CPUs used will be machine_count x request_cpus = 2 x 45 = 90 CPUs. The output obtained when launching our code through HTCondor will be:

Hello World from the main process (rank 0) of 90 processes.
Hello World from process 3 of 90.
Hello World from process 62 of 90.
Hello World from process 4 of 90.
...

Detailed Explanation of an HTCondor Parallel Universe Submission File

  • universe = parallel: Specifies the HTCondor universe as parallel. This universe is used for parallel jobs, typically involving MPI (Message Passing Interface).

  • executable = /scripts_condor/parallel/openmpiscript: The path to the executable that HTCondor will run. In this case, it is an MPI wrapper script located at /scripts_condor/parallel/openmpiscript. It is provided with HTCondor and normally does not need to be modified. However, if the structure of the job you are trying to run requires changes to it, you can copy it to your local directory and make the appropriate modifications there.

  • arguments = hello_world_mpi: Defines the arguments passed to the MPI wrapper script: our program hello_world_mpi, followed by any arguments it needs.

  • machine_count = 2: Specifies the number of Worker Nodes requested for the parallel job.

  • request_cpus = 45: Indicates the number of CPUs per Worker Node requested for the job.

  • log = condor_logs/logs.log: Path for the logs file.

  • output = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).out: Path and filename pattern for standard output. $(NODE) is a variable specific to parallel universe jobs.

  • error = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).err: Path and filename pattern for standard error output.

  • +ParallelShutdownPolicy = "WAIT_FOR_ALL": This line specifies the shutdown policy for parallel jobs. WAIT_FOR_ALL means the job will not complete until all parallel nodes have completed.

  • should_transfer_files = yes: Determines if files need to be transferred to the execution node.

  • when_to_transfer_output = on_exit: Specifies when to transfer output files, usually upon job completion.

  • transfer_input_files = file_1, file_2: Lists files to be transferred to the execution node.

  • requirements = (Machine != "gr01" && Machine != "gr02"): Sets specific requirements for the execution machine, excluding certain machines.

  • requirements = (Machine == "gr03"): Restricts job execution to a specific machine.

  • queue: Places the job into the HTCondor queue for execution.
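Submission and monitoring then work exactly as in the vanilla case:

    graviton_user@gr01:~$ condor_submit hello_world_mpi.sub
    graviton_user@gr01:~$ condor_q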