Portable Distributed Scripts (PoDS)
Run many independent serial jobs concurrently
Some users may want to run a large set of sequential jobs, for example, post-processing or data archiving jobs, on Discover. Portable Distributed Scripts (PoDS) is a set of scripts created by SSSO that enables users to execute a series of independent, sequential jobs concurrently on Discover's multi-core nodes.
Please refer to Modeling Guru for detailed information on PoDS. Brief instructions for setting up PoDS jobs on Discover are summarized below. Here is a simple batch job script using PoDS:
#!/usr/bin/csh
#SBATCH -J Test_Process_data
#SBATCH --ntasks=18 --ntasks-per-node=9
#SBATCH --time=12:00:00
#SBATCH -o output.%j
#SBATCH --account=xxxx
# Set and move to the working directory that contains the execution file.
setenv workdir /discover/nobackup/myuserid/workdir
cd $workdir
# Launch PoDS on the execution file, running 9 tasks concurrently per node.
/usr/local/other/PoDS/PoDS/pods.py -x $workdir/execfile -n 9
exit 0
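Assuming the script above is saved as, for example, pods_job.csh (the file name is arbitrary), it is submitted like any other Slurm batch job:
sbatch pods_job.csh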
The syntax for invoking the PoDS script is as follows:
/usr/local/other/PoDS/PoDS/pods.py -x /absolute/path/to/execfile -n cpus_per_node
The execution file contains the list of commands to be executed, one command per line. For example, the "execfile" may look like the following:
./process_data1.sh 1 > $TMPDIR/out.1.1
./process_data2.sh 1 > $TMPDIR/out.2.1
./process_data3.sh 1 > $TMPDIR/out.3.1
./process_data1.sh 2 > $TMPDIR/out.1.2
./process_data2.sh 2 > $TMPDIR/out.2.2
./process_data3.sh 2 > $TMPDIR/out.3.2
./process_data6.sh 6 > $TMPDIR/out.6.6
Here, the "execfile" may contain many lines of independent tasks. In this example, 18 of them will be running concurrently on 2 compute nodes, with 9 tasks on each node.
Special characters and output redirection
Note: Important! If each command writes a lot to standard output, not following the suggestions below can severely degrade job performance or even cause the job to be killed prematurely.
PoDS commands are interpreted by /bin/sh as they are passed to ssh for launching on remote nodes. This means any special characters (e.g., variables, output redirection) are interpreted by /bin/sh on the node launching the pods command. For example, with the sample "execfile" above, the standard output from the process_dataX.sh scripts would be passed over ssh back to the head node of the job, and the head node would perform the redirection to the output files. If the job contains a large number of tasks that write frequently to standard output, this can overwhelm the head node with output redirection.
To avoid this, simply escape the redirection character by placing a \ before the >. The above "execfile" would then look like this:
./process_data1.sh 1 \> $TMPDIR/out.1.1
./process_data2.sh 1 \> $TMPDIR/out.2.1
./process_data3.sh 1 \> $TMPDIR/out.3.1
./process_data1.sh 2 \> $TMPDIR/out.1.2
./process_data2.sh 2 \> $TMPDIR/out.2.2
./process_data3.sh 2 \> $TMPDIR/out.3.2
./process_data6.sh 6 \> $TMPDIR/out.6.6
Note that in this case the value of $TMPDIR will still be interpreted by the shell on the head node of the job. This is generally fine, but if, for example, you wanted to include a node's hostname anywhere in the job, you would escape the $HOSTNAME variable with a \ so that variable interpolation happens on the remote host running that command entry. Here's an example:
./process_data1.sh \$HOSTNAME
./process_data2.sh \$HOSTNAME
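The two escapes can also be combined in a single entry. For example (using the same hypothetical script and file names as above), the following redirects a task's output on the remote node into a file named after the node that ran it:
./process_data1.sh 1 \> $TMPDIR/out.1.\$HOSTNAME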