Batch Submission on FSGI02

Beginning the week of April 28, long-running jobs run on fnalu outside the batch queue have been reniced to -19, and a message sent to their owners, once their accumulated CPU time exceeded 30 minutes. We are monitoring the effect of this on machine performance, but indications are that it will be insufficient to free up machine resources under heavy load. The next step will be to kill long-running background/interactive jobs run outside the batch system. This will be implemented first on FDEI01, no sooner than 19 May, and the following week on FSGI02 and FSUI02. Please use the interim period to determine whether your processing jobs will run under the fbatch system.

Documentation for the fbatch commands (and the underlying LSF) is available in the man pages and on the Computing Division Web pages. I recommend that average SELEX users restrict themselves to fsgi02 for the time being (due to our heavy use of local disk there).

Here is a sample soap job that I just ran as batch on fsgi02.

I set up the following products:

setup -d off781
setup fbatch

I created a sample soap command file by appending the following lines to the file $OFF781_EXA/filter.cmd (note: do NOT use $OFF781_EXA/example.cmd, which is broken):

disk in /usr/e781/data01/standard/charm_run008915_001
anal 1
exit
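As a rough sketch, the append step could be scripted as follows. The function name make_cmd is hypothetical, and so is the assumption that you work on a local copy named example.cmd (the name used in the submission below) rather than editing the file under $OFF781_EXA directly.

```shell
# make_cmd SRC DEST: copy the command file SRC to DEST, then append the
# three sample lines shown above. (make_cmd is a hypothetical helper name.)
make_cmd() {
    cp "$1" "$2" || return 1
    cat >> "$2" <<'EOF'
disk in /usr/e781/data01/standard/charm_run008915_001
anal 1
exit
EOF
}

# Assumed usage, after "setup -d off781" has defined $OFF781_EXA:
#   make_cmd "$OFF781_EXA/filter.cmd" example.cmd
```

Working on a copy leaves the distributed filter.cmd untouched, so the same helper can be reused for other input files by changing the appended lines.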

Then I could submit a job in the following manner:

 fbatch_sub -q 30min -R fsgi02 $OFF781_BIN/soap.exe -f example.cmd
Enter AFS password:
fbatch_sub executing LSF command locally on fsgi02....
Job <29450> is submitted to queue <30min>.

The -q parameter designates the queue to which the job is submitted. The -R parameter indicates a required resource (in this case, the node on which to run the job). The rest of the line ("$OFF781_BIN/soap.exe -f example.cmd") can be any executable Unix command (remember to chmod +x your scripts), but is typically the same command you currently use to start a background processing job, without the trailing &. The output from the command is mailed back to $USER@fnal.gov along with the batch job completion information. Standard output and standard error can be redirected with the -o and -e parameters, but it's probably easier to make a file containing the command line with all its redirections and submit that file as the batch command. For best results, write your scripts for the same shell type as your login shell (bash...).
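The wrapper-file approach might look like the sketch below. The file names run_soap.sh, soap.out and soap.err are hypothetical; the soap command line and the fbatch_sub options are the ones from the example above. The /bin/sh shebang is only a placeholder, so change it to match your login shell. The submission is guarded so that the snippet only prints a reminder on a machine where fbatch has not been set up.

```shell
#!/bin/sh
# Sketch: put the command line and its redirections into a wrapper script.
# run_soap.sh, soap.out and soap.err are hypothetical names for illustration.
cat > run_soap.sh <<'EOF'
#!/bin/sh
# Run soap on the sample command file; keep stdout and stderr in separate files.
"$OFF781_BIN/soap.exe" -f example.cmd > soap.out 2> soap.err
EOF

# The wrapper must be executable before it can be submitted.
chmod +x run_soap.sh

# Submit only if fbatch_sub is on the PATH (i.e. after "setup fbatch").
if command -v fbatch_sub >/dev/null 2>&1; then
    fbatch_sub -q 30min -R fsgi02 ./run_soap.sh
else
    echo "fbatch_sub not found; run 'setup fbatch' first"
fi
```

With the redirections inside the wrapper, the mail sent at job completion should carry little more than the completion information, and the soap output stays in the directory the job ran in.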

To view the status of your batch jobs:

 fbatch_jobs -u $USER
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
29450 dane     RUN   30min      fsgi02      fsgi02      *ap.exe -f May  2 22:06
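If you want to check one job from a script, a small filter over this listing may be enough. The sketch below assumes the column layout shown above (job ID in column 1, STAT in column 3); the function name job_stat is hypothetical, and the sample text is copied from the listing above so the snippet can be tried without a batch system.

```shell
# job_stat JOBID: read fbatch_jobs-style output on stdin and print the
# STAT column for the given job ID. (job_stat is a hypothetical helper.)
job_stat() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $3 }'
}

# Sample listing copied from the output above.
sample_jobs='JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
29450 dane     RUN   30min      fsgi02      fsgi02      *ap.exe -f May  2 22:06'

printf '%s\n' "$sample_jobs" | job_stat 29450   # prints: RUN
```

On a node where fbatch is set up, the same filter would presumably be used as: fbatch_jobs -u $USER | job_stat 29450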

To see what queues are available and how busy they are:

 fbatch_queues
QUEUE_NAME     PRIO NICE     STATUS      MAX  JL/U JL/P NJOBS  PEND  RUN  SUSP
test_queue      99    0   Open:Active      -    -    -     0     0     0     0
e831_long       16   17   Open:Active      1    1    -     0     0     0     0
e831_z          16   17   Open:Active      2    2    -     1     0     1     0
e831_short      14   14   Open:Active      -   10    -     0     0     0     0
30min           10   10   Open:Active      -    5    5     1     1     0     0
30min_disk      10   10   Open:Active      -    5    2     1     0     1     0
4hr              8   14   Open:Active      -    5    -     0     0     0     0
4hr_disk         8   14   Open:Active      -    5    2     4     0     4     0
e781_disk        8   17   Open:Active      -    6    6     0     0     0     0
12hr             6   17   Open:Active      -    5    5     1     0     1     0
12hr_disk        6   17   Open:Active      -    5    2     4     1     2     1
1day             4   17   Open:Active      -    5    5     4     0     4     0
1day_disk        4   17   Open:Active      -    5    1     1     0     1     0
4day             2   17   Open:Active      -    5    5     5     0     5     0

The xxx_disk queues run on the old CLUBS nodes and have some complications for SELEX users; they are not recommended for casual users.

The e781_disk queue is reserved for stripping jobs at this time. It will likely be opened up to the collaboration after this next stripping pass.
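To see at a glance which queues already have jobs waiting, the fbatch_queues listing can be filtered on its PEND column. This is a sketch that assumes the eleven-column layout shown above (queue name in column 1, PEND in column 9); the function name pending_queues is hypothetical, and the sample rows are copied from the listing above.

```shell
# pending_queues: read fbatch_queues-style output on stdin and print the
# name and PEND count of each queue with at least one job waiting.
# (pending_queues is a hypothetical helper name.)
pending_queues() {
    awk 'NR > 1 && $9 + 0 > 0 { print $1, $9 }'
}

# A few rows copied from the listing above.
sample_queues='QUEUE_NAME     PRIO NICE     STATUS      MAX  JL/U JL/P NJOBS  PEND  RUN  SUSP
30min           10   10   Open:Active      -    5    5     1     1     0     0
4hr_disk         8   14   Open:Active      -    5    2     4     0     4     0
12hr_disk        6   17   Open:Active      -    5    2     4     1     2     1'

printf '%s\n' "$sample_queues" | pending_queues
# prints:
#   30min 1
#   12hr_disk 1
```

On a node where fbatch is set up, the live listing would presumably be filtered with: fbatch_queues | pending_queues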

Let me know if you have problems, or contact the helpdesk.

Last update 12 May 1997 by Dane Skow

Running Batch Jobs

Reasons for a Batch System

To preserve a usable interactive working environment on fsgi02 and other fnalu nodes, the Computing Division now requires all jobs using more than 30 minutes of CPU time to be run in the batch system. See the Computing Division statement.

Using the Batch System

There are a variety of considerations when running batch on the FNALU cluster, and the documentation has been broken down to address the various levels of complexity.


dane@fnal.gov