Beginning the week of April 28, long-running jobs run on fnalu outside of the batch queue have been reniced to -19, and a message sent to their owners, once their accumulated CPU time exceeded 30 minutes. We are monitoring the effect of this on machine performance, but indications are that it will be insufficient to free up machine resources under heavy load. The next step will be to kill long-running background/interactive jobs run outside of the batch system. This will be implemented first on FDEI01, no sooner than 19 May, and the following week on FSGI02 and FSUI02. Please use the interim period to determine whether your processing jobs will run under the fbatch system.
Documentation for the fbatch commands (and the underlying LSF) is available through man and on the Computing Division Web pages. I recommend that average SELEX users restrict themselves to fsgi02 for the time being (due to our heavy use of local disk there).
Here is a sample soap job that I just ran as batch on fsgi02.
I set up the following products:
    setup -d off781
    setup fbatch
I created a sample soap command file by appending the following lines to the file $OFF781_EXA/filter.cmd (note: do NOT use $OFF781_EXA/example.cmd -- it is broken)
    disk in /usr/e781/data01/standard/charm_run008915_001
    anal 1
    exit
Then I could submit a job in the following manner:
    fbatch_sub -q 30min -R fsgi02 $OFF781_BIN/soap.exe -f example.cmd
    Enter AFS password:
    fbatch_sub executing LSF command locally on fsgi02....
    Job <29450> is submitted to queue <30min>.
The -q parameter designates the queue to which the job is submitted. The -R parameter indicates a required resource (in this case, the node on which to run the job). The rest of the line ("$OFF781_BIN/soap.exe -f example.cmd") can be any executable Unix command (remember to chmod +x your scripts), but is typically the command you currently issue to start a background processing job. (No & is needed.)
The output from the command is mailed back to $USER@fnal.gov along with the batch job completion information. Standard output and standard error can be redirected with the -o and -e parameters, but it's probably easier to make a file containing the command line with all the redirections and issue that as the batch command. For best results, write your scripts for the same shell type as your login shell (bash...).
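As a rough sketch of the "file containing the command line" approach, something like the following should work. The names run_soap.sh, soap.out and soap.err are only illustrations, not anything fbatch requires, and the guard around fbatch_sub is there only so the sketch does nothing on a machine without the batch system.

```shell
#!/bin/sh
# Write a small wrapper script (hypothetical name: run_soap.sh) holding the
# full command line, including the output redirections.
# Adjust the interpreter line to match your own login shell type.
cat > run_soap.sh <<'EOF'
#!/bin/sh
# Standard output and standard error go to files (hypothetical names).
"$OFF781_BIN/soap.exe" -f example.cmd > soap.out 2> soap.err
EOF

# The wrapper must be executable before it can be used as the batch command.
chmod +x run_soap.sh

# Submit the wrapper with the same -q and -R options as the sample job.
if command -v fbatch_sub >/dev/null 2>&1; then
    fbatch_sub -q 30min -R fsgi02 ./run_soap.sh
else
    echo "fbatch_sub not found; would run: fbatch_sub -q 30min -R fsgi02 ./run_soap.sh"
fi
```

With the redirections inside the wrapper, the soap output lands in the named files rather than relying on the -o and -e parameters.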
To view the status of your batch jobs:
    fbatch_jobs -u $USER
    JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
    29450 dane RUN  30min fsgi02    fsgi02    *ap.exe -f May  2 22:06
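If you want to check a job's state from a script, a minimal sketch is to pick the STAT column out of the listing with awk. The helper name job_status is hypothetical, and the sketch assumes the column layout of the sample listing above (job ID first, STAT third); it runs against that sample text so that it works without the batch system.

```shell
#!/bin/sh
# job_status JOBID (hypothetical helper): read fbatch_jobs-style output on
# stdin and print the STAT column (third field) for the given job ID.
job_status() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $3 }'
}

# Sample text copied from the listing above. On fnalu one would instead pipe
# the real command:  fbatch_jobs -u "$USER" | job_status 29450
sample='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
29450 dane RUN 30min fsgi02 fsgi02 *ap.exe -f May 2 22:06'

status=$(printf '%s\n' "$sample" | job_status 29450)
echo "job 29450: $status"
```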
To see what queues are available and how busy they are:
    fbatch_queues
    QUEUE_NAME  PRIO NICE STATUS      MAX JL/U JL/P NJOBS PEND RUN SUSP
    test_queue    99    0 Open:Active   -    -    -     0    0   0    0
    e831_long     16   17 Open:Active   1    1    -     0    0   0    0
    e831_z        16   17 Open:Active   2    2    -     1    0   1    0
    e831_short    14   14 Open:Active   -   10    -     0    0   0    0
    30min         10   10 Open:Active   -    5    5     1    1   0    0
    30min_disk    10   10 Open:Active   -    5    2     1    0   1    0
    4hr            8   14 Open:Active   -    5    -     0    0   0    0
    4hr_disk       8   14 Open:Active   -    5    2     4    0   4    0
    e781_disk      8   17 Open:Active   -    6    6     0    0   0    0
    12hr           6   17 Open:Active   -    5    5     1    0   1    0
    12hr_disk      6   17 Open:Active   -    5    2     4    1   2    1
    1day           4   17 Open:Active   -    5    5     4    0   4    0
    1day_disk      4   17 Open:Active   -    5    1     1    0   1    0
    4day           2   17 Open:Active   -    5    5     5    0   5    0
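To pick out queues with nothing waiting, one option is to filter the queue listing on the PEND column. This is only a sketch: the helper name idle_queues is hypothetical, and it assumes the eleven-column layout shown above (STATUS fourth, PEND ninth). Three rows from the listing are used as sample input so that it runs anywhere.

```shell
#!/bin/sh
# idle_queues (hypothetical helper): read fbatch_queues-style output on stdin
# and print the names of Open:Active queues with no pending jobs.
idle_queues() {
    awk 'NR > 1 && $4 == "Open:Active" && $9 == 0 { print $1 }'
}

# Three rows copied from the listing above. On fnalu one would instead run:
#   fbatch_queues | idle_queues
sample='QUEUE_NAME PRIO NICE STATUS MAX JL/U JL/P NJOBS PEND RUN SUSP
30min 10 10 Open:Active - 5 5 1 1 0 0
4hr 8 14 Open:Active - 5 - 0 0 0 0
12hr_disk 6 17 Open:Active - 5 2 4 1 2 1'

idle=$(printf '%s\n' "$sample" | idle_queues)
echo "$idle"
```

An empty PEND count does not by itself mean the queue suits your job; the queue's time limit and the notes on the xxx_disk queues below still apply.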
The xxx_disk queues run on the old CLUBS nodes and have some complications for the SELEX user. Not recommended for casual users.
The e781_disk queue is reserved for stripping jobs at this time. It will likely be opened up to the collaboration after this next stripping pass.
Let me know if you have problems or you can always contact the helpdesk.
Last update 12 May 1997 by Dane Skow
To preserve a usable interactive working environment on fsgi02 and other fnalu nodes, the Computing Division now requires all jobs using more than 30 minutes of CPU time to be run in the batch system. See the Computing Division statement.
There are a variety of considerations when running batch on the FNALU cluster, and the documentation has been broken down to address the various levels of complexity.