Data Logging in E781
Peter Cooper
December 27, 1994
Abstract
In light of the decision we have recently taken to ship
all data to the filter farm and log data to tape from there,
I have rethought what our plans for data logging should be.
I have come to the conclusion that we should log to disk
rather than directly to tape.
Our original data logging plan was to have a set of 8mm tape drives
controlled by a Rimfire VME scatter-gather tape drive controller in the VME
crate with the memories which hold the data from the previous spill. This was
to be driven by a process in the MVME167 which would determine from the filter
processes which charm events were to be taped, build the scatter-gather list
for a complete tape record of events and pass it to the Rimfire for taping.
We also planned for a second output stream for the Primakoff and "Other" data.
These events would be tagged by trigger bits. The MVME167 would build a
second scatter-gather list in order to write records of these events to a
second set of tape drives. We would write files of about 200 Mb in size with
typically 5-10 such files in each output stream for each run.
Now all of these functions have to happen in logging processes in the
filter farm. In principle this is all perfectly straightforward. Neither
the I/O to move the volume of data nor the CPU required for event and record
building is significant. Having these activities done on a real computer
system as opposed to a VME computing "island" opens possibilities for very
significant gains for the experiment.
The model of operation I am proposing is to do everything in the filter
farm exactly as planned but write the resulting records to disk files in a
set of "spooling" disks rather than directly to tape. A set of one or more
"unspooling" processes running on the filter farm would look for newly closed
data files in the spooling area and copy them to the data tapes. The spooling
areas would be kept mainly full. When a logging process required space for a
new file it would look for the oldest file in the spooling area that had
already been copied to tape and was not in use, and delete it to make
sufficient space.
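
The space-reclaiming rule described above can be sketched as follows. This is
only an illustration under stated assumptions, not part of the proposed
design: the SpoolFile record, its field names and the make_space function are
hypothetical, and "logged" is taken to mean "already copied to tape".

```python
from dataclasses import dataclass


@dataclass
class SpoolFile:
    name: str
    size_mb: int
    closed_at: float  # time the logging process closed the file
    on_tape: bool     # assumed meaning of "logged": already copied to tape
    in_use: bool      # True while some job still has the file open


def make_space(spool, capacity_mb, needed_mb):
    """Delete the oldest taped, unused files until needed_mb fits.

    Removes the chosen files from `spool` (a list of SpoolFile) and returns
    their names, oldest first. Nothing is deleted if enough space cannot be
    freed.
    """
    free = capacity_mb - sum(f.size_mb for f in spool)
    candidates = sorted(
        (f for f in spool if f.on_tape and not f.in_use),
        key=lambda f: f.closed_at,
    )
    victims = []
    for f in candidates:
        if free >= needed_mb:
            break
        victims.append(f)
        free += f.size_mb
    if free < needed_mb:
        raise RuntimeError("spool full: no taped, unused file left to delete")
    for f in victims:
        spool.remove(f)
    return [f.name for f in victims]
```

Files that are not yet on tape, or that a job still holds open, are never
candidates, so a slow unspooler or a hung job can delay deletion but cannot
cause data loss.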
Assume a hardware configuration with a 9 Gb disk and 2-3 8mm tape drives
on each of two separate SCSI busses on the filter farm. This configuration
makes the two output streams independent and provides fail-over in case of a
hardware failure on one of the SCSI busses. The experiment logs data at
about 1 Mb/sec aggregated over both output streams. This means, given the
450-500 Kb/sec bandwidth of an 8mm drive, that one drive on each SCSI bus is in
operation continuously. A second drive will run some small fraction of the
time.
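
The drive-load estimate above can be checked with a few lines of arithmetic.
This sketch assumes the 1 Mb/sec aggregate rate splits evenly between the two
output streams and takes 1 Mb as 1000 Kb; neither assumption is stated in the
plan, and the function name is hypothetical.

```python
def drives_in_use(stream_rate_kb_s, drive_rate_kb_s):
    """Average number of tape drives kept busy on one SCSI bus."""
    return stream_rate_kb_s / drive_rate_kb_s


TOTAL_RATE_KB_S = 1000.0  # about 1 Mb/sec over both output streams
N_BUSES = 2               # one output stream per SCSI bus

# Assumption: the aggregate rate is shared equally by the two streams.
per_bus = TOTAL_RATE_KB_S / N_BUSES

for drive_rate in (450.0, 500.0):
    load = drives_in_use(per_bus, drive_rate)
    print(f"{drive_rate:.0f} Kb/sec drive: {load:.2f} drives busy per bus")
```

Under these assumptions each bus needs between 1.0 and about 1.1 drives on
average, which is the "one drive continuously, a second drive a small fraction
of the time" picture.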
The advantages of this scheme are many. An 18 Gb spooling space allows
the experiment to run for 5 hours without taping (18 Gb at 1 Mb/sec) before
it fills the spooling area. This is long enough to allow broken tape drives
to be replaced without affecting the uptime of the experiment. Copying whole files
from disk to tape is the most efficient and reliable way to run 8mm tape
drives. They are kept streaming at all times. It is even possible to
imagine attempting error recovery on a bad file copy. At the minimum,
failover to another drive is trivial. The bookkeeping for which files are on
which tapes, etc. is likewise very simple.
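
As a rough illustration of why fail-over and bookkeeping stay simple when
whole files are copied, an unspooling step might look like the sketch below.
Everything here is hypothetical: the unspool function, the write_to_tape
callable (assumed to raise OSError when a copy fails) and the catalog
dictionary are illustrative names, not an existing interface.

```python
def unspool(file_name, drives, write_to_tape, catalog):
    """Copy one closed spool file to tape, trying each drive in turn.

    drives        -- list of (drive_id, tape_label) pairs, in preference order
    write_to_tape -- hypothetical callable(file_name, drive_id); assumed to
                     raise OSError if the copy fails
    catalog       -- dict mapping file name to tape label, updated on success

    Returns the id of the drive that took the file.
    """
    for drive_id, tape_label in drives:
        try:
            write_to_tape(file_name, drive_id)
        except OSError:
            # Fail over to the next drive; the file is still safe on disk.
            continue
        catalog[file_name] = tape_label
        return drive_id
    raise RuntimeError(f"no working drive for {file_name}; file stays in spool")
```

Because the source file remains in the spooling area until a copy succeeds, a
failed copy costs only a retry, and the catalog is a single entry per file.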
The most attractive features of this scheme have little to do with
tapes. With all of the data on disk for an average dwell time of ~4 hours,
many things we couldn't do before become easy. For example, all the
experiment monitoring jobs can be run against this file base rather than the
online data stream directly. At our rates the first 200 Mb file of each
output stream is closed after the first 8 spills of the run (8 minutes).
During normal data taking we don't care if the monitoring is phase shifted
by 10 minutes. Assuming that the monitoring jobs are fast we will get high
statistics information faster this way than by distributing a bandwidth
limited fraction of the events over the network to online monitoring
processes. As with the unspooling jobs, the failure of a monitoring job is
completely decoupled from the data taking and experiment uptime. At worst, a
hung job holds one file in the spooling area from being deleted until
somebody cleans it up. Of course, the monitoring jobs must be able to
connect directly to the online data stream for checkout and other online
diagnostics.
This same file base is available for algorithm development, debugging
and apparatus studies. Debugging is particularly enhanced. When a job
crashes on a rogue event that event can be found again in a disk file. In
a true online situation it is probably impossible to get that event back.
It will be possible to run fast jobs at the experiment on samples of data
as large as a whole run without ever having to mount and read or stage a
data tape.
The most important feature, in my view, is that we can now copy ~10%
of our data from the spooling area directly to the Computing Center to be
cached in their disk/robot/tape hierarchical file storage systems. The
bandwidth required for this is an average 100 Kb/sec (1/10 the taping rate).
It would be burst driven just as the tape drives are, going at the maximum
available network bandwidth for the duration of a 200 Mb file transfer. This
is straightforward with the existing network. In general we would copy one
file from each data stream for each run to the Center this way.
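
The network load implied by this can be worked out as below. The 10% fraction,
the 1 Mb/sec taping rate and the 200 Mb file size come from the text; the
500 Kb/sec link rate is purely an assumed figure for illustration, since no
network bandwidth is quoted here, and 1 Mb is taken as 1000 Kb. The function
names are hypothetical.

```python
def average_copy_rate_kb_s(taping_rate_kb_s, fraction):
    """Average network rate needed to ship a fraction of the taped data."""
    return taping_rate_kb_s * fraction


def transfer_seconds(file_mb, link_kb_s):
    """Time to move one file at the full available link rate."""
    return file_mb * 1000.0 / link_kb_s


def link_duty_cycle(average_kb_s, link_kb_s):
    """Fraction of the time the link is busy when transfers run in bursts."""
    return average_kb_s / link_kb_s


ASSUMED_LINK_KB_S = 500.0  # assumption for illustration only

avg = average_copy_rate_kb_s(1000.0, 0.10)
print(f"average rate: {avg:.0f} Kb/sec")
print(f"one 200 Mb file: {transfer_seconds(200, ASSUMED_LINK_KB_S):.0f} sec")
print(f"link busy: {link_duty_cycle(avg, ASSUMED_LINK_KB_S):.0%} of the time")
```

The point of the burst model is that the link is idle most of the time, so a
slow or interrupted transfer can simply be retried later from the spooling
area.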
With a large subset of our data available in the Computing Center we are
in a position to carry out very significant analyses on those data during the
run. We should never have to fight with 8mm tapes and all their associated
headaches. The Center's caching systems will do that for us. We have also
positioned ourselves to have the data "pre-staged" for the post-run algorithm
and constant development phase of the analysis. With luck and planning we
should never have to read more than a trivial number of our data tapes until
we are ready for the PASS1 analysis on the farms in the Computing Center.
We can imagine all sorts of ways to use this capability. Perhaps there
should be a class of data files, like prescaled event files, which are cached
to the Center and never taped directly. This is a way to implement rapid
turnaround trigger studies without having to hassle with tape. This scheme,
in effect, puts the Computing Center online in the experiment in a
fault-tolerant way. If the network link goes down or gets overloaded our data
doesn't get to the Center, but it still is available at the experiment and
on tape. Most of the time the network will be fine and we can be running
batch jobs in the Center on systems like FNALU and CLUBS against very large
subsets of our data.
The disadvantages of this scheme are few and minor. The extra I/O and
CPU loads on the filter farm to do 3 I/O's per byte of data rather than one
are estimated as <5% of the CPU and negligible in terms of I/O bandwidth.
The unspooling jobs are an extra set of software components to be written
and maintained by us. The hardware cost is slightly higher. I estimated we
would need 7 tape drives plus a Rimfire controller for the original plan with
taping from VME. Each of these is about 2K$ for a total hardware cost of
16K$. Two 9 Gb disks at 4K$ each and five 8mm drives cost about 18K$. This
difference is insignificant. We get away with fewer drives because unspooling
is offline so rewinds, failures and other interruptions (like not having the
next tape mounted and ready to go) cost us no time and therefore can be
tolerated with much higher frequency.
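
The hardware cost comparison above reduces to the following arithmetic. The
unit prices and counts are the approximate figures quoted in the text; the
table and function names are hypothetical.

```python
# Approximate unit costs in K$, as quoted in the text.
UNIT_COST_K = {"8mm drive": 2, "Rimfire controller": 2, "9 Gb disk": 4}


def total_cost_k(items):
    """Total hardware cost in K$ for a dict of item name -> count."""
    return sum(UNIT_COST_K[name] * count for name, count in items.items())


vme_plan = {"8mm drive": 7, "Rimfire controller": 1}
spool_plan = {"9 Gb disk": 2, "8mm drive": 5}

print("taping from VME:", total_cost_k(vme_plan), "K$")
print("spooling to disk:", total_cost_k(spool_plan), "K$")
```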
There are significant impacts on the Computing Center and smaller impacts
on DART if we choose to adopt this plan. I recommend we discuss these ideas
and take this decision relatively soon so that the Computing Division will have
ample time to respond to our needs.