Data Logging in E781

Peter Cooper

December 27, 1994

Abstract

In light of the decision we have recently taken to ship all data to the filter farm and log data to tape from there, I have rethought what our plans for data logging should be. I have come to the conclusion that we should log to disk rather than to tape.


Our original data logging plan was to have a set of 8mm tape drives controlled by a Rimfire VME scatter-gather tape drive controller in the VME crate with the memories which hold the data from the previous spill. This was to be driven by a process in the MVME167 which would determine from the filter processes which charm events were to be taped, build the scatter-gather list for a complete tape record of events, and pass it to the Rimfire for taping. We also planned for a second output stream for the Primakoff and "Other" data.

These events would be tagged by trigger bits. The MVME167 would build a second scatter-gather list in order to write records of these events to a second set of tape drives. We would write files of about 200 Mb in size, with typically 5-10 such files in each output stream for each run.

Now all of these functions have to happen in logging processes in the filter farm. In principle this is all perfectly straightforward. Neither the I/O to move the volume of data nor the CPU required for event and record building is significant. Having these activities done on a real computer system, as opposed to a VME computing "island", opens possibilities for very significant gains for the experiment.

The model of operation I am proposing is to do everything in the filter farm exactly as planned, but write the resulting records to disk files in a set of "spooling" disks rather than directly to tape. A set of one or more "unspooling" processes running on the filter farm would look for newly closed data files in the spooling area and copy them to the data tapes. The spooling areas would be kept nearly full. When a logging process required space for a new file, it would look for the oldest logged and unused file in the spooling area and delete it in order to make sufficient space.
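To make the bookkeeping concrete, here is a minimal sketch in Python of the space-reclamation step. The directory name, the ".dat"/".logged" file naming convention, and the 200 Mb threshold are assumptions made for illustration only, not part of any existing DART software.

    import os

    SPOOL_DIR = "/spool/charm"            # hypothetical spooling area
    FILE_SIZE = 200 * 1024 * 1024         # ~200 Mb data file

    def free_bytes(spool_dir):
        st = os.statvfs(spool_dir)
        return st.f_bavail * st.f_frsize

    def oldest_logged_file(spool_dir):
        """Oldest data file already copied to tape, marked here by a
        companion '.logged' file written by the unspooling job."""
        done = []
        for name in os.listdir(spool_dir):
            path = os.path.join(spool_dir, name)
            if name.endswith(".dat") and os.path.exists(path + ".logged"):
                done.append((os.path.getmtime(path), path))
        return min(done)[1] if done else None

    def make_room(spool_dir, needed=FILE_SIZE):
        """Delete the oldest taped-and-unused files until enough space exists."""
        while free_bytes(spool_dir) < needed:
            victim = oldest_logged_file(spool_dir)
            if victim is None:
                break                     # nothing safe to delete yet; wait
            os.remove(victim)
            os.remove(victim + ".logged")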

Assume a hardware configuration with a 9 Gb disk and 2-3 8mm tape drives on each of two separate SCSI busses on the filter farm. This configuration makes the two output streams independent and provides fail-over in case of a hardware failure on one of the SCSI busses. The experiment logs data at about 1 Mb/sec aggregated over both output streams. This means, given the 450-500 Kb/sec bandwidth of an 8mm drive, that one drive on each SCSI bus is in operation continuously. A second drive will run some small fraction of the time.
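A back-of-the-envelope check of the drive duty cycle, assuming for this estimate only that the 1 Mb/sec aggregate splits roughly evenly between the two streams:

    stream_rate = 0.5e6        # bytes/sec into one output stream (assumed even split)
    drive_rate  = 0.45e6       # bytes/sec sustained by one 8mm drive (low end)
    duty = stream_rate / drive_rate
    print(round(duty, 2))      # ~1.11: one drive per SCSI bus runs continuously,
                               # a second runs roughly 10% of the time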

The advantages of this scheme are many. An 18 Gb spooling space allows the experiment to run for 5 hours without taping before it fills the spooling area. This is long enough to allow broken tape drives to be replaced without affecting the uptime of the experiment. Copying whole files from disk to tape is the most efficient and reliable way to run 8mm tape drives; they are kept streaming at all times. It is even possible to imagine attempting error recovery on a bad file copy. At a minimum, failover to another drive is trivial. The bookkeeping for which files are on which tapes, etc., is likewise very simple.
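The 5 hour figure follows directly from the rates already quoted:

    spool_bytes = 18e9         # two 9 Gb spooling disks
    log_rate    = 1e6          # bytes/sec, both streams combined
    print(spool_bytes / log_rate / 3600)   # 5.0 hours before the area fills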

The most attractive features of this scheme have little to do with tapes. With all of the data on disk for an average dwell time of ~4 hours, many things we couldn't do before become easy. For example, all the experiment monitoring jobs can be run against this file base rather than directly against the online data stream. At our rates the first 200 Mb file of each output stream is closed after the first 8 spills of the run (8 minutes). During normal data taking we don't care if the monitoring is phase shifted by 10 minutes. Assuming that the monitoring jobs are fast, we will get high-statistics information faster this way than by distributing a bandwidth-limited fraction of the events over the network to online monitoring processes. As with the unspooling jobs, the failure of a monitoring job is completely decoupled from the data taking and experiment uptime. At worst, a hung job keeps one file in the spooling area from being deleted until somebody cleans it up. Of course, the monitoring jobs must be able to connect directly to the online data stream for checkout and other online diagnostics.

This same file base is available for algorithm development, debugging and apparatus studies. Debugging is particularly enhanced. When a job crashes on a rogue event, that event can be found again in a disk file. In a true online situation it is probably impossible to get that event back. It will be possible to run fast jobs at the experiment on samples of data as large as a whole run without ever having to mount and read or stage a data tape.

The most important feature, in my view, is that we can now copy ~10% of our data from the spooling area directly to the Computing Center to be cached in their disk/robot/tape hierarchical file storage systems. The bandwidth required for this is 100 Kb/sec on average (1/10 the taping rate). It would be burst driven just as the tape drives are, going at the maximum available network bandwidth for the duration of a 200 Mb file transfer. This is straightforward with the existing network. In general we would copy one file from each data stream for each run to the Center this way.
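As an illustration of the last point, a copy job implementing the "one file per stream per run" policy might look like the Python sketch below. The directory layout, the stream names, the destination host and the use of rcp are all assumptions made for concreteness; any bulk copy over the existing network would do.

    import glob, random, subprocess

    SPOOL_DIR = "/spool"                    # hypothetical spooling area
    CENTER    = "center-host:/cache/e781"   # hypothetical Computing Center destination

    def ship_sample(run, streams=("charm", "primakoff")):
        """Copy one randomly chosen closed file per output stream to the Center."""
        for stream in streams:
            files = sorted(glob.glob(
                "%s/%s/run%06d_*.dat" % (SPOOL_DIR, stream, run)))
            if files:
                pick = random.choice(files)           # one ~200 Mb file per stream
                subprocess.run(["rcp", pick, CENTER], check=True)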

With a large subset of our data available in the Computing Center we are in a position to carry out very significant analyses on those data during the run. We should never have to fight with 8mm tapes and all their associated headaches; the Center's caching systems will do that for us. We have also positioned ourselves to have the data "pre-staged" for the post-run algorithm and constants development phase of the analysis. With luck and planning we should never have to read more than a trivial number of our data tapes until we are ready for the PASS1 analysis on the farms in the Computing Center.

We can imagine all sorts of ways to use this capability. Perhaps there should be a class of data files, like prescaled-event files, which are cached to the Center and never taped directly. This is a way to implement rapid-turnaround trigger studies without having to hassle with tape. This scheme, in effect, puts the Computing Center online in the experiment in a fault-tolerant way. If the network link goes down or gets overloaded, our data doesn't get to the Center, but it is still available at the experiment and on tape. Most of the time the network will be fine and we can be running batch jobs in the Center on systems like FNALU and CLUBS against very large subsets of our data.

The disadvantages of this scheme are few and minor. The extra I/O and CPU loads on the filter farm to do 3 I/Os per byte of data rather than one are estimated at <5% of the CPU and negligible in terms of I/O bandwidth. The unspooling jobs are an extra set of software components to be written and maintained by us. The hardware cost is slightly higher. I estimated we would need 7 tape drives plus a Rimfire controller for the original plan with taping from VME. Each of these is about 2K$, for a total hardware cost of 16K$. Two 9 Gb disks at 4K$ each and five 8mm drives cost about 18K$. This difference is insignificant. We get away with fewer drives because unspooling is offline, so rewinds, failures and other interruptions (like not having the next tape mounted and ready to go) cost us no time and therefore can be tolerated at much higher frequency.
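For the record, the arithmetic behind the two cost figures (in K$):

    vme_plan  = 7 * 2 + 1 * 2      # 7 drives + 1 Rimfire controller at ~2K$ each -> 16
    disk_plan = 2 * 4 + 5 * 2      # 2 x 9 Gb disks at 4K$ + 5 x 8mm drives at 2K$ -> 18
    print(vme_plan, disk_plan)     # 16 vs 18: about a 2K$ difference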

There are significant impacts on the Computing Center and smaller impacts on DART if we choose to adopt this plan. I recommend we discuss these ideas and take this decision relatively soon so that the Computing Division will have ample time to respond to our needs.