Online Computing at SELEX(E781)

Peter S. Cooper, Jurgen Engelfried, Michael Procario

May 2, 1995

Abstract

This note is an attempt to outline the requirements for the online computing hardware necessary to filter the data, monitor and control the experiment and provide reliable operation with high efficiency during data taking.

Introduction

SELEX is a high statistics charmed baryon hadronic production and decay experiment. It incorporates an online software filter to reduce the number of charm candidates written to tape by a factor of ~20. This filtering technique is very powerful in terms of shortening the subsequent analysis time required to produce Physics results from the experiment. It should also allow charm signals to be observed while the data is being taken in a "near-online" analysis mode where data is analyzed in the period of several days or weeks after it is taken. This capability will allow us to optimize the charm yield of both the electronic trigger and software filter to improve the sensitivity of the experiment. In order to make this technique work and to fully exploit its potential, SELEX requires substantial online computing resources during the data taking phase of the experiment.

Online computing, in the context of this note, means the general purpose computing resources needed for experimental control, filtering and monitoring located at the experiment. The other computing resources required for the experiment are the DART data acquisition system, up to but not including the processors used for filtering, and the processing capabilities provided in the Feynman Computing Center. This breakdown is illustrated in Figure 1.

  DART               ONLINE                    Computing Center
DART Lab
--------  Data Path ------------      Network       -----------
| DART |=========>| Filter   |<==================>|   FCC   |
--------          |Processors|                    |Computers|
   |              | (FN781A) |-->Data Tape Drives -----------
   |--------------|          |-->Spooling Disk
   |   ethernet   ------------
   |
   |              ------------
   |--------------| Control  |
   |              | (FN781B) |
   |              ------------
   |
   |              ------------
   ---------------| Monitor  |-->Shared Disk & Tape
                  |  (new)   |
                  ------------

Figure 1 - Division of all SELEX Computing Resources

The present state of the complete SELEX DAQ/computing system is as follows. The DART data acquisition system is well developed and is essentially a defined system as seen by the filter, control and monitoring processes. The Computing Center is a mature set of resources. There is a major upgrade of network capability in progress which will substantially increase the available bandwidth between SELEX and the Computing Center. The SELEX plan for use of Computing Center resources during data taking is described in Reference 1.

The present Online computing resources consist of an SGI Challenge L computer with two 150 MHz processors (FN781A), which has been installed for 18 months as the prototype of the Filter Processors. All DART and experiment filter software has been designed and tested on this platform only (Ref 2). The plan has been to expand this machine to a full complement of disk, tape and CPUs to meet the Filter Processor requirements. A dedicated SGI Indy workstation (FN781B) was installed 9 months ago to play the role of the control machine. At present all other functions, including those indicated as Monitor in Figure 1, are also done on the 'filter' machine (FN781A). FN781A was installed with 10 Gb of disk, a tape drive for backup and 20 X terminals.

Requirements

In the following section we state and then motivate the general requirements for the experiment's online computing resources (the central column of Figure 1 above).

R1. The Filter Processors are based upon SGI Challenge computing technology.
R2. Data logging will be done through the Filter machine using disk spooling (see Ref 3).
R3. No user processes will be allowed on the Filter or Control processors.
R4. In consequence of R3 there must be at least one more machine (labeled Monitor in Figure 1) to provide file service, a place to run monitoring jobs and user computing.
R5. Failover - No component failure will be allowed to cause more than one shift of downtime to the experiment.
R6. Filter Development - There must be an identical single processor with direct access to the online filebase for development, debugging and timing of the filter process. These activities may not interfere with normal data taking.
R7. Extensibility - It must be possible to expand any of the crucial system parameters (CPU power, memory, storage, etc.) by a factor of at least two without reaching an architectural limit.
R8. Schedule - All Online Computing components must be installed in the experiment six months before the beam is scheduled to arrive (October 15, 1995 by the present schedule).

The first four requirements (R1-4) are imposed by prior decisions and acquisitions. We have two years of effort and one quarter of the budgeted funds invested in the SELEX DAQ prototype system described above. This work has, of necessity, caused us to make certain technical choices. These choices imply requirements on the balance of the system in order to reach a full configuration which can achieve SELEX's experimental goals.

The rationale for these requirements is based in the degree to which SELEX relies upon the methodology of online filtering in software. To a large extent the experiment is designed around this technique. Without a filter up and running the experiment is dead in the water. The total investment in the experiment (10M$) divided by the number of shifts of data taking (1000 shifts) gives a value of 10K$ per shift.
This calculation sets a scale for the incremental cost of downtime. We note that CDF just this past weekend reported three days of downtime for one of their SGI Challenge servers (their online filter machines). Such an outage at SELEX, without some failover capability, would have us completely down for those three days. This is completely unacceptable and is the direct motivation for R5.

Filtering, even after the decade of experience that one of us (PSC) has had with it, is not an exact science. The optimization of this class of algorithms under the simultaneous constraints of efficiency, rejection, speed and Physics scope is an ongoing process during much of the life of the experiment. The pace at which the quality of the filter process can be improved is directly related to the quality of the development tools used and the similarity between the development testbed and the actual filter processor environment. Typically these codes use >95% of the available CPU resources. Put the other way, the compromises and tradeoffs associated with optimization end only when you stop running out of CPU. This means that measurements of speed and performance at the ~1% level are necessary. These ideas form the basis of R6.

Our present estimate of the CPU resources required to filter SELEX data at the proposed trigger rate is 1700 MIPs. This is an old number based upon data from our test beam run in 1991 (Ref 4). We believe that this number is high, but a reasonable estimate of its error is 30%. We are presently within a month or two of having our GEANT simulation and filter codes integrated into the first version of a real SELEX filter. This process will give a better estimate of our real CPU requirements. GEANT, however, does not produce real data. We can and will go through a level of optimization of these algorithms using simulation data. Ultimately only real interactions in the SELEX detector will define our real requirements.
It is possible that we have underestimated the difficulty of our problem or that the Physics reach of the experiment could be significantly enhanced by an expansion of filter computing resources. We acknowledge that the possibility of such expansion does not imply the approval or funding to do so. In any case it would be irresponsible to design a system which could not be expanded. This is the basis of R7.

R8 is a SELEX policy for all apparatus in the experiment.

Implementation Plan

We propose to complete the online computing plant for SELEX with the addition of a single machine. This should be a fully configured SGI Challenge L with 12 CPUs, 128 Mb of main memory, 18 Gb of spooling disk and an appropriate number of "recommended" tape drives (5 if they are Exabyte 8505s). Our CPU benchmark numbers for an SGI R4400 processor are 108 MIPs at 150 MHz clock rate and 145 MIPs at 200 MHz clock rate (Ref 5). With 12 CPUs this system would have roughly 1300 or 1700 MIPs, depending upon which clock rate processor we acquire. This is a reasonable match to our CPU requirement estimate.

This plan meets R5 (Failover) by having two of all critical components installed and working in the experiment at all times. The worst single point failure, and the one from which it is hardest to recover, is the failure of the computer backplane and power supply box. In that case we would shift the CPU boards and SCSI buses with the peripherals to the other processor box. Such a changeover would likely take a shift to get working properly, assuming that we had planned and tried this kind of swap beforehand. It would also require us to share the filter machine with the monitoring process and some limited use by users. This temporary mode of operation would give us >75% of our normal capability until SGI could get in and fix the problem. If one of the three processor boards in the filter machine died we would run with 2/3 of the CPU until we could have it replaced.
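
The capacity arithmetic behind this plan can be checked with a short sketch. It uses only figures quoted in this note (1700 MIPs required with a 30% error, 108 and 145 MIPs per CPU, 12 CPUs, three processor boards). The function names are hypothetical, the 30% error is treated as a symmetric band, capacity is assumed to scale linearly with CPU count, and the CPUs are assumed to be spread evenly over the three boards. None of these assumptions comes from a measurement.

```python
# Figures quoted in this note (Refs 4 and 5). Function and variable names
# are hypothetical and exist only for this illustration.
REQUIRED_MIPS = 1700                  # estimated filter CPU requirement
ERROR_PERCENT = 30                    # quoted error, assumed symmetric here
MIPS_PER_CPU = {150: 108, 200: 145}   # R4400 benchmark: clock (MHz) -> MIPs
N_CPUS = 12                           # proposed Challenge L configuration
N_BOARDS = 3                          # processor boards, assumed equally loaded


def total_mips(clock_mhz, n_cpus=N_CPUS):
    """Aggregate MIPs, assuming capacity scales linearly with CPU count."""
    return n_cpus * MIPS_PER_CPU[clock_mhz]


def requirement_band(required=REQUIRED_MIPS, error_percent=ERROR_PERCENT):
    """Lower and upper bounds of the requirement estimate."""
    low = required * (100 - error_percent) / 100
    high = required * (100 + error_percent) / 100
    return low, high


def capacity_report():
    """One row per clock rate: full capacity and capacity with one board down."""
    low, high = requirement_band()
    rows = []
    for clock_mhz in sorted(MIPS_PER_CPU):
        full = total_mips(clock_mhz)
        one_board_down = full * (N_BOARDS - 1) / N_BOARDS
        rows.append({
            "clock_mhz": clock_mhz,
            "full_mips": full,
            "one_board_down_mips": one_board_down,
            "full_within_band": low <= full <= high,
        })
    return rows


if __name__ == "__main__":
    print("requirement band (MIPs):", requirement_band())
    for row in capacity_report():
        print(row)
```

Under these assumptions the full machine delivers 1296 MIPs at 150 MHz or 1740 MIPs at 200 MHz, both inside the 1190 to 2210 MIPs band, and the loss of one processor board leaves 864 or 1160 MIPs respectively.
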
For all other system components (disks, tape drives, VME interfaces, etc.) there are spares available on site.

R6 (Filter Development) is trivial with this implementation; we have two identical machines sharing the same filebase.

We can expand the computing resources (R7 Extensibility) by adding processor boards to the 'monitoring' machine. It would require a second VME interface and perhaps another spooling disk/tape subsystem. This solution would force us to compromise R3 (no users on the filter machine).

Our present understanding is that SGI is in the process of desupporting the 150 MHz processors presently on FN781A. We are told that the new 200 MHz processors require an upgrade to the backplane which makes the operation of systems with mixed 150 MHz and 200 MHz processors impossible. This or any other hardware or software incompatibility between the filter machine and FN781A requires that FN781A be upgraded to match the filter machine in order to maintain interchangeability.

We know of no reason that a system of the type we propose here cannot be acquired and installed by 10/15/95 as required by R8.

Alternatives

The only clear way we can see to meet these requirements without the level of vendor and hardware specificity proposed here is to transfer an existing Fermilab SGI Challenge class machine to SELEX and replace the transferred machine. Since there are such machines on site which will not be associated with running experiments, the requirements for replacing one of those machines are likely much less stringent than ours.

References

(All references are available on the SELEX WWW Server)

1. Offline Computing at SELEX(E781) During Data Taking, Peter S. Cooper, January 12, 1995
2. Test Results on a Ten Processor Challenge L, Jurgen Engelfried, July 6, 1994
3. Data Logging in E781, Peter Cooper, December 27, 1994
4. Timing Tests of the E781 Impact Parameter Trigger Algorithm on RISC Workstations, Mike Procario, Jim Russ, March 3, 1994, H-665
5. Benchmarks for different Computer Systems, Jurgen Engelfried, February 26, 1995