March 31, 2001
One way to avoid or minimize the influence of statistical fluctuations is not to use the data that determine the structure of the analysis in the actual analysis itself. One would divide the data into a set used to determine the analysis parameters and techniques and a set actually used for the measurements. The first sample is very similar to the "training sample" that is always needed for neural network analyses.
While actively dividing samples in this manner is not usually done in experiments, this technique is often used passively. Experiments will analyze the first portion of their data, present their results, and then use the remaining (usually bigger) sample to see if the results are confirmed. Examples of this type of processing are KTeV, which initially analyzed 1/2 of their 1997 run or 1/4 of their total data set; SNO, which claims to still have a "statistically significant" data set left to be analyzed; and the g-2 Experiment, which has about 3/4 of their data still in the can. In all of these experiments, the data were divided chronologically. Thus changes in the data with time would have to be tracked with changes in the analysis programs. A better approach (less susceptible to unconscious bias) would be to divide the data in such a manner that all types of data from all running conditions are present in the initial "training" sample as well as in the final analysis sample.
In an ideal case, the sample used to tune the analysis methods would not be used in the final measurement. If the first sample is small enough, it can be thrown away without affecting the statistical significance of the final result very much. However, if this is not possible, then including the "training" sample in the final result would allow the biases to enter the final result, but only at the significance of $(N_{\mathrm{Training}}/N_{\mathrm{Total}})^{1/2}$, where $N_{\mathrm{Training}}$ is the size of the initial sample used for tuning the analyses and $N_{\mathrm{Total}}$ is the size of the total sample.
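As a rough numerical illustration of the estimate above, the following sketch evaluates the factor (N_Training/N_Total)^(1/2) for a few tuning-sample sizes. The function name and the event counts are hypothetical and chosen only for illustration.

```python
import math


def bias_weight(n_training: int, n_total: int) -> float:
    """Return sqrt(n_training / n_total), the factor quoted in the text."""
    if n_total <= 0 or not 0 <= n_training <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_training <= n_total")
    return math.sqrt(n_training / n_total)


if __name__ == "__main__":
    n_total = 8000  # hypothetical total number of events
    for n_training in (1000, 2000, 4000):
        # Print the tuning-sample size and the corresponding factor.
        print(n_training, round(bias_weight(n_training, n_total), 3))
```

For a tuning sample of one eighth of the data the factor is about 0.35, and for one quarter it is 0.5, so a small tuning sample that is kept in the final result carries its biases in at a reduced, but not negligible, weight.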
It should be noted that this guard against statistical fluctuation biases can be combined with any other "blindness" scheme. If we were to hold secret some features of the final result until we were satisfied with our analysis, and then reveal these features to calculate the final answer, we could test this procedure on a portion of the data before applying it to the entire data set. This would allow us to "open the box" for a small portion of the data and see whether the results revealed there had any special properties that we did not anticipate. If they did, we could refine our analysis and still have a statistically independent sample to analyze again.
I propose that we divide our data into eight equally sized samples. Each sample should have events uniformly distributed throughout the run. We begin by analyzing the first sample (1/8th) of the data. When we are satisfied with that analysis, we double the sample size by including the original group and another group. The third step would be to double the size again to include 4 sets, and our final analysis would double the size again to include all 8 sets.
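The doubling schedule just described can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, and it assumes the sets are opened in numerical order, which the proposal does not specify.

```python
from __future__ import annotations

N_SETS = 8  # number of equally sized samples proposed in the text


def released_sets(stage: int) -> set[int]:
    """Return the set numbers open for analysis at a given stage.

    Stage 0 opens 1 set, stage 1 opens 2, stage 2 opens 4, stage 3 opens all 8.
    Assumption: sets are opened in numerical order (1, then 2, then 3-4, ...).
    """
    max_stage = N_SETS.bit_length() - 1  # 3 for eight sets
    if not 0 <= stage <= max_stage:
        raise ValueError(f"stage must be between 0 and {max_stage}")
    return set(range(1, 2 ** stage + 1))
```

Each stage contains all of the sets from the previous stage, so only half of the data at any step is statistically independent of the data already used for tuning.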
If we are able to categorize the events into event types, we may wish to allow the analyses of the different event types to progress at different speeds. For example, we may decide as a collaboration that we can easily recognize Michel electrons. We could then allow people to access all of the Michel electron events from data sets 1-4 while still allowing only data from set 1 to be released for other event types. Similarly, we may decide that cosmic ray triggers from sets 1-3 could be released, but that all of the light flasher records and all of the cosmic rays which hit a cube could be released.
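One way to picture the example above is as a release table that maps each event type to the set numbers currently open. The sketch below is hypothetical throughout: the type labels are illustrative names, and it reads "all of the light flasher records" as meaning all eight sets.

```python
# Hypothetical release table: event type -> set numbers currently open.
# The type names are illustrative labels, not identifiers from any real code.
ALL_SETS = frozenset(range(1, 9))

release_table = {
    "michel_electron": frozenset({1, 2, 3, 4}),
    "cosmic_ray_trigger": frozenset({1, 2, 3}),
    "light_flasher": ALL_SETS,        # assumed: every set is open
    "cosmic_ray_cube_hit": ALL_SETS,  # assumed: every set is open
}

# Every event type not listed above is restricted to set 1.
DEFAULT_RELEASE = frozenset({1})


def is_released(event_type: str, set_number: int) -> bool:
    """Return True if events of this type from this set may be served."""
    return set_number in release_table.get(event_type, DEFAULT_RELEASE)
```

With this table, `is_released("michel_electron", 4)` is true while `is_released("michel_electron", 5)` is false, and any unlisted event type is served only from set 1.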
Finally, we all recognize that the first month or two of running will be a very intense period for both debugging the detector and for developing the analysis programs. Unfortunately, this is also the time when we have very few events. I would propose that the data accumulated during this period be freely available, but that these data not be included in the final analysis.
As we presently envision our system, data is collected in the detector hall and shipped to the Fermilab cluster. In the cluster, an "event server" supplies data on request to analysis programs. The procedure for assuring restricted access to the data would be implemented in our data server.
All events are made available to the "on-line" monitoring functions and event displays, i.e. to the computers in the detector hall. However, these computers will only be allowed to do a restricted analysis on the events. Only "detector parameters" are monitored on-line (e.g. phototube pulse heights and timings, spill timings, trigger type frequencies, etc.); no physics quantities such as the number of electron-type events will be calculated. As the data are written to the data stream, a random number from 1 to 8 is assigned to each event. Once the data are written into the data stream, our data server would only allow users to retrieve a certain subset of the data as determined by the random numbers. The server would contain routines that recognize different event types and would only serve an event of a given type if its random number was in the allowable set for that type. The allowable set numbers would be changed only when the collaboration agreed that it was time to change them. Data written to the event stream during the first month or so would be labelled type "0" and would always be available.
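The tagging and serving steps described above could look roughly like the following. This is a minimal sketch under stated assumptions, not a design for the actual event server: the class, function, and field names are all hypothetical, events are held in memory, and the event type is assumed to be already known rather than produced by the server's recognition routines.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

COMMISSIONING_SET = 0  # label for data from the first month or so; always served


@dataclass(frozen=True)
class Event:
    event_id: int
    event_type: str  # assumed to be supplied by a (hypothetical) classifier
    set_number: int  # 0 for commissioning data, otherwise 1..8


def tag_event(event_id, event_type, rng, commissioning=False) -> Event:
    """Assign the set number once, when the event is written to the stream."""
    set_number = COMMISSIONING_SET if commissioning else rng.randint(1, 8)
    return Event(event_id, event_type, set_number)


class EventServer:
    """Serves only events whose set number is open for their event type."""

    def __init__(self, events, allowed):
        self._events = list(events)
        self._allowed = {k: frozenset(v) for k, v in allowed.items()}

    def set_allowed(self, event_type, set_numbers):
        """To be called only after the collaboration agrees to open more data."""
        self._allowed[event_type] = frozenset(set_numbers)

    def serve(self, event_type):
        """Return the events of this type that users may currently retrieve."""
        allowed = self._allowed.get(event_type, frozenset())
        return [
            e for e in self._events
            if e.event_type == event_type
            and (e.set_number == COMMISSIONING_SET or e.set_number in allowed)
        ]
```

The set number is stored with the event rather than recomputed on request, so an event stays in the same set for the life of the experiment. Random assignment also gives sets that are only approximately equal in size, each with events spread throughout the run.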