
Data I/O

Gaudi clearly separates transient data representations in memory from those that persist on disk. The transient representations are described in the previous section. Here the persistency mechanism is described from the point of view of configuring jobs to read and write input/output (I/O) files and how to extend it to new data.

Goal

The goal of the I/O subsystem is to persist, or preserve, the state of the event store memory beyond the lifetime of the job that produced it, and to allow this state to be restored to memory in subsequent jobs.

As a consequence, any algorithm that operates on a particular state of memory should neither depend on, nor even be able to detect, whether that state was restored from persistent files or was generated “on the fly” by other, upstream algorithms.

Another consequence of this is that users should not need to understand much about the file I/O subsystem except basics such as deciding what to name the files. This is described in the section on configuration below. Of course, experts who want to add new data types to the subsystem must learn some things which are described in the section below on adding new data classes.

Features

The I/O subsystem supports these features:

Streams:
Streams are named, time-ordered sequences of data of a particular type. In memory the name is the location in the Transient Event Store (TES) where the data will be accessed. On disk the name is the directory in the ROOT TFile where the TTree that stores the stream of data is located.
Serial Files:

A single stream can be broken up into sequential files. On input, an ordered list of files can be given and they will be navigated in order, transparently. On output, files are closed and new ones opened based on certain criteria.

FIXME This is not yet implemented! But, it is easy to do so, the hooks are there.

Parallel Files:
Different streams from one job need not be stored all together in the same file. Rather, they can be spread among one or more files. The mapping from stream name to file is user configurable (more on this below).
Navigation:
Input streams can be navigated forward, backward, or by random access. The key is the “entry” number, which simply counts the objects in the stream, independent of any potential file breaks. [1]
Policy:
The I/O subsystem allows for various I/O policies to be enforced by specializing some of its classes and through the converter classes.

Packages

The I/O mechanism is provided by the packages in the RootIO area of the repository. The primary package is RootIOSvc which provides the low level Gaudi classes. In particular it provides an event selector for navigating input as well as a conversion service to facilitate converting between transient and persistent representations. It also provides the file and stream manipulation classes and the base classes for the data converters. The concrete converters and persistent data classes are found in packages with a prefix “Per” under RootIO/. There is a one-to-one correspondence between these packages and those in DataModel holding the transient data classes.

The RootIOSvc is generic in the sense that it does not enforce any policy regarding how data is sent through I/O. In order to support Daya Bay’s unique needs there are additional classes in DybSvc/DybIO, in particular DybEvtSelector and DybStorageSvc. The first enforces the policy that “next event” means advancing to the next RegistrationSequence [2] and reading in the objects that it references. The second enforces the same policy for output.

How the I/O Subsystem Works

This section describes how the bits flow from memory to file and back again. Understanding it is not strictly necessary, but it helps in seeing the big picture.

Execution Cycle vs. Event

Daya Bay does not have a well defined concept of “event”. Some physics interactions can lead to overlapping collections of hits, and others can trigger multiple detectors. To simulate this reality correctly, an algorithm must be allowed to produce multiple results in any given run through the chain of algorithms. This run is called a “top level execution cycle”, which in other experiments might simplify to an “event”.

Registration Sequence

In order to record this additional dimension to our data we use a class called RegistrationSequence (RS). There is one RS created for each execution cycle. Each time new data is added to the event store it is also recorded to the current RS along with a unique and monotonically increasing sequence number or index.

The RS also holds flags that can be interpreted later. In particular, it holds a flag saying whether or not any of its data should be saved to file. These flags can be manipulated by algorithms in order to implement a filtering mechanism.

Finally, the RS, like all data in the analysis time window, has a time span, which is set to encompass the time spans of all the data it contains. Thus an RS captures the results of one run through the top-level algorithms.

Writing data out

Data is written out using DybStorageSvc. The service is given an RS and writes it out through the converter for the RS. This conversion also triggers writing out all the data that the RS points to.

When to write out

In principle, one can write a simple algorithm that uses DybStorageSvc and place it at the end of the chain of top-level algorithms [3], forcing data to be written out at the end of each execution cycle. This is adequate for simple analysis, but if one wants to filter out records from the recent past (still in the AES) based on the current record, it will be too late: they will already have been written to file.

Instead, to be completely correct, data must not be written out until every chance to use it (and thus filter it) has been exhausted. This is done by giving the job of using DybStorageSvc to the agent that is responsible for clearing out data from the AES after they have fallen outside the analysis window.

Reading data in

Just as with output, input is controlled by the RS objects. In Gaudi it is the job of the “event selector” to navigate input. When the application says “go to the next event”, the event selector interprets that command. In the Daya Bay software this is done by DybIO/DybEvtSelector, a specialization of the generic RootIOSvc/RootIOEvtSelector. This selector interprets “next event” as “next RegistrationSequence”. Loading the next RS from file into memory triggers loading all the data it references. The TES, and thus the AES, are then back in the state they were in when the RS was written to file in the first place.

Adding New Data Classes

For the I/O subsystem to support new data classes one needs to write a persistent version of the transient class and a converter class that can copy information between the two.

Class Locations and Naming Conventions

The persistent data and converter classes are placed in a package under RootIO/ named with the prefix “Per” plus the name of the corresponding DataModel package. For example:

DataModel/GenEvent/ ↔ RootIO/PerGenEvent/

Likewise, the persistent class names themselves should be formed by adding “Per” to their transient counterparts. For example, GenEvent’s GenVertex transient class has a persistent counterpart in PerGenEvent with the name PerGenVertex.

Finally, one writes a converter for each top-level data class (that is, a subclass of DataObject with a unique class ID number), and the converter’s name is formed by appending “Cnv” to the transient class name. For example, the class that converts between GenHeader and PerGenHeader is called GenHeaderCnv.

The “Per” package should produce both a linker library (holding data classes) and a component library (holding converters). As such, the data class header (.h) files should go in the usual PerXxx/PerXxx/ subdirectory, the implementation (.cc) files in PerXxx/src/lib/, and all converter files in PerXxx/src/components/. See the PerGenHeader package for an example.

Guidelines for Writing Persistent Data Classes

In writing such classes, follow these guidelines which differ from normal best practices:

  • Do not include any methods beyond constructors/destructors.
  • Provide a default constructor (no arguments) as well as one that can set the data members to non-default values.
  • Use public, not private, data members.
  • Name data members with simple but descriptive names. Do not decorate them with “m_”, “f” or other prefixes traditionally used in normal classes.

Steps to Follow

  1. Your header class should inherit from PerHeaderObject; sub-objects should, in general, not inherit from anything special.
  2. Provide a default constructor; it is also convenient to define a constructor that passes in initial values.
  3. Initialize all data members in every constructor.
  4. Add each header file to the dict/headers.h file (the file name must match what is in the requirements file below).
  5. Must add a line in dict/classes.xml for every class and any STL containers or other required instantiated templates of these classes. If the code crashes inside low-level ROOT I/O related “T” classes it is likely because you forgot to declare a class or template in classes.xml.
  6. Run a RootIOTest script to generate trial output.
  7. Read the file with bare root + the load.C script.
  8. Look for ROOT reporting any undefined objects or missing streamers. This indicates missing entries in dict/classes.xml.
  9. Browse the tree using a TBrowser. You should be able to drill down through the data structure. Anything missing, or anything that causes a crash, indicates missing dict/classes.xml entries or an incorrect/incomplete conversion.
  10. Read the file back in using the RootIOTest script.
  11. Check for any crash (search for “Break”) or error in the logs.
  12. Use the diff_out.py script to diff the output and input logs and check for unexplained differences (this may require you to improve fillStream() methods in the DataModel classes).

Difficulties with Persistent Data Classes

Due to limitations in serializing transient objects into persistent ones, care must be taken in how the persistent class is designed. The issues of concern are:

Redundancy:
Avoid storing redundant transient information that is either immaterial or that can be reconstructed by other saved information when the object is read back in.
Referencing:
One cannot directly store pointers to other objects and expect them to be correct when the data is read back in.

The Referencing problem is particularly difficult. Pointers can refer to other objects across different “boundaries” in memory. For example:

  • Pointers to subobjects within the same object.
  • Pointers to objects within the same HeaderObject hierarchy.
  • Pointers to objects in a different HeaderObject hierarchy.
  • Pointers to objects in a different execution cycle.
  • Pointers to isolated objects or to those stored in a collection.

The PerBaseEvent package provides some persistent classes that can assist the converter in resolving references:

PerRef:
Holds a TES/TFile path and an entry number.
PerRefInd:
Same as above, but also holds an array index.

In many cases the transient objects form a hierarchy of references. The best strategy for storing such a structure is to collect all the objects into like-class arrays and then store the relationships as indices into these arrays. The PerGenHeader classes give an example of this in how the hierarchy of vertices and tracks is stored.

Writing Converters

The converter is responsible for copying information between transient and persistent representations. This copy happens in two steps. The first allows the converter to copy information that does not depend on the conversion of other top-level objects. The second lets it fill in anything that requires those other objects to have been copied, such as references.

A converter operates on a top-level DataObject subclass and any subobjects it may contain. In the Daya Bay software, almost all such classes inherit from HeaderObject. The converter needs to directly copy only the data in the subclass of HeaderObject, and can delegate the copying of the parent class to its converter.

The rest of this section walks through writing a converter using the GenHeaderCnv as an example.

Converter Header File

First the header file:

#include "RootIOSvc/RootIOTypedCnv.h"
#include "PerGenEvent/PerGenHeader.h"
#include "Event/GenHeader.h"

class GenHeaderCnv : public RootIOTypedCnv<PerGenHeader,
                                           DayaBay::GenHeader>

The converter inherits from a base class that is templated on the persistent and transient class types. This base class hides away much of the Gaudi machinery. Next, some required Gaudi boilerplate:

public:
    static const CLID& classID() {
        return DayaBay::CLID_GenHeader;
    }

    GenHeaderCnv(ISvcLocator* svc);
    virtual ~GenHeaderCnv();

The transient class ID number is made available and constructors and destructors are defined. Next, the initial copy methods are defined. Note that they take the same types as given in the templated base class.

StatusCode PerToTran(const PerGenHeader& per_obj,
                     DayaBay::GenHeader& tran_obj);

StatusCode TranToPer(const DayaBay::GenHeader& tran_obj,
                     PerGenHeader& per_obj);

Finally, the fill methods can be defined. These are only needed if your classes make reference to objects that are not subobjects of your header class:

//StatusCode fillRepRefs(IOpaqueAddress* addr, DataObject* dobj);
//StatusCode fillObjRefs(IOpaqueAddress* addr, DataObject* dobj);

FIXME This is a low level method. We should clean it up so that, at least, the needed dynamic_cast<> on the DataObject* is done in the base class.

Converter Implementation File

This section describes what boilerplate each converter needs to implement. It doesn’t go through the actual copying code. Look to the actual code (such as GenHeaderCnv.cc) for examples.

First the initial boilerplate and constructors/destructors.

#include "GenHeaderCnv.h"
#include "PerBaseEvent/HeaderObjectCnv.h"

using namespace DayaBay;
using namespace std;

GenHeaderCnv::GenHeaderCnv(ISvcLocator* svc)
    : RootIOTypedCnv<PerGenHeader,GenHeader>("PerGenHeader",
                                             classID(),svc)
{ }
GenHeaderCnv::~GenHeaderCnv()
{ }

Note that the name of the persistent class, the class ID number and the ISvcLocator all must be passed to the parent class constructor. One must get the persistent class name correct as it is used by ROOT to locate this class’s dictionary.

When doing the direct copies, first delegate copying the HeaderObject part to its converter:

// From Persistent to Transient
StatusCode GenHeaderCnv::PerToTran(const PerGenHeader& perobj,
                                   DayaBay::GenHeader& tranobj)
{
    StatusCode sc = HeaderObjectCnv::toTran(perobj,tranobj);
    if (sc.isFailure()) return sc;

    // ... rest of specific p->t copying ...

    return StatusCode::SUCCESS;
}

// From Transient to Persistent
StatusCode GenHeaderCnv::TranToPer(const DayaBay::GenHeader& tranobj,
                                   PerGenHeader& perobj)
{
    StatusCode sc = HeaderObjectCnv::toPer(tranobj,perobj);
    if (sc.isFailure()) return sc;

    // ... rest of specific t->p copying ...

    return StatusCode::SUCCESS;
}

For filling references to other objects, implement the low-level Gaudi methods fillRepRefs (to fill references in the persistent object) and fillObjRefs (for the transient). As above, first delegate the filling of the HeaderObject part to HeaderObjectCnv.

StatusCode GenHeaderCnv::fillRepRefs(IOpaqueAddress*, DataObject* dobj)
{
    GenHeader* gh = dynamic_cast<GenHeader*>(dobj);
    StatusCode sc = HeaderObjectCnv::fillPer(m_rioSvc,*gh,*m_perobj);
    if (sc.isFailure()) { ... handle error ... }

    // ... fill GenHeader references, if there were any, here ...

    return sc;
}

StatusCode GenHeaderCnv::fillObjRefs(IOpaqueAddress*, DataObject* dobj)
{
    HeaderObject* hobj = dynamic_cast<HeaderObject*>(dobj);
    StatusCode sc = HeaderObjectCnv::fillTran(m_rioSvc,*m_perobj,*hobj);
    if (sc.isFailure()) { ... handle error ... }

    // ... fill GenHeader references, if there were any, here ...

    return sc;
}

Register Converter with Gaudi

One must tell Gaudi about the converter by adding two files, both named after the package, with “_entries.cc” and “_load.cc” suffixes. First, the “load” file is very short:

#include "GaudiKernel/LoadFactoryEntries.h"
LOAD_FACTORY_ENTRIES(PerGenEvent)

Note one must use the package name in the CPP macro. Next the “entries” file has an entry for each converter (or other Gaudi component) defined in the package:

#include "GaudiKernel/DeclareFactoryEntries.h"
#include "GenHeaderCnv.h"
DECLARE_CONVERTER_FACTORY(GenHeaderCnv);

Resolving references

The Data Model allows for object references and the I/O code needs to support persisting and restoring them. In general the Data Model will reference an object by pointer while the persistent class must reference an object by an index into some container. To convert pointers to indices and back, the converter must have access to the transient data and the persistent container.

Converting references can be additionally complicated when an object held by one HeaderObject references an object held by another HeaderObject. In this case the converter of the first must be able to look up the converter of the second and obtain its persistent object. This can be done as illustrated in the following example:

#include "Event/SimHeader.h"
#include "PerSimEvent/PerSimHeader.h"
StatusCode ElecHeaderCnv::initialize()
{
    MsgStream log(msgSvc(), "ElecHeaderCnv::initialize");

    StatusCode sc = RootIOBaseCnv::initialize();
    if (sc.isFailure()) return sc;

    if (m_perSimHeader) return StatusCode::SUCCESS;

    RootIOBaseCnv* other = this->otherConverter(SimHeader::classID());
    if (!other) return StatusCode::FAILURE;

    const RootIOBaseObject* base = other->getBaseObject();
    if (!base) return StatusCode::FAILURE;

    const PerSimHeader* pgh = dynamic_cast<const PerSimHeader*>(base);
    if (!pgh) return StatusCode::FAILURE;

    m_perSimHeader = pgh;

    return StatusCode::SUCCESS;
}

A few points:

  • This is done in initialize() because the pointer to the persistent object obtained at the end will not change throughout the life of the job, so the converter can cache it.
  • It is important to call the base class’s RootIOBaseCnv::initialize() method first.
  • The other converter is then looked up by class ID number with otherConverter().
  • Its persistent object is retrieved as a RootIOBaseObject with getBaseObject() and dynamic_cast to the concrete class.
  • Finally, the result is stored in a data member for later use during conversion.

CMT requirements File

The CMT requirements file needs:

  • Usual list of use lines
  • Define the headers and linker library for the public data classes
  • Define the component library
  • Define the dictionary for the public data classes

Here is the example for PerGenEvent:

package PerGenEvent
version v0

use Context    v*     DataModel
use BaseEvent    v*     DataModel
use GenEvent    v*     DataModel
use ROOT       v*     LCG_Interfaces
use CLHEP      v*     LCG_Interfaces
use PerBaseEvent v*     RootIO

# public code
include_dirs $(PERGENEVENTROOT)
apply_pattern install_more_includes more="PerGenEvent"
library PerGenEventLib lib/*.cc
apply_pattern linker_library library=PerGenEventLib

# component code
library PerGenEvent components/*.cc
apply_pattern component_library library=PerGenEvent


# dictionary for persistent classes
apply_pattern reflex_dictionary dictionary=PerGenEvent \
              headerfiles=$(PERGENEVENTROOT)/dict/headers.h \
              selectionfile=../dict/classes.xml

Footnotes

[1] Correct filling of the Archive Event Service is only guaranteed when using simple forward navigation.
[2] FIXME This needs to be described in the Data Model chapter and a reference added here.
[3] This is actually done in RootIOTest/DybStorageAlg.