Tree with missing branch values for some events

Hi, I have a data model with two objects A and B describing my event. These objects may or may not be present for each event; for example, one of them could be a particle track which for untracked events is not present. I would like to save A and B in different branches of a TTree, but as far as I know in this way I must record both A and B for each event, in order to keep the branch entries aligned with the event number.
Is there a way to fill a “null value” for a given branch for those events which do not have a value for the content of that branch? If so, how do I recognize a null value when reading the event back?
Thanks.

Hi,

You have two ways to do so.

Either create a struct holding a pointer to an A and a pointer to a B and create a split branch for this struct (so that the A and B will be in their own branches … but will not be split themselves).

Or create two branches containing a std::vector<A> and a std::vector<B>, with each vector containing 0 or 1 element (in which case the A and B can be split).
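The "0 or 1 element" convention can be sketched in plain C++, leaving out the ROOT branch wiring. This is only an illustration: `Track`, `TrackSlot`, `setSlot` and `getTrack` are hypothetical names that do not appear in the thread, and `Track` stands in for the real A or B class. In a real program the vector would be the object attached to the branch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the user's object type A (not from the thread).
struct Track {
    double momentum = 0.0;
};

// One event's slot for an optional object: an empty vector means
// "missing", a one-element vector holds the value.
using TrackSlot = std::vector<Track>;

// Writer side: call before filling each event.
inline void setSlot(TrackSlot& slot, const Track* maybeTrack) {
    slot.clear();  // drop whatever the previous event left behind
    if (maybeTrack != nullptr) {
        slot.push_back(*maybeTrack);
    }
}

// Reader side: returns nullptr when the event had no track.
inline const Track* getTrack(const TrackSlot& slot) {
    assert(slot.size() <= 1);  // the convention allows only 0 or 1 element
    return slot.empty() ? nullptr : &slot.front();
}
```

On the reading side, a null result from `getTrack` plays the role of the "null value" asked about in the first post.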

Cheers,
Philippe.

Thanks Philippe, actually your solutions work for my example but the example is just a simplified version of my real problem. Anyway, I stumbled on this post where a similar problem was posed and you proposed a solution leveraging friend trees and TTree::BuildIndex. This made me think about this possibility:

  1. create one tree per object, with a single branch holding that object. These trees have different number of entries, since for each event only a subset of objects may be available (and this subset can change from event to event)
  2. create an index tree with an entry for each event; this tree has one branch containing a list of integer indexes equal to the entry number in object trees of objects belonging to this event; a -1 index for a given object tree would mean that that object is not available for the current event

This is simple enough to guess that it would work even before having tried it, and would solve my problem and make me happy. I have only one concern: data will be split among different trees in the same file, and I read here that having one tree per object might impact I/O performance severely. That post is quite old but I assume it still holds, right? If so, maybe I can use a single tree with an index branch containing one entry per event and different branches for objects with different number of entries, to be read one by one according to index values for current event. But I fear that I/O problems will still be present with this approach. I don’t know how the Root I/O works at low level, so I’d need some guidance on this. Thanks.

Hi Nicola.

In order for things to make sense (especially for human brains :) ), you will need to store in each tree the event number alongside the objects (so there is data duplication there). Comparatively, the vector option also increases the data size by having to store the ‘vector-size’ for each entry. So in terms of pure data size it is similar. The multiple-TTree option is still a bit larger as it needs to duplicate the TTree structure itself.

The real difference might be in performance. TTrees are optimized to be read sequentially (reading out of order will lead to a waste of CPU by redoing the same decompression multiple times). So as long as the entries in all the TTrees are in the same order, you would be fine there too.

The last difference will be seen mostly when reading the file remotely. In order to have efficient reads over high-latency links, the TTree prefetches many entries at a time (into a TTreeCache, by default 32 MB in memory). With multiple TTrees, you can either preserve performance by keeping each TTreeCache at its default size, or preserve memory by reducing the size of each TTreeCache. In the first case you end up using 32 MB times the number of TTrees for caching. In the second case you would set each cache size to 32 MB divided by the number of TTrees, but this would result in that many more remote reads (i.e. possibly much lower performance over high-latency links).
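The trade-off can be put into numbers with a small helper. This is plain arithmetic on the 32 MB figure quoted in the post, not a call into ROOT; the function names are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kMB = 1024 * 1024;
const std::size_t kDefaultCacheBytes = 32 * kMB;  // figure quoted in the post

// Option 1: every tree keeps a full-size cache, so memory grows with the tree count.
inline std::size_t totalCacheBytes(std::size_t nTrees) {
    return nTrees * kDefaultCacheBytes;
}

// Option 2: one default-size budget is shared, so each tree gets a smaller cache.
inline std::size_t perTreeCacheBytes(std::size_t nTrees) {
    assert(nTrees > 0);  // dividing the budget needs at least one tree
    return kDefaultCacheBytes / nTrees;
}
```

For example, with 4 object trees the first option costs 128 MB of cache memory, while the second leaves each tree with an 8 MB cache and correspondingly more remote reads.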

Cheers,
Philippe.

Hi, storing an extra integer for each object is not a big deal for me, since the objects will be quite fat and an extra int is just a small relative size overhead. In my solution I do not store the event number in the object trees; rather, I store, in what I called the index tree, the entry number in each object tree that corresponds to the current event. So for example, if event 0 has only object0 and event 1 has object0 and object1, the indexes for event 0 will be (0, -1) and for event 1 (1, 0): event 0 is associated with entry 0 of the object0 tree and no entry of the object1 tree, while event 1 is associated with entry 1 of the object0 tree and entry 0 of the object1 tree. After reading the indexes for the current event, I read each object tree using the associated index.
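The bookkeeping behind the index tree can be sketched without ROOT as follows. `buildIndex` and the `presence` table are hypothetical names used only for this illustration; the sketch computes, for each event, the entry number to read in each object tree, with -1 marking a missing object.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// presence[e][k] is true when event e has an object of kind k.
// Returns, per event, the entry number to read in each object tree,
// or -1 when the object is missing for that event.
std::vector<std::vector<long>> buildIndex(
        const std::vector<std::vector<bool>>& presence, std::size_t nKinds) {
    std::vector<long> nextEntry(nKinds, 0);  // entries written so far per object tree
    std::vector<std::vector<long>> index;
    index.reserve(presence.size());
    for (const auto& event : presence) {
        assert(event.size() == nKinds);
        std::vector<long> row(nKinds, -1);  // start with "missing" everywhere
        for (std::size_t k = 0; k < nKinds; ++k) {
            if (event[k]) {
                row[k] = nextEntry[k]++;  // object k goes to the next free entry
            }
        }
        index.push_back(row);
    }
    return index;
}
```

Feeding it the two events of the example (object0 only, then object0 and object1) gives the rows (0, -1) and (1, 0).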
Actually, I put together a solution with a single tree with one index branch (one entry per event) and many object branches with different numbers of entries; each branch is read separately, i.e. no call to TTree::GetEntry is made, but just many TBranch::GetEntry calls with different indexes. I don’t know if this solution would suffer from caching problems similar to those of multiple trees, but I’ll do some experiments.
Thanks again.

The challenge with both of your latest solutions is that they preclude using the pre-canned tools (for example TTree::Draw), which assume that the same entry in two branches is related, and using an index, which assumes that there is one branch in both TTrees holding a ‘key’ indicating the match.

The caching issue will be worse with your 2nd option (one unbalanced TTree), as the TTreeCache will prefetch the same entry range for all the branches but you will end up reading entries outside of this range. For example, at some point the cache would read entries 120 through 150 of all (enabled) branches. Because one of the branches is more populated than the others, let’s say that its corresponding entry is 180; when doing GetEntry on that branch, the TTreeCache will think that we are done with the current range, throw it out and get the range containing 180 (let’s say 180 through 210). If you then read one of the less populated branches, let’s say entry 140, its GetEntry will force the TTreeCache to drop the 180-210 range and read (again) the 120-150 range …
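The thrashing described above can be reproduced with a toy model. The sketch below is a deliberate simplification and not TTreeCache's real logic: `WindowCache` and `countLoads` are hypothetical names, and the model keeps a single window of consecutive entries that is reloaded whenever a requested entry falls outside it.

```cpp
#include <cassert>
#include <vector>

// Toy model of a cache that holds one window of consecutive entries.
// This is a simplification for illustration, not TTreeCache's real logic.
class WindowCache {
public:
    explicit WindowCache(long windowSize) : size_(windowSize) {
        assert(windowSize > 0);
    }

    // Request one entry; reload the window when the entry falls outside it.
    void read(long entry) {
        assert(entry >= 0);
        if (first_ < 0 || entry < first_ || entry >= first_ + size_) {
            first_ = (entry / size_) * size_;  // window aligned on multiples of size_
            ++loads_;
        }
    }

    long loads() const { return loads_; }

private:
    long size_;
    long first_ = -1;  // -1 means nothing cached yet
    long loads_ = 0;
};

// Count window loads for a given sequence of requested entries.
inline long countLoads(const std::vector<long>& entries, long windowSize) {
    WindowCache cache(windowSize);
    for (long e : entries) {
        cache.read(e);
    }
    return cache.loads();
}
```

With a window of 30 entries, alternating between entries around 140 and around 180 reloads the window on every single read, whereas reading the same entries grouped by range needs only two loads.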

So I think the only viable options are the vector case or the event-number+obj in multiple friend TTrees case.

Cheers,
Philippe.

It seems that reading branches one by one is always a bad idea, even for balanced trees. I produced a tree with three balanced branches and 1000 events, and printed the readout figures with file->GetBytesRead() and file->GetReadCalls(). After reading each event with a single TTree::GetEntry I get:

Read 2526460 bytes in 7 transactions

while with one TBranch::GetEntry for each branch I obtain:

Read 2526460 bytes in 139 transactions

So it seems that the only way for me is to use one tree per object (the vector case would not work in my real world case, just in the simplified example of my 1st post), which is far from optimal anyway. Missing event objects would have been very useful for me but these performance issues of the Root I/O in this case cannot be fixed in a satisfactory way, it seems. I’ll try to redesign my data model to avoid missing objects.
Thanks Philippe for your support and advice.

I came up with another solution. The first time an object is missing I build a default one using TClass::New, and set the address of this default object every time the real data object is missing before calling TTree::Fill. In this way the tree will have balanced branches, and no read problems.

If I understand correctly, the compression algorithm will pack default objects so that the size on disk should be optimized; from a quick check, the file size when a random 50% of events have a missing object is slightly greater than the mean size between 0% missing objects and a file without the branch containing missing objects (i.e. 100% missing objects without writing default objects). Does this sound correct?

To flag missing objects I use a flag branch for each object branch. The flag branch contains just a bool which is true if the object is not missing for current event.
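A ROOT-free sketch of this flag-plus-default-object scheme is shown below. `Hit`, `Row`, `fillEvent` and `getHit` are hypothetical names, and a `std::vector<Row>` stands in for the tree: each row corresponds to one event, the `hit` member to the object branch and the `present` member to the flag branch. In the real code the placeholder would come from TClass::New as described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical payload type standing in for the real data object.
struct Hit {
    int channel = 0;
    double energy = 0.0;
};

// One row per event: the payload is always written so that every branch
// gets an entry, and the flag says whether the payload is real.
struct Row {
    bool present;
    Hit hit;
};

// Append one event. A null pointer means the object is missing, in which
// case a default-constructed Hit is stored as a placeholder.
inline void fillEvent(std::vector<Row>& rows, const Hit* maybeHit) {
    static const Hit kDefault{};  // built once, reused for every missing object
    rows.push_back(Row{maybeHit != nullptr, maybeHit ? *maybeHit : kDefault});
}

// Reader side: only call after checking the flag.
inline const Hit& getHit(const Row& row) {
    assert(row.present);  // placeholder payloads must not be used as data
    return row.hit;
}
```

Because every event writes one payload and one flag, the two columns stay aligned entry by entry, which is what allows a single whole-event read per event.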

I still have to make some more refined tests about disk occupancy to convince myself that many missing objects (written as default objects) will not inflate the file too much.
