ROOT Version: 6.28.04
Platform: Rocky Linux 9.4
Compiler: g++ 11.4
Hello!
I’m writing a DAQ application that receives Hit, Waveform, PPS, etc. objects and writes std::vectors of these to a ROOT file.
To minimize data copying in the DAQ we work with pointers to these objects, so at the point of write-out we have std::vector<Hit*> etc.
While ROOT can write such vectors to file directly, I think it would be better for files to contain std::vector<Hit> rather than std::vector<Hit*>.
Is there a simple way to do this conversion as part of the output file writing, without copying the data to a new vector?
My initial thought is a custom TStreamer, but it seems those are fraught with difficulty; and the fact that ROOT can read and write vector<Thing*> suggests that the underlying functionality of de-referencing pointed-to elements is already part of ROOT’s capabilities, and that I would be making the job needlessly complex.
My first thought is: why don’t you create the std::vector<Hit> to begin with then? I can’t follow the reasoning why you can’t use a vector of values if you want to avoid data copying. You “just” need to refactor your code such that the Hit is already constructed in-place in the vector. Or is that not an option?
Also, depending on how lightweight the Hit class is (maybe it’s just some pointers itself?), you might consider std::move-ing them into the vector.
Could any of this be an option?
It would also be interesting to hear the opinion of @pcanal, our IO expert.
Well, if that were an option, we would have done it in the first place, right?
The specifics of why we can’t do that get into the weeds of how our DAQ works. We have a lot of devices sending hits over a network asynchronously, so we don’t just get all the data for one TTree entry in one nice chunk. In fact, copying the Hits out of the network messages would be one copy; then we need to combine the hits from multiple devices, which requires std::vector::push_back (since the number of hits in a readout isn’t known in advance) and so may introduce several more copies; then there’s sorting, triggering, window building… etc. There’s a lot of shuffling around of data, and testing has shown that pointers to hits are more efficient. “Just refactor your DAQ” is… maybe not the right priority.
We could of course reduce it down to one copy (or std::move) to produce the final vector ready for writing, but if the mechanism for iterating over a vector and streaming the elements in-place to a TBuffer exists…it seems like that would be the most straightforward to do.
The file layout for a vector<object*> and a vector<object> is different and thus there is no trivial conversion implemented.
A custom streamer could work and be ‘relatively’ simple if you store the vector by itself.
If the vector is stored in a TTree, the best solution is to implement a custom CollectionProxy for your vector (and this would also work for the direct case). This is not trivial to do, but relatively straightforward. See TCollectionProxyInfo.h for the kind of things we may have to do to pull it off.
A third option, if the vector is saved as part of an object, is to write an I/O customization rule for reading (this would delay the copy until the reading of the file).
A fourth option is to write the elements individually (but then reading will be a bit more complex).
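For the third option, the read rule would go in the LinkDef used for dictionary generation. The sketch below assumes a hypothetical Event class whose std::vector<Hit*> fHits member should be read back as std::vector<Hit>; it follows the documented I/O customization rule syntax but is untested and would need adapting to the real class:

```cpp
// Hypothetical Event/fHits. Inside "code", the on-file member is reached
// via "onfile."; the in-memory target member is referenced directly.
#pragma read sourceClass="Event" targetClass="Event" version="[1-]" \
    source="std::vector<Hit*> fHits" target="fHits" \
    code="{ fHits.clear(); fHits.reserve(onfile.fHits.size()); \
            for (const Hit* h : onfile.fHits) fHits.push_back(*h); }"
```

The copy is thereby deferred to read time, as described above; the file itself still contains the pointer layout.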
then there’s sorting, triggering, window building … etc etc. There’s a lot of shuffling around of data, testing has shown that pointers to hits are more efficient. “Just refactor your DAQ” …
Technically there is a fifth (unlikely) option :). On the DAQ side you could hold the objects in a std::list<Thing> as the main container, and std::vector<Thing*> for things that are views.
The only reason this is even vaguely an option is that, once stored, a std::list<Thing> can be trivially converted (nothing to do on your side but changing the type) into a std::vector<Thing>.
Hi Philippe,
Thanks for helping out. I could be misunderstanding, but I’m not sure any of these proposals result in std::vector<Thing> on disk…?
The std::list suggestion is a good one, but unfortunately any container of Hit requires the data be copied - right now the Hit*s point to parts of a binary blob in a network message. If it’s going to be moved anywhere, it might as well be into a std::vector before writing.
A custom streamer could work and be ‘relatively’ simple if you store the vector by itself.
This is the most promising line! But I’m confused by the “if” qualifier… Do you mean if we’re storing a single instance of vector<Hit> outside of a TTree with TDirectoryFile::WriteObject? The hope was to have this as a branch (with different branches for vector<Hit>, vector<Waveform>, …), so that an event maps to a TTree entry. But does presence in a TTree affect how the Streamer works?
The I/O customization rule seems like by far the simplest option, but I think it requires encapsulating the vectors in a class.
Regarding CollectionProxy: My naive first step was to generate a dictionary for vector<Hit*> with rootcling and then look at the automatically generated Streamer - which of course didn’t exist. This makes sense with the manual’s statement
instead of implementing dedicated streaming functions for std::vector, std::list, etc., as well as ROOT’s collection types, ROOT implements an abstraction layer for the required I/O functionality
The manual is sparse on details here, saying users may ‘implement the TVirtualCollectionProxy interface’… I’m not exactly sure what that means. Is the idea of TVirtualCollectionProxy that a user can inherit from this class and implement concrete methods for reading arbitrary data into varying types of stl containers…?
In such a case I would … encapsulate my std::vector<Hit*> into a wrapper class deriving from TVirtualCollectionProxy, which could then be read back into a std::vector<Hit> via suitable accessor functions…? i.e. the TTree branch on file would hold some class, but could be read with TTree::SetBranchAddress into a std::vector<Hit>?
While that sounds like it may give the desired end-user experience, I’m not sure how it works in practice: I still need to serialise my std::vector<Hit*> onto disk… is this just a more generalised version of an I/O customization rule?
A custom streamer would still work for a branch, but would prevent the splitting of the content of the vector into sub-branches, resulting in slightly less performant files and the inability to partially read the content of the vector (i.e. to read just one data member of the Hit).
I’m not exactly sure what that means. Is the idea of TVirtualCollectionProxy that a user can inherit from this class and implement concrete methods for reading arbitrary data into varying types of stl containers…?
Almost … it is not ‘implement … for reading arbitrary data’ but instead ‘implement methods to traverse and fill the collection’.
Regarding CollectionProxy: My naive first step was to generate a dictionary for vector<Hit*> with rootcling and then look at the automatically generated Streamer - which of course didn’t exist.
Indeed. The part that registers a collection proxy is similar in spirit: if we were going that route, we would need to write a TCollectionProxy or TCollectionProxyInfo that tells the system that the vector<Hit*> should be seen as containing Hit, and provides the proper accessors (which dereference the pointer).
If you want to go this route, the best is that I prepare a small example.
The manual says splitting makes reading faster but writing slower, so I think non-splitting is better for our situation anyway. I also can’t imagine we would often want just part of a Hit/Waveform/whatever - we already split the event into branches that can be turned on/off.
So it sounds like both solutions would be workable, but an example or more details on how to go about it in either case would be much appreciated. I suppose whichever is simpler to implement.