What's the correct way to write TTreeReaderValue<> into a new TTree?

Hi,

I'm trying to write a TTree by reading the data from another TTree. We have our own algorithms to decide which branches and which entries should be written to the new output tree.

As suggested, we are also using TTreeReader for reading the data:

auto input_file = TFile{ "input.root", "read"};
auto output_file = TFile {"output.root", "recreate"};

auto* tree_data = input_file.Get<TTree>("TreeName");
auto tree_reader = TTreeReader{};
tree_reader.SetTree(tree_data);

auto output_tree = TTree{"OutputTreeName", "OutputTreeName"};

auto my_branch = TTreeReaderValue<MyBranch>{ tree_reader, "MyBranchName" };
auto check_branch = TTreeReaderValue<double>{ tree_reader, "CheckBranchName" };

tree_reader.Next();

// Must be after calling Next(). Otherwise it doesn't work.
output_tree.Branch(my_branch.GetBranchName(), my_branch.Get());

while(tree_reader.Next())
{
    if(not algorithm_check(*check_branch)) continue;
    *my_branch; // Even though my_branch value is not needed, it must be dereferenced.
    output_tree.Fill();
}

output_file.WriteObject<TTree>(&output_tree, output_tree.GetName());

Here are some caveats in the code above:

  1. my_branch must be dereferenced every time before calling Fill() on the output tree. Otherwise, the same value will be filled every time. What's worse, the compiler could optimize this line away if it sees that the value is dereferenced but never used.
  2. It’s very slow. The branch with “MyBranchName” has to be deserialized, copied and serialized again while its deserialized values are never used.
  3. Registering the branch with the output tree must happen after calling Next() on the tree reader. Otherwise, the address obtained from the reader is just nullptr. I know lazy operations can sometimes make the code faster, but they shouldn't make the code illogical.
  4. No multithreading.

Other alternatives seem even worse:

  • Using the native TTree API means dealing with T** for any user-defined class.
  • RDataFrame seems to have very nice APIs, but in practice it's very hard to integrate into a large event-driven C++ code base.

Reading data from input ROOT files and writing some entries to another ROOT file should be a very common task. I would really appreciate it if the ROOT devs could suggest a better way in terms of performance and simplicity.

Thanks for your attention.

ROOT Version: 6.28
Platform: Debian buster
Compiler: gcc 13


Hi,

Let me try to reply to your first 3 points.

my_branch must be dereferenced every time before calling Fill() on the output tree. Otherwise, the same value will be filled every time. What's worse, the compiler could optimize this line away if it sees that the value is dereferenced but never used.

This is correct: the access is lazy on purpose, so that data is not decompressed and deserialised when not needed (note that decompression is the expensive part). The operations behind that simple dereferencing are such that the compiler would have a hard time optimizing them away. Of course, let us know if you have evidence to the contrary!
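For what it's worth, you can make the load more explicit in the loop by calling Get(), which performs the same lazy read as operator* and returns a pointer you can actually use. A minimal sketch, reusing the variable names from your snippet:

while(tree_reader.Next())
{
    if(not algorithm_check(*check_branch)) continue;
    // Get() triggers the same lazy read as dereferencing; using the returned
    // pointer makes the intent explicit instead of a bare "*my_branch;" statement.
    MyBranch* loaded = my_branch.Get();
    if(loaded == nullptr) break; // the reader could not provide the value
    output_tree.Fill();
}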

It’s very slow. The branch with “MyBranchName” has to be deserialized, copied and serialized again while its deserialized values are never used.

It would be interesting to see some profiling and understand what the expectations are. In the presence of two columns, one used to filter entries and one to be kept on disk, it is not possible to skip the decompression of clusters of entries (entries are not compressed individually in the TTree columnar format) in order to read and write only the selected entries. The cost of deserialisation is in general not very high.
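As a starting point for such profiling, a minimal sketch (reusing the variables from your snippet; TStopwatch ships with ROOT in "TStopwatch.h") could time the event loop:

TStopwatch timer;
timer.Start();
while(tree_reader.Next())
{
    if(not algorithm_check(*check_branch)) continue;
    *my_branch;
    output_tree.Fill();
}
timer.Stop();
// Real time much larger than CPU time usually points at I/O; comparable values
// point at decompression/deserialisation work.
timer.Print();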

Registering the branch with the output tree must happen after calling Next() on the tree reader. Otherwise, the address obtained from the reader is just nullptr. I know lazy operations can sometimes make the code faster, but they shouldn't make the code illogical.

Indeed, operations are lazy. Please let us know if you have suggestions for this particular use case!

No multithreading.

Not exactly. I understand you want to have full control over the event loop (see the comment about RDF below): please correct me if I am wrong. In this case you can leverage the fact that multiple threads can safely open files and process the trees in those files. Such an approach requires splitting the work among threads properly, but it can be done, and it is done by several users (and LHC experiments).
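A minimal sketch of that pattern, with placeholder file, tree, and branch names, and with the per-thread output left out for brevity:

#include <memory>
#include <thread>
#include <vector>
#include "TFile.h"
#include "TROOT.h"
#include "TTreeReader.h"
#include "TTreeReaderValue.h"

// Each thread opens its own TFile and TTreeReader and processes a distinct
// entry range; each thread would also write to its own output file.
void process_range(Long64_t begin, Long64_t end)
{
    std::unique_ptr<TFile> file{TFile::Open("input.root", "read")};
    if (!file || file->IsZombie()) return;
    TTreeReader reader("TreeName", file.get());
    reader.SetEntriesRange(begin, end); // end is exclusive
    TTreeReaderValue<double> check_branch(reader, "CheckBranchName");
    while (reader.Next()) {
        // ... filter and copy entries, as in the single-threaded version ...
    }
}

int main()
{
    ROOT::EnableThreadSafety(); // required before using ROOT I/O from several threads
    const Long64_t nEntries = 1000000; // total number of entries, assumed known
    const int nThreads = 4;
    const Long64_t chunk = nEntries / nThreads;
    std::vector<std::thread> workers;
    for (int i = 0; i < nThreads; ++i)
        workers.emplace_back(process_range, i * chunk,
                             (i == nThreads - 1) ? nEntries : (i + 1) * chunk);
    for (auto &w : workers) w.join();
}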

Using the native TTree API means dealing with T** for any user-defined class.

Clearly the TTree API is more sophisticated than analysis-oriented APIs such as TTreeReader or RDF; however, it is very powerful and well suited to frameworks and large event-driven C++ code bases. About the usage of T**, I am not sure I fully understand what you refer to and the problem that is bothering you. It is a low-level interface that has been battle-tested for many years: as such, it does not reflect the most recent C++ standards. It is also for this reason that ROOT is developing RNTuple [1, 2], also in collaboration with the (LHC) experiments.

RDataFrame seems to have very nice APIs, but in practice it's very hard to integrate into a large event-driven C++ code base.

Indeed. RDF hides the event loop from the user in order to apply all optimisations internally; if your problem requires you to manage the event loop, then it might not be a perfect fit. However, it remains very powerful and would give you the opportunity (and the guarantee) to apply the filtering and writing of data in the most efficient way ROOT allows. Multithreading would then come for free and would be managed internally by RDF.
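For reference, a minimal sketch of what the filter-and-write pattern looks like in RDF; the cut string is a placeholder for your algorithm_check, and the tree/branch names are taken from your snippet:

#include "ROOT/RDataFrame.hxx"
#include "TROOT.h"

int main()
{
    ROOT::EnableImplicitMT(); // multithreading handled internally by RDF
    ROOT::RDataFrame df("TreeName", "input.root");
    // Keep only the selected entries and write the requested branch to a new file.
    df.Filter("CheckBranchName > 0.5") // placeholder cut
      .Snapshot("OutputTreeName", "output.root", {"MyBranchName"});
}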

Reading data from input ROOT files and writing some entries to another ROOT file should be a very common task.

Yes, you are completely right. Some call it skimming, and it's one kind of centrally managed job that all the experiments have been submitting copiously to the WLCG for many years, stressing network and mass-storage backends. So far, ROOT has managed to support this kind of workflow quite well and will continue to do so with RNTuple in the years to come.

Cheers,
Danilo

Thanks very much for your detailed explanations.

This is correct: the access is lazy on purpose, so that data is not decompressed and deserialised when not needed (note that decompression is the expensive part). The operations behind that simple dereferencing are such that the compiler would have a hard time optimizing them away. Of course, let us know if you have evidence to the contrary!

I see. But is there any other way to load the branch data into memory apart from dereferencing? It would be very nice if there were a function like Load() that loads the data into memory.

Indeed, operations are lazy. Please let us know if you have suggestions for this particular use case!

Yes, it would be nice if TTreeReaderValue::Get() could return a valid address of the branch variable. I guess it's possible, since all we need to do is initialize a variable and return its address?
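To illustrate, one possible refinement of my original workaround (assuming, as in my original snippet, that the address returned by Get() stays valid for the whole loop, and that TTreeReader::Restart() rewinds the reader to the first entry, as its documentation says) would be to rewind after registering the branch so the first entry is not skipped:

tree_reader.Next();    // consume one entry so the reader sets up valid addresses
output_tree.Branch(my_branch.GetBranchName(), my_branch.Get());
tree_reader.Restart(); // rewind so the loop below starts from the first entry

while(tree_reader.Next())
{
    // ... filter and Fill() as before ...
}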

About the usage of T**, I am not sure I fully understand what you refer to and the problem that is bothering you.

I can explain this problem in more detail:

auto input_file = TFile{ "input.root", "read" };
auto* tree = input_file.Get<TTree>("tree_name");

// create a variable for reading the branch
auto my_branch = std::make_unique<MyBranch>();
auto* my_branch_ptr = my_branch.get();
tree->SetBranchAddress("my_branch_name", &my_branch_ptr);

// load the entry and read branch variable...

So here you see I must pass a pointer to the pointer to MyBranch if MyBranch is a user-defined struct. What's more important is that the lifetime of my_branch_ptr must not end before the reading of the branch is complete. In our production code, it happened quite often that we called SetBranchAddress in a function and the variable my_branch_ptr went out of scope. In the end, we either got a segfault or wrong values. For us, this is the biggest problem when reading a tree.
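To make the failure mode concrete, here is a hypothetical minimal example of the pattern that bit us (the function and the names are made up for illustration):

void setup_reading(TTree* tree)
{
    auto my_branch = std::make_unique<MyBranch>();
    auto* my_branch_ptr = my_branch.get();
    // The tree stores &my_branch_ptr, i.e. the address of a local variable.
    tree->SetBranchAddress("my_branch_name", &my_branch_ptr);
} // my_branch and my_branch_ptr are destroyed here, but the tree still points at them

// Any later tree->GetEntry(i) reads through the dangling MyBranch** and either
// crashes or produces wrong values, as described above.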

Indeed. RDF hides the event loop from the user in order to apply all optimisations internally; if your problem requires you to manage the event loop, then it might not be a perfect fit. However, it remains very powerful and would give you the opportunity (and the guarantee) to apply the filtering and writing of data in the most efficient way ROOT allows. Multithreading would then come for free and would be managed internally by RDF.

Yes, you are right. I use RDF all the time at the end of the pipeline, when I just need to check the data in a ROOT file in a very cheap and quick way. But how does RDF work for TTree writing? I didn't see any such example on the RDF webpage.
