Using RDataFrame's Book method to write tree

I’m trying to use RDataFrame’s Book method to fill and write a tree, and I’m running into a few issues. I need to use the Book method because the columns in the input tree are custom classes, and the user should be able to configure the logic used to fill the output branches (which are also defined by the user).

When I use simple variables like integers or doubles as branches, things work fine, and I can merge the trees together in the Finalize function without issues. But when I use a std::vector<double> as a branch, the vectors in the resulting tree are all empty (size 0), and the number of entries in the final tree doesn’t agree with the sum of entries in the trees created by the slots/workers. And when I use a C-style array of doubles, the merging works (the numbers of entries agree), but the resulting entries in the tree are all zero (meaning the size of the array is correct, but each array element is zero).

The interesting thing when trying to merge trees with vectors as branches is that the resulting tree’s number of entries corresponds to the sum of the entries of all trees minus the second to last. This doesn’t change when changing the number of workers (except of course if it’s just one worker).
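For reference, the expectation described above (the merged entry count equals the sum over all slots, and the vector contents survive the merge) can be written down without ROOT. This is only a minimal sketch of the per-slot accumulate-then-merge pattern using plain standard containers; `Entry`, `SlotBuffer`, `CountEntries`, and `MergeSlots` are hypothetical names, and nothing here touches TTree or the Book helper interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins, not ROOT types:
using Entry = std::vector<double>;      // value of one vector branch for one entry
using SlotBuffer = std::vector<Entry>;  // all entries recorded by one slot/worker

// Sum of the entries held by the individual slots.
std::size_t CountEntries(const std::vector<SlotBuffer> &slots) {
  std::size_t total = 0;
  for (const auto &slot : slots) total += slot.size();
  return total;
}

// Stand-in for the merge done in Finalize: concatenates the per-slot
// buffers in slot order, copying every entry including its contents.
SlotBuffer MergeSlots(const std::vector<SlotBuffer> &slots) {
  SlotBuffer merged;
  merged.reserve(CountEntries(slots));
  for (const auto &slot : slots) {
    merged.insert(merged.end(), slot.begin(), slot.end());
  }
  return merged;
}
```

With this, `MergeSlots(slots).size() == CountEntries(slots)` holds for any number of slots, which is the invariant the merged tree fails to satisfy in the vector case.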

I’ve also tried different methods of merging the trees:

  • adding all trees to a TList and using TTree::MergeTrees(TList*)
  • adding the trees of all slots with slot IDs > 0 to a TList and using Merge(TList*) for the tree of slot 0
  • using CopyEntries(TTree*) for the tree of slot 0 to copy the entries of all other trees

All of these seem to have the same result.

Is there any reason why using std::vector<double> as a branch would not work when using RDataFrame?

Hi @vaubee ,

Not really, e.g. RDataFrame(10).Define("x", "std::vector<double>{1,2,3}").Snapshot("t","f.root") works.

My first suggestion would be to try and see whether you really cannot use Snapshot, because multi-thread TTree writing and merging is not trivial in general. If that’s not an option you can check Snapshot’s code to see how it is done there (the workhorse is actually TBufferMerger).

Cheers,
Enrico

I will try and use Snapshot once I find time for it.

EDIT:
I wrote a program that uses Define to fill a vector of a vector of doubles based on four branches (each a custom class) and then Snapshot to write this to a file.
When I try to run it, it ends with a std::system_error exception and this stack trace:

#0  0x00007fffea99237f in raise () from /lib64/libc.so.6
#1  0x00007fffea97cdb5 in abort () from /lib64/libc.so.6
#2  0x00007fffeb56a09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6
#3  0x00007fffeb57053c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007fffeb570597 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007fffeb57084d in __cxa_rethrow () from /lib64/libstdc++.so.6
#6  0x00007ffff37147ab in ROOT::Detail::RDF::RLoopManager::RunTreeReader() [clone .cold.800] () from /opt/cern/root/root_v6.26.00/lib/libROOTDataFrame.so
#7  0x00007ffff376756d in ROOT::Detail::RDF::RLoopManager::Run() () from /opt/cern/root/root_v6.26.00/lib/libROOTDataFrame.so
#8  0x0000000000418a6f in ROOT::RDF::RResultPtr<ROOT::RDataFrame>::TriggerRun (this=<synthetic pointer>) at /opt/cern/root/root_v6.26.00/include/ROOT/RResultPtr.hxx:381
#9  ROOT::RDF::RResultPtr<ROOT::RDataFrame>::Get (this=<synthetic pointer>) at /opt/cern/root/root_v6.26.00/include/ROOT/RResultPtr.hxx:169
#10 ROOT::RDF::RResultPtr<ROOT::RDataFrame>::operator* (this=<synthetic pointer>) at /opt/cern/root/root_v6.26.00/include/ROOT/RResultPtr.hxx:229
#11 ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void>::Snapshot (this=this@entry=0x7fffffffcc30, treename=..., filename=..., columnList=std::vector of length 1, capacity 1 = {...}, options=...)
    at /opt/cern/root/root_v6.26.00/include/ROOT/RDF/RInterface.hxx:1132
#12 0x00000000004115e3 in ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void>::Snapshot (options=..., columnList=..., filename=..., treename=..., this=0x7fffffffcc30) at /usr/include/c++/8/ext/new_allocator.h:86
#13 main (argc=<optimized out>, argv=<optimized out>) at myAnalysis/grsiTree.cxx:181

The line it fails on is where I call Snapshot.

The function that fills the vector of vectors creates the vector of vectors, then a vector inside a loop, and pushes that vector back onto the result, which is returned after the loop. I even simplified it for testing to:

std::vector<std::vector<double>> MyFunction(Class1& cl1, Class2& cl2, Class3& cl3, Class4& cl4)
{
   std::vector<std::vector<double>> res;
   
   for(auto i = 0; i < 2; ++i) {
      std::vector<double> tmp(3, 0.);
      res.push_back(tmp);
   }
   return res;
}

and it still fails.

The call for this is of the form

ROOT::RDataFrame frame(chain);
auto bgdef = frame.Define("myColumn", MyFunction, {"col1", "col2", "col3", "col4"});
auto bgsnap = bgdef.Snapshot("timebg", outputFileName, {"myColumn"});

I double checked that col1, col2, col3, and col4 exist in the input file.

Hi @vaubee ,

writing vector<vector<double>> also works with Snapshot:

#include <ROOT/RDataFrame.hxx>
#include <vector>

std::vector<std::vector<double>> MyFunction() {
  std::vector<std::vector<double>> res;

  for (auto i = 0; i < 2; ++i) {
    std::vector<double> tmp{i * 3., i * 3. + 1., i * 3. + 2.};
    res.push_back(tmp);
  }
  return res;
}

int main() {
  ROOT::RDataFrame df(100);
  auto bgdef = df.Define("myColumn", MyFunction);
  auto bgsnap = bgdef.Snapshot("t", "f.root", {"myColumn"});
}

If you run the code above and then check the output file e.g. with root -l -b -q f.root -e 'ROOT::RDataFrame(*t).Display()->Print()' you should see:

root [0]
Attaching file f.root as _file0...
(TFile *) 0x5581dee34ce0
+-----+-------------------------------------+
| Row | myColumn                            |
+-----+-------------------------------------+
| 0   | { 0.0000000, 1.0000000, 2.0000000 } |
|     | { 3.0000000, 4.0000000, 5.0000000 } |
+-----+-------------------------------------+
| 1   | { 0.0000000, 1.0000000, 2.0000000 } |
|     | { 3.0000000, 4.0000000, 5.0000000 } |
+-----+-------------------------------------+
| 2   | { 0.0000000, 1.0000000, 2.0000000 } |
|     | { 3.0000000, 4.0000000, 5.0000000 } |
+-----+-------------------------------------+
| 3   | { 0.0000000, 1.0000000, 2.0000000 } |
|     | { 3.0000000, 4.0000000, 5.0000000 } |
+-----+-------------------------------------+
| 4   | { 0.0000000, 1.0000000, 2.0000000 } |
|     | { 3.0000000, 4.0000000, 5.0000000 } |
+-----+-------------------------------------+

I do not know where the system_error might come from (RDataFrame never throws system_errors directly, but it’s probably re-throwing an exception that happens in the event loop). Doesn’t the exception come with an error message?

Cheers,
Enrico
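
One way to recover the error message asked about above is to wrap the call that triggers the event loop in a try/catch and report what the exception carries. The following is a minimal sketch using only the standard library; `DescribeFailure` is a hypothetical helper, not part of ROOT, and it assumes the exception propagates to the caller rather than terminating the program first.

```cpp
#include <exception>
#include <functional>
#include <string>
#include <system_error>

// Hypothetical helper (not a ROOT function): runs `action` and returns a
// description of the exception it throws, or an empty string on success.
std::string DescribeFailure(const std::function<void()> &action) {
  try {
    action();
  } catch (const std::system_error &e) {
    // std::system_error carries an error code in addition to the message.
    return "system_error [" + std::string(e.code().category().name()) + ":" +
           std::to_string(e.code().value()) + "]: " + e.what();
  } catch (const std::exception &e) {
    // Any other standard exception: only the message is available.
    return std::string("exception: ") + e.what();
  }
  return "";
}
```

Since Snapshot runs the event loop immediately by default, the call from the question could be passed in as `DescribeFailure([&] { bgdef.Snapshot("timebg", outputFileName, {"myColumn"}); })` and the returned string printed.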
