TDataFrame Snapshot a tree with embedded arrays

I am trying to use TDataFrame to split a data sample into a signal region and a background region. For this I use:

using namespace ROOT::Experimental;
TDataFrame df("Drell-Yan", "dy_tuple_12_md.root");
auto sig = df.Filter("Z0_ENDVERTEX_CHI2 < 5.");
auto bkg = df.Filter("Z0_ENDVERTEX_CHI2 > 15.");
sig.Snapshot("Drell-Yan", "dy_tuple_12_md_signal.root");
bkg.Snapshot("Drell-Yan", "dy_tuple_12_md_hf.root");

This runs and produces the output files, but the files are incomplete. The original tree contains some branches that are float arrays (their length varies from event to event, but all arrays have the same length within a given event). In the output files these arrays are reduced to single float values (presumably the first value of each array).

Here is the relevant part of the tree structure (in the original file):

root [0] auto f = TFile::Open("dy_tuple_12_md.root");
root [1] auto t = (TTree*) f->Get("Drell-Yan");
root [2] t->Print()
******************************************************************************
*Tree    :Drell-Yan : candidates                                             *
*Entries :  6050086 : Total =      8148881413 bytes  File  Size = 5789740455 *
*        :          : Tree compression factor =   1.41                       *
******************************************************************************
*Br   35 :nTrack    : nTrack/I                                               *
*Entries :  6050086 : Total  Size=   24295310 bytes  File Size  =    8119769 *
*Baskets :      966 : Basket Size=      32000 bytes  Compression=   2.99     *
*............................................................................*
*Br   36 :tracks_PX : tracks_PX[nTrack]/F                                    *
*Entries :  6050086 : Total  Size=  802795979 bytes  File Size  =  729794020 *
*Baskets :    26094 : Basket Size=      32000 bytes  Compression=   1.10     *
*............................................................................*
*Br   37 :tracks_PY : tracks_PY[nTrack]/F                                    *
*Entries :  6050086 : Total  Size=  802795979 bytes  File Size  =  730225204 *
*Baskets :    26094 : Basket Size=      32000 bytes  Compression=   1.10     *
*............................................................................*
*Br   38 :tracks_PZ : tracks_PZ[nTrack]/F                                    *
*Entries :  6050086 : Total  Size=  802795979 bytes  File Size  =  703887339 *
*Baskets :    26094 : Basket Size=      32000 bytes  Compression=   1.14     *
*............................................................................*
*Br   39 :tracks_IP : tracks_IP[nTrack]/F                                    *
*Entries :  6050086 : Total  Size=  802795979 bytes  File Size  =  718673130 *
*Baskets :    26094 : Basket Size=      32000 bytes  Compression=   1.12     *
*............................................................................*
*Br   40 :tracks_IPCHI2 : tracks_IPCHI2[nTrack]/F                            *
*Entries :  6050086 : Total  Size=  802900823 bytes  File Size  =  729673037 *
*Baskets :    26098 : Basket Size=      32000 bytes  Compression=   1.10     *
*............................................................................*
*Br   41 :tracks_eta : tracks_eta[nTrack]/F                                  *
*Entries :  6050086 : Total  Size=  802822077 bytes  File Size  =  670883430 *
*Baskets :    26094 : Basket Size=      32000 bytes  Compression=   1.20     *
*............................................................................*
*Br   42 :tracks_phi : tracks_phi[nTrack]/F                                  *
*Entries :  6050086 : Total  Size=  802822077 bytes  File Size  =  720329161 *
*Baskets :    26094 : Basket Size=      32000 bytes  Compression=   1.11     *
*............................................................................*
*Br   43 :tracks_charge : tracks_charge[nTrack]/F                            *
*Entries :  6050086 : Total  Size=  802900823 bytes  File Size  =  112326444 *
*Baskets :    26098 : Basket Size=      32000 bytes  Compression=   7.14     *
*............................................................................*
*Br   44 :tracks_isMuon : tracks_isMuon[nTrack]/F                            *
*Entries :  6050086 : Total  Size=  802900823 bytes  File Size  =   55801458 *
*Baskets :    26098 : Basket Size=      32000 bytes  Compression=  14.38     *
*............................................................................*

And here the tree structure in one of the snapshot files:

root [3] auto f2 = TFile::Open("dy_tuple_12_md_signal.root");
root [4] auto t2 = (TTree*) f2->Get("Drell-Yan");
root [5] t2->Print()
******************************************************************************
*Tree    :Drell-Yan : Drell-Yan                                              *
*Entries :  2950024 : Total =       554821580 bytes  File  Size =  374513586 *
*        :          : Tree compression factor =   1.48                       *
******************************************************************************
*Br   35 :nTrack    : nTrack/I                                               *
*Entries :  2950024 : Total  Size=   11804510 bytes  File Size  =    3879113 *
*Baskets :       42 : Basket Size=    2874880 bytes  Compression=   3.04     *
*............................................................................*
*Br   36 :tracks_PX : tracks_PX/F                                            *
*Entries :  2950024 : Total  Size=   11804648 bytes  File Size  =   10888985 *
*Baskets :       42 : Basket Size=    2875392 bytes  Compression=   1.08     *
*............................................................................*
*Br   37 :tracks_PY : tracks_PY/F                                            *
*Entries :  2950024 : Total  Size=   11804648 bytes  File Size  =   10922629 *
*Baskets :       42 : Basket Size=    2875392 bytes  Compression=   1.08     *
*............................................................................*
*Br   38 :tracks_PZ : tracks_PZ/F                                            *
*Entries :  2950024 : Total  Size=   11804648 bytes  File Size  =   10555967 *
*Baskets :       42 : Basket Size=    2875392 bytes  Compression=   1.12     *
*............................................................................*
*Br   39 :tracks_IP : tracks_IP/F                                            *
*Entries :  2950024 : Total  Size=   11804648 bytes  File Size  =   10696010 *
*Baskets :       42 : Basket Size=    2875392 bytes  Compression=   1.10     *
*............................................................................*
*Br   40 :tracks_IPCHI2 : tracks_IPCHI2/F                                    *
*Entries :  2950024 : Total  Size=   11804937 bytes  File Size  =   10859592 *
*Baskets :       43 : Basket Size=    2875392 bytes  Compression=   1.09     *
*............................................................................*
*Br   41 :tracks_eta : tracks_eta/F                                          *
*Entries :  2950024 : Total  Size=   11804694 bytes  File Size  =    9876309 *
*Baskets :       42 : Basket Size=    2875392 bytes  Compression=   1.20     *
*............................................................................*
*Br   42 :tracks_phi : tracks_phi/F                                          *
*Entries :  2950024 : Total  Size=   11804694 bytes  File Size  =   10813645 *
*Baskets :       42 : Basket Size=    2875392 bytes  Compression=   1.09     *
*............................................................................*
*Br   43 :tracks_charge : tracks_charge/F                                    *
*Entries :  2950024 : Total  Size=   11804937 bytes  File Size  =    1313121 *
*Baskets :       43 : Basket Size=    2875392 bytes  Compression=   8.99     *
*............................................................................*
*Br   44 :tracks_isMuon : tracks_isMuon/F                                    *
*Entries :  2950024 : Total  Size=   11804937 bytes  File Size  =     506731 *
*Baskets :       43 : Basket Size=    2875392 bytes  Compression=  23.29     *
*............................................................................*

Is this just not supported yet? If so, a mention of this in the documentation would be nice…
I am currently using ROOT 6.10/02.

Naive question: did you consider using std::vector instead of a float* plus an int for the number of elements? Would that work?

I did not consider that (mostly because I could not anticipate that this would be a problem).

But since it takes about a week to reproduce the base sample, it is not something I want to try out just to see if it is different without at least some expectation of success…

Hi,
Thank you for using TDataFrame. We need users like you to do things we did not think about and sometimes find out that they do not work.

This is a bug. You can follow its progress in the Jira issue I just opened here. I expect this to be fixed for the next 6.10 patch release.


By the way, @mato is correct in saying that using a std::vector<float> instead of a C-style array would work (as in, std::vector is not affected by the bug). You can try it yourself by playing with the reproducer I posted in the Jira issue.

A workaround that does not require you to regenerate your data is to convert your branches to a type that Snapshot understands (i.e. std::vector), using Define and Snapshot.

The following snippet creates a TTree with a branch of variable size (in create_tree) and then converts that branch to a std::vector and saves the converted branch to a file. You can do the conversion at the same time as you do the filtering, so you only loop over the data once and get out the filtered data in std::vector<float> format instead of float*.

I understand this is not ideal if you have hundreds of branches that you want to convert, as the Defines add some runtime overhead and it's not fun to explicitly write out the names of all the branches you want to convert. We'll still fix Snapshot to support C-style arrays.

#include "ROOT/TDataFrame.hxx"
#include "TTree.h"
#include "TFile.h"
#include "ROOT/RArrayView.hxx"
#include <string>
#include <vector>
#include <iostream>

void create_tree() {
   TFile f("f.root", "RECREATE");
   TTree t("t", "t");
   const int maxsize = 10;
   int size;
   int a[maxsize];
   t.Branch("size", &size);
   t.Branch("a", a, "a[size]/I");
   size = 3;
   a[0] = 1; a[1] = 2; a[2] = 3;
   t.Fill();
   size = 2;
   a[0] = 4; a[1] = 5;
   t.Fill();
   t.Write();
   f.Close();
}

void snapshot_tree() {
  // a lambda that reads a branch of type int* and returns its contents as a std::vector<int>
  auto convertToVector = [](std::array_view<int> a) {
    const auto size = a.size();
    std::vector<int> v(size);
    for (auto i = 0u; i < size; ++i)
      v[i] = a[i];
    return v;
  };

  ROOT::Experimental::TDataFrame d("t", "f.root");
  d.Define("v", convertToVector, {"a"}).Snapshot("t", "out.root", "v");
}

int main() {
   create_tree();
   snapshot_tree();

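   // check everything is as expected
   TFile f("out.root");
   TTree *t = nullptr;
   f.GetObject("t", t);
   // GetObject leaves t as nullptr if the tree is missing, so guard before use
   if (!t) {
      std::cerr << "tree \"t\" not found in out.root" << std::endl;
      return 1;
   }
   t->Print();
   t->Scan("v");
   return 0;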
}
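
Since the affected branches in the original tree are float arrays rather than int arrays, the copy step of the lambda above could also be written once as a template. What follows is only a minimal sketch in plain C++ (it compiles without ROOT): toVector and firstEventExample are hypothetical names, not part of ROOT, and how you obtain the pointer and the length inside a Define lambda depends on the array-view type of your ROOT version, so that part is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not ROOT API): copy `size` elements starting at `data`
// into a std::vector<T>, i.e. the same copy that convertToVector does for int.
template <typename T>
std::vector<T> toVector(const T *data, std::size_t size) {
   // a null pointer is only acceptable for an empty array
   assert(data != nullptr || size == 0);
   if (size == 0)
      return {};
   return std::vector<T>(data, data + size);
}

// Illustration only: mimics one event of a tracks_PX[nTrack]/F branch,
// where the buffer is larger than the number of valid elements.
std::vector<float> firstEventExample() {
   const int nTrack = 3;
   const float tracks_PX[10] = {1.5f, -2.0f, 0.25f};
   return toVector(tracks_PX, static_cast<std::size_t>(nTrack));
}
```

The element type is deduced from the pointer, so the same helper covers both the int branch of the reproducer and float branches such as tracks_PX.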

Well, I have nine branches which are affected by this, as you can see in the tree Print.
That would still be an acceptable amount of typing.

However, in your example the new branch has a different name than the original branch. Is it possible to overwrite the name? The rest of the analysis code relies on the fact that the branch names stay the same…

Good to hear that I found another bug in TDataFrame, though.

The answer to my question is no, one cannot overwrite an existing branch:

d2.Define("tracks_PX", convertToFloatVector, {"tracks_PX"}).Snapshot("Drell-Yan", "/userhome/out.root", "v");
Error in <TRint::HandleTermInput()>: std::runtime_error caught: branch "tracks_PX" already present in TTree

Also, note that it would have been very hard for me to save these as std::vector in the beginning, because they come from a gaudipython script, where they are actually either Python array.array('f') or numpy.array objects…

It took some time, but TDF now supports transparent Snapshotting of C-style array branches. The master branch and the upcoming ROOT v6.12 contain the new feature.

Cheers,
Enrico
