How can I use this (TDataFrame) method to quickly fill a histogram repeatedly in a loop?

Continuing the discussion from Speed up TTree Draw:

I need to fill up a single histogram h inside a loop repeatedly, every N minutes, like:

for (t = t1; t < t2; t=t+N)
{
TH1F *h = new TH1F("h","",nbin,min,max);
Tree->Draw("var1>>h","t>t-N/2 && t < t+N/2");
//do some stuff with h then
delete h;
}

The problem is that I have a very big file spanning days of data, and each loop iteration takes the same (extremely long) amount of time no matter how small the time window (N here) is, as if the whole dataset were scanned on every iteration. This topic seems to address my issue, but it’s not completely clear to me how I can adapt it to my situation.

I would really appreciate any help.


ROOT Version: 6.14/04
Platform: Ubuntu 16.04
Compiler: g++ 5.4.0


Hi,

I think RDataFrame can help here. In the code you show above, what are “t” and “N”? Variables in your tree or the “t” and “N” you use to steer the for loop?

Cheers,
D

Hi,

I decided to rush ahead and cover both cases :slight_smile:.
I assume var1 is of type float.

  1. t and N are variables in the for loop:
ROOT::EnableImplicitMT(); // <- this parallelises the internal loop on your huge dataset
ROOT::RDataFrame rdf("YourTreeName","path/to/your/files/*.root");
for (t = t1; t < t2; t=t+N) {
   auto filtFunc = [t,N](){ return (t > t - .5 * N) && (t < t + .5 * N); };
   auto h = rdf.Filter(filtFunc).Histo1D<float>({"name","title",nbin, min, max}, "var1");
   //do some stuff with h then
}
  2. t and N are variables in the TTree (I assume they are integers):
ROOT::EnableImplicitMT(); // <- this parallelises the internal loop on your huge dataset
ROOT::RDataFrame rdf("YourTreeName","path/to/your/files/*.root");
for (t = t1; t < t2; t=t+N) {
   auto filtFunc = [](int t, int N){ return (t > t - .5 * N) && (t < t + .5 * N); };
   auto h = rdf.Filter(filtFunc, {"t", "N"})
                     .Histo1D<float>({"name","title",nbin, min, max}, "var1");
   //do some stuff with h then
}

Cheers,
D

Hi Danilo,

Sorry for being incomplete; the reality is actually a mix: t is a tree variable and N is a loop variable, both doubles. But I think I can try a mix of both cases on my problem, using what you’ve already quite generously provided :), and get back to you tomorrow. Thanks a ton!

Hi,

the hybrid would be as follows (I factor out the type of var1 as an alias; set it to the right one):

ROOT::EnableImplicitMT(); // <- this parallelises the internal loop on your huge dataset
ROOT::RDataFrame rdf("YourTreeName","path/to/your/files/*.root");
using Var1_t = double;
for (t = t1; t < t2; t=t+N) {
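   const double tc = t; // window center, under a distinct name so that the column argument "t" below does not shadow it
   auto filtFunc = [tc, N](double t){ return (t > tc - .5 * N) && (t < tc + .5 * N); };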
   auto h = rdf.Filter(filtFunc, {"t"}).Histo1D<Var1_t>({"name","title",nbin, min, max}, "var1");
   //do some stuff with h then
}
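
One thing to watch in a filter like this: the per-entry column value and the loop variable (the window center) need distinct names. If both are called t, the entry ends up being compared with itself and every entry passes. Below is a minimal plain-C++ sketch of the idea, without ROOT; WindowFilter and selfCompare are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical functor (not part of ROOT): accepts a per-entry time t that
// lies strictly inside (center - width/2, center + width/2).
struct WindowFilter {
   double center; // window center, i.e. the loop variable
   double width;  // full window width, i.e. N

   bool operator()(double t) const
   {
      assert(width > 0.);
      return (t > center - .5 * width) && (t < center + .5 * width);
   }
};

// The pitfall: when the entry is compared with itself, the condition holds
// for any finite t as long as width > 0, so nothing is filtered out.
bool selfCompare(double t, double width)
{
   return (t > t - .5 * width) && (t < t + .5 * width);
}
```

Since Filter accepts functor classes as well as lambdas, an object like WindowFilter{t, N} could presumably be passed in place of filtFunc, together with the column list {"t"}.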

Let us know how it goes, especially the time needed with and without the ROOT::EnableImplicitMT() line.
Cheers,
D

Hi Danilo,

thank you so much for your detailed reply. It should have been enough for me to solve the problem, but the situation is a bit more complicated than I anticipated. What I really have is not a single TTree variable but a combination of two leaves of one branch: the branch is named pid and the leaves are day and sec. So what I need is a constraint on pid.day+pid.sec/86400. to be within the time window from t - N/2 to t + N/2, and the leaf I am filling the histogram with is pid.var1. Can your:

auto filtFunc = [N](double t){ return (t > t - .5 * N) && (t < t + .5 * N);}

be changed to:

auto filtFunc = [N](int day, int sec){ return (pid.day+pid.sec/86400. > t - .5 * N) && (pid.day+pid.sec/86400. < t + .5 * N);}

Something tells me this won’t work: in the parameter list (int day, int sec), nothing says that these are leaves of the branch pid. But if that somehow gets fixed, then:

auto h = rdf.Filter(filtFunc, {"t"}).Histo1D<Var1_t>({"name","title",nbin, min, max}, "var1");

changed to:

auto h = rdf.Filter(filtFunc, {"pid.day+pid.sec/86400."}).Histo1D<Var1_t>({"name","title",nbin, min, max}, "pid.var1");

Will this work?

An additional point: I am actually going to fill two TH1Fs, h1 and h2, and add them to get a third TH1F, let’s say H (this I can easily achieve by first defining H with the same binning and range and then successively adding h1 and h2 to it). h1 and h2 will be filled with the leaves var1 and var2 from two different branches, pid1 and pid2, with constraints on two different quantities: pid1.day+pid1.sec/86400. and pid2.day+pid2.sec/86400.

Please help me with that filtFunc step.

-a

Hi,

did you try to write a filtFunc which takes as input a pid object of the type stored in the pid column of your tree?

Cheers,
D

What I tried so far:

auto filtFunc = [t, N](pid1 day, pid1 sec){return ( pid1.day + pid1.sec/86400.0 > t - .5 * N) && ( pid1.day + pid1.sec/86400.0 < t + .5 * N);};

it complained about not recognizing pid1 and pid2, so I changed ROOT::RDataFrame rdf("tree",file); to ROOT::RDataFrame rdf("tree",file, {"pid1","pid2"});

That didn’t work, so I changed auto filtFunc = [t, N](pid1 day, pid1 sec){return ( pid1.day + pid1.sec/86400.0 > t - .5 * N) && ( pid1.day + pid1.sec/86400.0 < t + .5 * N);};

to

auto filtFunc = [t, N](pid1 day, pid1 sec){return ( day + sec/86400.0 > t - .5 * N) && ( day + sec/86400.0 < t + .5 * N);};

It still complains about pid1 and pid2 not being declared.

Hi,

did you set those names (pid1 and pid2) as the names of the columns?
D

Hi,

structure of my root file looks like this:

*********************************************************
*Tree    :tree     : tree                               *
*Br    0 :pid1      : var1/I:var2/I:day/I:sec/I         *
*Br    1 :pid2      : var1/I:var2/I:day/I:sec/I         *
*********************************************************

This is my code:

#include <ROOT/TDataFrame.hxx>
#include <TROOT.h>
#include <TH1.h>
#include <TF1.h>
#include <TFile.h>
#include <TTree.h>
#include <TBranch.h>
#include <TLeaf.h>

using namespace ROOT::Experimental; 

int main(int argc, char *argv[])
{

  ROOT::EnableImplicitMT();
  TFile *file = TFile::Open("/path/to/the/rootfile.root");
  file->GetObject("tree", tree);
  if (!tree) return 1;

 TLeaf *pid1_day, *pid1_sec, *pid1_var1,  *pid1_var2;
 TLeaf *pid2_day, *pid2_sec, *pid2_var1,  *pid2_var2;

  //I TRIED UNCOMMENTING THESE TOO, DIDN'T HELP
  //file->GetObject("pid1.var1", pid1_var1);
  //file->GetObject("pid1.var2", pid1_var2);
  //file->GetObject("pid1.day", pid1_day);
  //file->GetObject("pid1.sec", pid1_sec);
  //file->GetObject("pid2.var1", pid2_var1);
  //file->GetObject("pid2.var2", pid2_var2);
  //file->GetObject("pid2.day", pid2_day);
  //file->GetObject("pid2.sec", pid2_sec);
 


 double t;
 double t1 = 1.; //days
 double t2 = 10.;  //days
 int tBin = (t2-t1)*24*30; //every half an hour      
 double dt = 0.007; //days


 for(int i = 0; i < tBin; i++)
 {
        t = t1 + i*dt; 

        TH1F *H = new TH1F("H","H",300, 0, max);
        
        auto filtFunc1 = [t, dt](pid1 day, pid1 sec){
            return (day + sec/86400.0 > t - .5 * dt) && (day + sec/86400.0 < t + .5 * dt);
        };
        
        auto filtFunc2 = [t, dt](pid2 day, pid2 sec){
            return (day + sec/86400.0 > t - .5 * dt) && (day + sec/86400.0 < t + .5 * dt);
        };
        
        auto h1 = rdf.Filter(filtFunc1, {"pid1.day+pid1.sec/86400.0"}).Histo1D<var1>({"h1","h1",300, 0, max}, "16000-pid1.var1");
        auto h2 = rdf.Filter(filtFunc2, {"pid2.day+pid2.sec/86400.0"}).Histo1D<var2>({"h2","h2",300, 0, max}, "16000-pid2.var2");


        H->Add(h1); 
        H->Add(h2); 

        TF1 *gaus1 = new TF1("gaus1","gaus", 0, max);
        H->Fit(gaus,"Q");
        H->GetParameter(1);

        delete H;
        delete gaus1;

       
  }



return 0;
}

Hi,

you then need to read leaves individually. In your latest code the data frame is not defined.
I propose you solve the simple case first and then go for the full program. For example, read pid.var1, pid.day and pid.sec and print them.
Can you make the code below work?

ROOT::RDataFrame rdf("YourTreeName","path/to/your/files/*.root");
rdf.Foreach([](int v, int d, int s){ std::cout << v << " " << d << " " << s << std::endl; }, {"pid.var1", "pid.day", "pid.sec"});

Cheers,
D

Hi,

I am getting the following errors:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Unknown columns: pid1.var1,pid1.day,pid1.sec
Aborted (core dumped)

Hi,

could you share the file so that I can debug locally?

Cheers,
D

Hi, sorry, I used the wrong file! Now it’s working :grinning: I can read the leaves.

Ok perfect. Now add piece by piece the final program you pasted above and it should work :slight_smile:

D
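
Once the leaves can be read individually, the time-window condition on day + sec/86400. can live inside a callable that takes the two leaf values as arguments. Here is a minimal plain-C++ sketch of such a predicate, without ROOT; the name DaySecWindow is hypothetical, and it assumes day and sec are ints, as the tree printout in this thread shows.

```cpp
#include <cassert>

// Hypothetical functor (the name is mine): keeps entries whose time, expressed
// in fractional days, lies strictly inside (center - width/2, center + width/2).
struct DaySecWindow {
   double center; // window center, in days
   double width;  // full window width, in days

   // day and sec are the per-entry leaf values, e.g. pid1.day and pid1.sec
   bool operator()(int day, int sec) const
   {
      assert(width > 0.);
      const double time = day + sec / 86400.; // 86400 seconds per day
      return (time > center - .5 * width) && (time < center + .5 * width);
   }
};
```

With RDataFrame, a callable like this would be passed to Filter together with the two leaf names, along the lines of rdf.Filter(DaySecWindow{t, dt}, {"pid1.day", "pid1.sec"}). The column list names the inputs of the callable, so the arithmetic goes inside the callable and not inside the column list.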

Hi,

since

rdf.Filter(Form("pid1.var1>%u",0)).Foreach([](int v, int d, int s){std::cout << v << "\t " << d << "\t "<< s << "\n";}, {"pid1.var1", "pid1.day", "pid1.sec"});

also worked for me (direct filter), I thought I could use this to fill directly, skipping the separate filter function definition, as follows:

auto h1 = rdf.Filter(Form("pid1.day + pid1.sec/86400.0 < %g && pid1.day + pid1.sec/86400.0 > %g", t + dt * 0.5, t - dt * 0.5)).Histo1D<var1>({"h1","h1",300, 0, max}, "pid1.var1");

This gives me the following error:

no matching function for call to ‘ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter, void>::Histo1D(<brace-enclosed initializer list>, const char [11])’

My guess is that .Histo1D<var1> is causing this. I tried changing it to .Histo1D<pid1.var1>, .Histo1D<"var1"> and .Histo1D<"pid1.var1">; none of them worked. Declaring var1 earlier as an int didn’t work either.

What am I missing :thinking:
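
A side note on building the cut string with %g, separate from the compile error: %g prints only 6 significant digits by default, so window edges such as 123.4565 days would be rounded before the cut is evaluated. Here is a minimal sketch, assuming a plain snprintf is acceptable in place of Form; the helper name windowCut is hypothetical. It writes the edges with %.17g, which is enough digits to reproduce a double exactly.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds "expr > lo && expr < hi" for a window of the
// given full width around center, printing the edges with 17 significant
// digits so that no precision is lost in the text form.
std::string windowCut(const std::string &expr, double center, double width)
{
   assert(width > 0.);
   char buf[512];
   std::snprintf(buf, sizeof(buf), "%s > %.17g && %s < %.17g",
                 expr.c_str(), center - .5 * width,
                 expr.c_str(), center + .5 * width);
   return std::string(buf);
}
```

The resulting string could then be handed to Filter in the same way as the Form(...) result above, e.g. with expr set to "pid1.day + pid1.sec/86400.0".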

Hi,

can you try to rewrite the original program?

D

Thanks a lot.

I am trying that now, but before that I tried declaring using var1_t = Int_t; and then filling with Histo1D<var1_t>, and this seems to be working! :grinning:

But one little problem, I am unable to add this histogram to another. The error is:

no matching function for call to ‘TH1F::Add(ROOT::RDF::RResultPtr<TH1D>&)’

Once I succeed in adding these auto histograms h1 and h2, I will see whether the TF1 fit works on them, and that will be all.

Finally, cloning has solved my Add problem!

TH1D *h1_1 = (TH1D*)h1.GetValue().Clone("h1_1");
TH1D *h2_1 = (TH1D*)h2.GetValue().Clone("h2_1");

Now they can be added.

The next challenge is to see whether this method really is faster or not! So far, on a small file, I haven’t seen magical results. I will try both methods and get back to you soon.

Hi,

I was away so couldn’t report earlier. There are two things I want to report:

  1. I still see a linear relationship between the processing time and the size of the data (in time span), for the same 10-minute time interval in the Filter condition: a 375 MB file took about 4.6 seconds and a 2901 MB file took 31.75 seconds (~90 MB/s). On my computer this is indeed significantly faster than the usual Tree->Draw() method (~65 MB/s), roughly 1.4 times as fast going by these rates. This is great news, but I need a faster solution: my real file will be ~40 GB, so at ~90 MB/s every loop iteration (one 10-minute window) will take about 7 minutes, and scanning the whole timeline will take forever. An entry-by-entry method (Fill(x)), with the loop time properly aligned to the time column in the ROOT file, now seems like the best way to proceed, so that I don’t need to read the whole file in every loop.

  2. This method works for the first 2 loop iterations, after which the histograms are no longer filled; everything seems empty. Tree->Draw(), on the other hand, shows no such issue.
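
On point 1, the entry-by-entry idea amounts to computing the window index from each entry's time and filling the matching histogram, so the data is read once for all windows instead of once per window. Below is a minimal plain-C++ sketch under stated assumptions: an in-memory vector stands in for the tree, Hist stands in for TH1, and the names Entry, Hist and fillAllWindows are hypothetical. The windows here are contiguous and start at t1; shifting t1 by half a window gives windows centered on the loop times.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One entry of a hypothetical in-memory dataset.
struct Entry {
   double time;  // in days
   double value; // quantity to histogram
};

// Simple fixed-binning histogram standing in for TH1 (illustration only).
struct Hist {
   double lo, hi;
   std::vector<long> counts;
   Hist(int nbins, double l, double h)
      : lo(l), hi(h), counts(static_cast<std::size_t>(nbins), 0L)
   {
      assert(nbins > 0 && h > l);
   }
   void Fill(double x)
   {
      if (!(x >= lo && x < hi)) return; // drop under/overflow and NaN
      const int n = static_cast<int>(counts.size());
      // Clamp to guard against rounding right at the upper edge.
      const int bin = std::min(n - 1, static_cast<int>((x - lo) / (hi - lo) * n));
      ++counts[bin];
   }
};

// Fills one histogram per window [t1 + i*dt, t1 + (i+1)*dt) in a single pass.
std::vector<Hist> fillAllWindows(const std::vector<Entry> &data, double t1, double t2,
                                 double dt, int nbins, double lo, double hi)
{
   assert(dt > 0. && t2 > t1);
   const int nWindows = static_cast<int>(std::ceil((t2 - t1) / dt));
   std::vector<Hist> hists(static_cast<std::size_t>(nWindows), Hist(nbins, lo, hi));
   for (const Entry &e : data) {
      if (!(e.time >= t1 && e.time < t2)) continue; // outside the scanned range
      // Window index straight from the timestamp, clamped against rounding.
      const int w = std::min(nWindows - 1, static_cast<int>((e.time - t1) / dt));
      hists[w].Fill(e.value);
   }
   return hists;
}
```

The cost is then one pass over the data regardless of the number of windows. In RDataFrame terms, the analogous approach would presumably be to book all the Filter(...).Histo1D(...) results first (for example in a std::vector) and only then access them, so that a single event loop fills them all; that is a suggestion to verify, not a tested recipe.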

I would really appreciate it if you could help me understand point 2 for now.

Thanks
-q