Getting value from TBranch is extremely slow

@Wile_E_Coyote @pcanal

Do you have any solution for the first post of this thread?

I have 105 branches as you will be able to check in the ROOT file which I have shared earlier as well.

I tried RDataFrame as suggested, but somehow I could not get it to work!

Any guidance/help is highly appreciated.

Hi @ajaydeo ,
the error

Error in <TFile::TFile>: file /home/ajay/Research/IUAC_Experiments_July2021/DATA/Source/Misc/eu152_11july_afterGlitch.002/RoseNIAS does not exist

is not because of your build, it’s a bug with multi-thread RDF + filenames that don’t end in .root.

This works (no EnableImplicitMT):

#include "TChain.h"
#include "TFile.h"
#include "TH1.h"
#include "TH2.h"
#include "TLeaf.h"
#include "TTree.h"
#include <ROOT/RDataFrame.hxx>

#define UpdateEvents 2000

// using namespace std;

void GetValue(const char *filename = "eu152_11july_afterGlitch.002",
              const char *treename = "RoseNIAS") {
  // ROOT::EnableImplicitMT();

  const char *nmHis[20];
  Int_t NoPara;
  Long64_t nentries;

  TChain *fChain = new TChain(treename);
  fChain->AddFile(filename);
  TObjArray *obj = fChain->GetListOfBranches();
  NoPara =
      obj->GetEntries(); // Number of parameters in Tree and hence in experiment
  nentries = fChain->GetEntriesFast(); // Number of events in the Tree

  ROOT::RDataFrame df(treename, filename);
  std::vector<ROOT::RDF::RResultPtr<TH1D>> histos;
  for (const std::string &col : df.GetColumnNames()) {
    auto h = df.Histo1D<double>(col);
    histos.push_back(h);
  }

  // need a DrawClone because the histograms go out of scope at the end of the
  // function; alternatively, return `histos` from the function
  histos[0]->DrawClone();
  histos[0]->GetMean();
}

I ran it as root -l GetValue.C+ (the + is to compile the code with optimizations rather than running it directly through the interpreter without optimizations).

To run with multi-threading enabled, renaming the file or creating a symlink with the .root extension is currently a workaround, but I’ll merge a fix into the ROOT master branch in the next few days. To keep track of this, I opened an issue: "[DF] Cannot read files that don't have a `.root` extension with IMT on" (Issue #8739, root-project/root on GitHub).

Cheers,
Enrico


@eguiraud

Thank you very much for your time and help.

As suggested, I created a symlink to the original file with the “.root” extension added.
I confirm that this works (only when executed as GetValue.C+ or GetValue.C++ at the ROOT prompt), and I clearly see that the multi-threading works.

However, it gives a blank histogram when e.g. histos[0]->DrawClone(); is executed!

Also, I was expecting that all the histograms are stored in the histos vector at the end of execution of GetValue.C so that I could draw any desired histogram using e.g. histos[37]->DrawClone();

With this, I get the following error:

root [1] histos[37]->DrawClone();
input_line_55:2:3: error: use of undeclared identifier 'histos'
 (histos[37]->DrawClone())
  ^
Error in <HandleInterpreterException>: Error evaluating expression (histos[37]->DrawClone()).
Execution of your code was aborted.
root [2] 

I have tried this with conda as well, but still get the same error! What am I doing wrong?

Regards,

Ajay

Draw might give a blank histogram, DrawClone should not (and it does not when I execute the code I shared above). Also see the comment before the DrawClone call in the last code I shared for more context.

This and the previous problem with Draw have the same cause: histos is a variable local to the GetValue function. At the end of the function, the histos variable is destroyed together with the histograms.

There is a simple solution: you can return the histograms from the function. For example, with the following macro:

#include "TChain.h"
#include "TFile.h"
#include "TH1.h"
#include "TH2.h"
#include "TLeaf.h"
#include "TTree.h"
#include <ROOT/RDataFrame.hxx>

#define UpdateEvents 2000

std::vector<ROOT::RDF::RResultPtr<TH1D>> GetValue(const char *filename = "eu152_11july_afterGlitch.002.root",
              const char *treename = "RoseNIAS") {
  ROOT::EnableImplicitMT();

  const char *nmHis[20];
  Int_t NoPara;
  Long64_t nentries;

  TChain *fChain = new TChain(treename);
  fChain->AddFile(filename);
  TObjArray *obj = fChain->GetListOfBranches();
  NoPara =
      obj->GetEntries(); // Number of parameters in Tree and hence in experiment
  nentries = fChain->GetEntriesFast(); // Number of events in the Tree

  ROOT::RDataFrame df(treename, filename);
  std::vector<ROOT::RDF::RResultPtr<TH1D>> histos;
  for (const std::string &col : df.GetColumnNames()) {
    auto h = df.Histo1D<double>(col);
    histos.push_back(h);
  }

  return histos;
}

this works:

root [0] .L GetValue.C+ // load your macro instead of executing it
Info in <TUnixSystem::ACLiC>: creating shared library /home/blue/Downloads/./GetValue_C.so
root [1] auto histos = GetValue(); // call the function and store the resulting histos
root [2] histos[0]->Draw() // use the histograms as usual -- computation is triggered here
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
root [3]

Cheers,
Enrico


@eguiraud

Thank you! It works as expected and I can retrieve all the histograms through the histos vector.

To move a little further, I added

ROOT::RDF::TH1DModel hmod1D("h", "h", 16384, 0, 16384);

for custom binning just below the histos definition in your code. This works and gives the histogram we normally see from our detectors, BUT, as expected, all the histograms are named “h”. I want each one to carry its actual name, e.g. “CL_01_E01”, with custom binning. For this, I tried:

ROOT::RDF::TH1DModel hmod1D(df.GetColumnNames(), df.GetColumnNames(), 16384, 0, 16384);

But this doesn’t work! I searched through the ROOT Forum but couldn’t find a relevant post that describes this issue.

Also, I want to fill histograms ONLY IF the bin number is greater than say 10, initially for all the histograms and then for the selected histograms e.g. with names similar to “CL_01_E01”, “CL_02_E03” etc. Can this be done easily?

I know there is Filter option but couldn’t find how to implement it using BinNumber.

With best regards,

Ajay

That can’t work, GetColumnNames() returns a vector and the TH1DModel constructor expects a single string. You can do this:

for (const std::string &col : df.GetColumnNames()) {
    ROOT::RDF::TH1DModel hmod1D(col, col, 16384, 0, 16384);
    auto h = df.Histo1D<double>(hmod1D, col);
}

Would changing the binning so that the first 10 bins are excluded work?

Hi @eguiraud,

ROOT::RDF::TH1DModel hmod1D(col, col, 16384, 0, 16384);

The line above does not work! Changing it to:

ROOT::RDF::TH1DModel hmod1D(col.c_str(), col.c_str(), 16384, 0, 16384);

works! But the above statement will be executed again on every loop iteration, right?

About the question regarding binning, I will have to check what effect changing the binning would have on the histograms. In my opinion, it would be great to check the value of the variable (branch), and then fill the histogram only if it is non-zero. This would also help.

Sorry to bother you over this! I am learning RDataFrame and loving it!

Yes, once per column, as you have a different model per column (in order to change name and title). Creating a TH1DModel is not an expensive operation, it doesn’t even allocate anything.

For that, instead of df.Histo1D<double>(col) you can write df.Filter(isPositive, col).Histo1D<double>(col);, where isPositive could be defined above e.g. as

bool isPositive(double x) { return x > 0; }

(feel free to substitute that with any condition you want to apply).

Cheers,
Enrico

Adding the above just before the definition of GetValue, and using it as:

auto h = df.Filter(isPositive, col).Histo1D<double>(hmod1D, col);

or as

auto h = df.Filter(isPositive, col).Histo1D<double>(col);

gives the following error:

root [0] .L GetValue.C+
root [1] auto histos = GetValue();
Error in <TRint::HandleTermInput()>: std::runtime_error caught: 1 column name is required but none were provided and the default list has size 0
root [2] 

df.Filter(isPositive, {col}) ?


Dear @eguiraud,

Works perfect, thank you!

BTW, is there a way to identify/rename histograms e.g. histos[37] to CL_01_E01 etc? So that, CL_01_E01->Draw() can be used instead of histos[37]->Draw().

I am still far from my final goal, but I am sure that I will be able to get there with your help.

The next objective is to put the values of all the branches into an array/vector for every event, and use those for further analysis. In post 6 of this discussion, you suggested using:

auto vectorPtr = df.Take<double>("columnName");

I will work on it, and get back to you if I get stuck.

With best regards,

Ajay
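
On the question of addressing a histogram by branch name rather than by position: one possible approach is to keep a name-to-index lookup next to the vector. The sketch below assumes histos was filled in the same order in which df.GetColumnNames() returned the names, as in the loops above. IndexByName is a hypothetical helper, not part of ROOT.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps each column name to its position in the list,
// which is also the position of its histogram if both were filled in order.
std::unordered_map<std::string, std::size_t>
IndexByName(const std::vector<std::string> &names) {
  std::unordered_map<std::string, std::size_t> index;
  for (std::size_t i = 0; i < names.size(); ++i)
    index.emplace(names[i], i); // on duplicate names, the first position wins
  return index;
}
```

With auto idx = IndexByName(df.GetColumnNames()); one could then write histos[idx.at("CL_01_E01")]->Draw(); note that at() throws std::out_of_range for an unknown name.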

@eguiraud

I am able to successfully process a single file using.

Now I want to process multiple files which begin with “eu152_11july_afterGlitch”. Therefore, instead of above, I used wildcard and wrote:

std::vector<ROOT::RDF::RResultPtr<TH1D>> GetValue(const char *filename = "eu152_11july_afterGlitch*.root", const char *treename = "RoseNIAS")

…and executed:

root [0] .L GetValue.C+
Info in <TUnixSystem::ACLiC>: creating shared library /home/ajay/Research/IUAC_Experiments_July2021/DATA/Source/Misc/./GetValue_C.so
root [1] auto histos = GetValue();
Error in <TFile::TFile>: file /home/ajay/Research/IUAC_Experiments_July2021/DATA/Source/Misc/eu152_11july_afterGlitch_*.root does not exist

In the body of the code, I have:

        TChain *fChain = new TChain(treename);
        fChain->AddFile(filename);

        ROOT::RDataFrame df(treename, filename);

My question is: how do I set up the sort for multiple files using wildcards in a macro argument?

Are you sure the error comes from RDF?

Please just try e.g. at prompt

ROOT::RDataFrame df("RoseNIAS", "/home/ajay/Research/IUAC_Experiments_July2021/DATA/Source/Misc/eu152_11july_afterGlitch_*.root");
df.Count().GetValue()

I am sorry @eguiraud !

The error appears to originate when I do:

TObjArray*obj = fChain->GetListOfBranches();
NoPara = obj->GetEntries();         //Number of Branches in Tree

to get the number of branches in the TTree! After commenting out the above statements,

std::vector<ROOT::RDF::RResultPtr<TH1D>> GetValue(const char *filename = "eu152_11july_afterGlitch_*0*.root", const char *treename = "RoseNIAS")

works as expected. Is there a simple way to obtain the number of branches in the TTree under consideration using RDataFrame?

Hi @eguiraud

In the attached code (if I am doing it right), the vectorPtr contains values of all the branches for every event. I need access to vectorPtr for every event so that further analysis can be built. I have three questions:

  1. How to access vectorPtr for every event?
  2. How to write histograms contained in the histos vector to the O/P file?
  3. There are 30308059 events when I process 3 data files. Why does eventCount.OnPartialResult stop after only 1180000 events?

Your help is highly appreciated.

GetValue.C (4.6 KB)

Hi @ajaydeo ,
sorry for the late reply, I was off last week.

Regarding the number of branches: it depends on the TTree structure. If it is simple, with only top-level branches, you can use df.GetColumnNames().size().

Regarding question 1: vectorPtr is a RResultPtr<vector<double>>, and you can get a vector<double> out of it e.g. with auto &vector = vectorPtr.GetValue(). Note, however, that this starts the event loop, so you want to first book the production of all vectorPtrs and only then access their contents, so that they are all produced in the same event loop.

Regarding question 2: for example with fop->WriteTObject(histos[i].GetPtr()).

Regarding question 3: because it is only executed by one of the N (4?) threads; see the docs, at the bottom.

Cheers,
Enrico

No problem @eguiraud !

In the meantime, I also completed the analysis code without RDataFrame. I will get back here, or in another post when I make some progress with the inputs that you have given so far.

Thank you once again for all your help!

Regards,

Ajay
