Conversion of classical Tree reading to data frames

dudouet · June 21, 2022, 9:52am

Hy,

I am very sorry, because this is not the first time I request help for data frames, but I really would like to be able to definitively adopt data frames, and this is really not straight forward to me.

Please find here a really simple example of few lines reading a TChain with a very basic way. The only ways I have found to write it in data frames are taking 10 times more lines, so I guess I am not doing what I should. Can you please help to convert these few line to have the same result with data frames ?

    TChain *fchain = new TChain("TreeMaster");
    fchain->Add("Out/Analysis/Tree_0000.root");
    int maxHits=36;
    Int_t nbHits;
    Float_t hitX[maxHits];
    Float_t hitY[maxHits];
    Float_t hitZ[maxHits];
    Int_t hitSg[maxHits];
    Int_t hitId[maxHits];

    fchain->SetBranchAddress("nbHits",&nbHits);
    fchain->SetBranchAddress("hitX",hitX);
    fchain->SetBranchAddress("hitY",hitY);
    fchain->SetBranchAddress("hitZ",hitZ);
    fchain->SetBranchAddress("hitSg",hitSg);
    fchain->SetBranchAddress("hitId",hitId);

    map<int, array<TH2F*,6>> fHistos;

    for(Long64_t entry=0 ; entry<fchain->GetEntries() ; entry++) {
        fchain->GetEntry(entry);
        for(int mul=0 ; mul<nbHits ; mul++) {
            Int_t Det = hitId[mul];
            Int_t Seg = hitSg[mul];
            Int_t Slice = Seg%6;
            if(!fHistos.count(Det)) {
                for(int i=0 ; i<6 ; i++) fHistos[Det].at(i) = new TH2F(Form("Det%d_Slice%d",Det,i),Form("Det%d_Slice%d",Det,i),100,-50,50,100,-50,50);
            }
            fHistos[Det].at(Slice)->Fill(hitX[mul],hitY[mul]);
        }
    }

Thanks in advance for your help

Jérémie

ROOT Version: 6.26/04
Platform: MacOS

eguiraud · June 21, 2022, 11:07am

Hello Jérémie,

I see where the trouble comes from, this task is not a great fit for RDF due to the dynamic, on-demand creation of histograms as you read the data.

One can do the same thing in RDF (in roughly the same amount of lines) with a custom action. It’s not much simpler code, however, and the SetBranchAddress boilerplate is exchanged with RActionImpl boilerplate. In exchange you get the option to make the custom action thread-safe so you can run the analysis on multiple threads with few code changes, and if the analysis gets more complex RDF lets you split it in small self-contained building blocks that you can put together in the main function.

The main looks like this:

int main() {
  ROOT::RDataFrame df("TreeMaster", "Out/Analysis/Tree_0000.root");
  MyAction action;
  auto histos = df.Book<int, RVecF, RVecF, RVecF, RVecF>(action, action.fInputCols);
}

while the implementation of MyAction is (I have checked that it compiles, I could not test that it runs correctly as I’m missing the appropriate input data):

using ROOT::RVecF, ROOT::RVecI;
// simple non-thread-safe implementation
struct MyAction : public ROOT::Detail::RDF::RActionImpl<MyAction> {
  using Result_t = std::map<int, std::array<TH2F *, 6>>;
  std::shared_ptr<Result_t> fHistos{new Result_t{}};
  std::vector<std::string> fInputCols = {"nbHits", "hitId", "hitSg", "hitX", "hitY"};

  auto GetResultPtr() { return fHistos; }

  void Exec(unsigned int slot, int nbHits, RVecI &hitId, RVecI &hitSg,
            RVecF &hitX, RVecF &hitY) {
    for (int mul = 0; mul < nbHits; mul++) {
      int Det = hitId[mul];
      int Seg = hitSg[mul];
      int Slice = Seg % 6;
      if (!fHistos->count(Det)) {
        for (int i = 0; i < 6; i++)
          fHistos->at(Det).at(i) = new TH2F(Form("Det%d_Slice%d", Det, i),
                                            Form("Det%d_Slice%d", Det, i), 100,
                                            -50, 50, 100, -50, 50);
      }
      fHistos->at(Det).at(Slice)->Fill(hitX[mul], hitY[mul]);
    }
  }

  // other required methods for custom actions, unused here
  void Initialize() {}
  void InitTask(TTreeReader *, unsigned int slot) {}
  void Finalize() {}
  std::string GetActionName() { return "MyAction"; }
};

To make that work in a multi-thread run you can have one fHistos per thread (RDF will use df.GetNSlots() threads) and use the slot argument passed to Exec to fill a different map of histograms in each thread. Then you can merge them all in the main fHistos variable (the one that you return from GetResultPtr in a Finalize method.

I would be curious to hear whether there is something we could do at the level of RDF to simplify an use case such as this. Creating histograms during the event loop is of course at odds with RDF’s philosophy of “book all desired operations first, then launch a multi-thread event loop that fills all booked results”.

I hope this helps!
Enrico

dudouet · June 21, 2022, 11:36am

Hi Enrico,

Thanks for your help. Indeed this way to create the histograms in the reading is not really compatible with dataframes. But my question is still valid if I suppose that I already have created my histograms before. for example, how to quickly write something like this:

    map<int, array<TH2F*,6>> fHistos;
    for(int i=0 ; i<2 ; i++)
        for(int j=0 ; j<6 ; j++)
            fHistos[i].at(j) = new TH2F(Form("Det%d_Slice%d",i,j),Form("Det%d_Slice%d",i,j),100,-50,50,100,-50,50);

    for(Long64_t entry=0 ; entry<fchain->GetEntries() ; entry++) {
        fchain->GetEntry(entry);
        for(int mul=0 ; mul<nbHits ; mul++) {
            Int_t Det = hitId[mul];
            Int_t Seg = hitSg[mul];
            Int_t Slice = Seg%6;
            fHistos[Det].at(Slice)->Fill(hitX[mul],hitY[mul]);
        }
    }

eguiraud · June 22, 2022, 11:38am

Hi Jérémie,

your questions are all very valid!

If you know in what range the Det values fall (and the values are not too sparse), you can just fill a 4D THn with HistoND instead of using the map of arrays of 2D histograms:

THnDModel model("h", "h",
                /*nbins*/{3, 6, 100, 100},
                /*xmins*/{0., 0., -50., -50.},
                /*xmaxs*/{3., 6., 50., 50.});

RResultPtr<THnD> hists = df.HistoND<RVecI, RVecI, RVecF, RVecF>(model,
                                                                {"hitId", "hitSg", "hitX", "hitY"});

Otherwise, in order to fill the map, you can write a wrapper class that has a Fill method and fill that.
I haven’t tested the following code as I don’t have appropriate input date, but this should give you an idea. The filler class, which is simpler than the custom class in my previous answer and should work with multi-threading out of the box as long as the Merge function is implemented correctly:

class MyHistFiller {
  using Result_t = std::map<int, std::array<TH2F *, 6>>;
  Result_t fHistos;

  void InitHistos() {
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 6; j++)
        fHistos[i].at(j) =
            new TH2F(Form("Det%d_Slice%d", i, j), Form("Det%d_Slice%d", i, j),
                     100, -50, 50, 100, -50, 50);
  }

public:
  const std::vector<std::string> fInputCols = {"nbHits", "hitId", "hitSg",
                                               "hitX", "hitY"};

  MyHistFiller() { InitHistos(); }
  // we want the copy to have its own histos
  MyHistFiller(const MyHistFiller &other) { InitHistos(); }

  void Fill(int nbHits, RVecI &hitId, RVecI &hitSg, RVecF &hitX, RVecF &hitY) {
    for (int mul = 0; mul < nbHits; mul++) {
      Int_t Det = hitId[mul];
      Int_t Seg = hitSg[mul];
      Int_t Slice = Seg % 6;
      fHistos[Det].at(Slice)->Fill(hitX[mul], hitY[mul]);
    }
  }

  void Merge(const std::vector<Result_t *>&others) {
      // merge the map of histograms in `others` into `fHistos`
  }

  auto &GetHistMap() { return fHistos; }
};

and then the main is just:

ROOT::RDataFrame df("TreeMaster", "Out/Analysis/Tree_0000.root");
MyHistFiller filler;
ROOT::RDF::RResultPtr<MyHistFiller> histos = df.Fill<int, RVecF, RVecF, RVecF, RVecF>(filler, filler.fInputCols);
auto &histMap = histos->GetHistMap();

Cheers,
Enrico

dudouet · June 22, 2022, 11:59am

Thanks a lot for your help Enrico.

For what I see, even if I recognise the power in calculation time of dataframes, it seems still much less trivial to make complex analysis using dataframes than using the “old style way”.

The nexts steps are to create hundreds of histograms, based on multi loops on arrays with different multiplicities. This kind of way to fill histograms will rapidly give very complicated codes. Perhaps this is just the time to reformat my brain in ROOT data analysis

Thanks again

Jérémie

eguiraud · June 22, 2022, 12:09pm

Hi,

yes I agree that it’s quite a change and that these tasks do not map to one simple RDF call (I’ll see whether we can improve the situation for the future ).

About this I’m not completely convinced, because even in this case the analysis itself ends up being split in small focused functions or classes and the main logic composes well, hides all details of the I/O, is amenable to parallelization, etc.:

ROOT::RDataFrame df("TreeMaster", "Out/Analysis/Tree_0000.root");
MyHistFiller filler;
auto histos = df.Filter(...)
                .Filter(...)
                .Fill<int, RVecF, RVecF, RVecF, RVecF>(filler, filler.fInputCols);

auto other_histo = df.Filter(...).Define(...).Histo1D(...);
// etc.

But of course I recognize I am biased

Cheers,
Enrico

dudouet · June 22, 2022, 12:33pm

Is it planned, or is it already existing some ROOT schools to teach the new ROOT features like dataframes ?

Actually I am responsible of the data analysis part of the AGATA collaboration and I am wondering of making our softwares evolving to dataframes, but this is a lot of work… and a lot to learn

nmangane · June 23, 2022, 8:28am

In moving my analysis (HEP collider physics, CMS experiment, where we store data in TTrees such that each entry represents a proton-proton collision event and can have a variable number of particles per-event, subdivided into collections) to RDataFrame, I often had to take a step back and ask myself the question “What do I want to accomplish” because sometimes the old paradigm was hard to implement, but the old paradigm was a different route. In other words, if your goal is to get to the top of the mountain, and suddenly you have a helicopter, you should not be asking “how do I get this helicopter to roll up the mountain like my old scooter” and instead “where do I learn to fly this thing.”

Maybe it would be more helpful to lay out the high-level goal you have, what guaranteed structure your data has, etc.

system · July 7, 2022, 8:29am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.