Trying to handle RDataFrames

dudouet · April 9, 2021, 3:01pm

rootfilte_test.root (427.9 KB) Dear all,

Since many years I am using TSelectors and Proof, but it seems that the future of ROOT analysis is going toward the use of RDataFrames, so I am trying to adapt myself to this new tool (new for me). But I am facing some difficulties. I have looked on the forum and on the ROOT website but I am still not able to do the same kind of analysis that the one I usually do using TSelectors. Can you help me ?

The fact is I have succeeded to analyse some trees and make some plots in very easy cases (drawing directly the branches of the tree, or making some selections on branches that are not arrays). But when I want to make more complex operations between branches, and using conditions on only some events of an array, I don’t see how to do.

I give you a very simple test case that I would like to be able to reproduce (and the associated tree). Could you help me to translate this simple analysis in the rdataframe language, to help me to understand how I am supposed to do for more complex stuffs.

Thanks in advance
MySelector.C (3.7 KB) MySelector.h (5.1 KB) rootfilte_test.root (427.9 KB)

eguiraud · April 9, 2021, 4:36pm

Hello @dudouet ,
the idea in RDataFrame is that for each event, you take in arrays of track properties and make an array of “cosThetas” out of the “good tracks”. Then you just fill a histogram with all the good “cosThetas”.

I think it would look something like the following (have not tested it, but it should convey the idea).
The bulk of the analysis:

ROOT::RDataFrame df("treename", "rootfilte_test.root");
auto h = df.Filter("nbTrack > 0")
           .Define("goodTrackMask", SelectGoodTracks, {"trackE", "trackType"})
           .Define("cosThetas", CosThetas,
                   {"trackX1", "trackY1", "trackZ1",
                    "trackX2", "trackY2", "trackZ2",
                    "goodTrackMask"})
           .Histo1D("cosThetas");

SelectGoodTracks and CosThetas could be defined like this:

using RVecI = ROOT::RVec<int>;
using RVecF = ROOT::RVec<float>;

// return a mask {0, 1, 1, 0, ...} where 1 means "this is a good track"
RVecI SelectGoodTracks(const RVecF &trackE, const RVecI &trackType) {
  return abs(trackE - 600) < 5 && trackType != 2;
}

// the bulk of your analysis logic: take in the tracks, return their cosThetas 
RVecF CosThetas(const RVecF &trackX1, const RVecF &trackY1,
                const RVecF &trackZ1, const RVecF &trackX2,
                const RVecF &trackY2, const RVecF &trackZ2, const RVecI &mask) {

  const auto goodTrackIdxs = VecOps::NonZero(mask);
  RVecF cosThetas;
  for (auto i : goodTrackIdxs) {
    TVector3 Int1(trackX1[i], trackY1[i], trackZ1[i]);
    TVector3 Int2(trackX2[i], trackY2[i], trackZ2[i]);
    cosThetas.push_back(CalcCosTheta(0, 0, 0, Int1, Int2));
  }

  return cosThetas;
}

And this is really all the code you need (plus your definition of CalcCosTheta, that could also be inlined into the body of CosThetas). You can add ROOT::EnableImplicitMT() at the top to run the analysis in parallel on all your processors.

Feel free to ask further clarifications. The docs for RVec are here.

Cheers,
Enrico

EDIT: the return type of CosThetas was float instead of RVec<float>

dudouet · April 11, 2021, 9:08am

Hi,

thank you for the help ! I will try to do my real analysis using this I will come back if I have other questions.

Jérémie

dudouet · April 12, 2021, 7:12am

Hi,

me again. Just to be sure, now if I want to draw other quantities using the same condition, what is the most efficient way ? I have tried for example to plot the tracked energy, using this command line:

auto select_good_tracks = [](const RVecF &trackE, const RVecI &trackType) {
 return abs(trackE - 661) < 5 && trackType != 2;
};

auto good_tracks = fdata_frame.Filter("nbTrack > 0").Define("good_track_mask", select_good_tracks, {"trackE", "trackType"});

htrackE_ok = good_tracks.Define("trackE_ok","trackE[good_track_mask]").Histo1D({"tracked_gated_energy","tracked_gated_energy;energy (keV);counts",2000,0,2000},"trackE_ok");

Is is the good way to proceed, or should I do using another method ?

Jérémie

eguiraud · April 12, 2021, 7:26am

That’s great!

For best performance (might or might not make a difference in your application, and the tradeoff in convenience might or might not be worth it):

prefer C++ lambdas/functions to strings in Filter and Define calls: e.g. Filter([] (int nbTracks) { return nbTracks > 0; }, {"nbTracks"}) rather than Filter("nbTracks > 0")
specify column types as template arguments in actions: e.g. Histo1D<RVecF>("pts") instead of Histo1D("pts")
compile your code with optimizations rather than interpreting it or running it in a notebook: e.g. .x macro.C+ or g++ -O2 -o analysis analysis.cpp $(root-config --libs --cflags) or equivalent rather than root macro.C

Cheers,
Enrico

dudouet · April 12, 2021, 7:31am

Thanks a lot. I got it. This topic is more to help me to understand how to use correctly these new tools, not to focus on these dummy cases, so I appreciate your help for optimization. My program is actually already a compiled program using cmake and these lines are inside a c++ class. I see that there is a lot of new very useful methods to learn. The RVec seems to give many possibilities. I will try as much as possible to change my habits and to progressly move my analysis code to rdataframes.

eguiraud · April 12, 2021, 7:41am

Good to hear!

The RDF user guide should highlight most/all features, and we have tutorials that show realistic analysis usecases (see df101, df102 and following).

RunGraphs is a useful helper when the analysis becomes too complex for a single computation graph.

Cheers,
Enrico

dudouet · April 12, 2021, 1:45pm

Hi again,

I am trying do to a lambda function that can take an option that is not a column, to avoid defining two times the almost same function. but it’s does not seems to work:

   int iint=1;
    auto get_trackE = [&iint] (const RVecF &_trackX1, const RVecF &_trackY1, const RVecF &_trackZ1,
                              const RVecF &_trackX2, const RVecF &_trackY2, const RVecF &_trackZ2,
                              const RVecF &_hitGX,   const RVecF &_hitGY,   const RVecF &_hitGZ, const RVecF &_hitE,
                              const RVecI &mask) {
        RVecF trackE;
        cout<<iint<<endl;
        const RVecF *trackX, *trackY, *trackZ;
        if(iint==1) { trackX = &_trackX1;trackY = &_trackY1;trackZ = &_trackZ1;}
        else if(iint==2) {trackX = &_trackX2;trackY = &_trackY2;trackZ = &_trackZ2;}
        else return trackE;
        const auto goodTrackIdxs = VecOps::Nonzero(mask);
        for (auto i : goodTrackIdxs) {
            for(size_t ihit=0 ; ihit<_hitGX.size() ; ihit++) {
                if( abs(_hitGX.at(ihit)-trackX->at(i))<0.01 &&
                    abs(_hitGY.at(ihit)-trackY->at(i))<0.01 &&
                    abs(_hitGZ.at(ihit)-trackZ->at(i))<0.01)
                    trackE.push_back(_hitE.at(ihit));
            }
        }
        return trackE;
    };
    
    htrackE1 = good_tracks_energy.Define("trackE1", get_trackE, {"trackX1", "trackY1", "trackZ1","trackX2", "trackY2", "trackZ2","hitGX","hitGY","hitGZ","hitE","good_track_mask"}).Histo1D("trackE1");

    iint=2;
    htrackE2 = good_tracks_energy.Define("trackE1", get_trackE, {"trackX1", "trackY1", "trackZ1","trackX2", "trackY2", "trackZ2","hitGX","hitGY","hitGZ","hitE","good_track_mask"}).Histo1D("trackE1");

This prints me some dummy values for iint (0 or 32704). I am also discovering lambda functions actually. What is the best way to be able to give a parameter to the function ?

Jérémie

eguiraud · April 12, 2021, 3:25pm

The problem is that when the lambda is executed, i.e. during the event loop (which is run lazily), iint is always 2.

For the particular case you presented, I think you can do:

auto get_trackE = [] (const RVecF &_trackX, const RVecF &_trackY, const RVecF &_trackZ,
                      const RVecF &_hitGX,  const RVecF &_hitGY, const RVecF &_hitGZ, const RVecF &_hitE,
                      const RVecI &mask) {
        RVecF trackE;
        const auto goodTrackIdxs = VecOps::Nonzero(mask);
        for (auto i : goodTrackIdxs) {
            for(size_t ihit=0 ; ihit<_hitGX.size() ; ihit++) {
                if( abs(_hitGX.at(ihit)-trackX[i])<0.01 &&
                    abs(_hitGY.at(ihit)-trackY[i])<0.01 &&
                    abs(_hitGZ.at(ihit)-trackZ[i])<0.01)
                    trackE.push_back(_hitE.at(ihit));
            }
        }
        return trackE;
    };

auto htrackE1 = good_tracks_energy.Define("trackE1", get_trackE, {"trackX1", "trackY1", "trackZ1","hitGX","hitGY","hitGZ","hitE","good_track_mask"}).Histo1D("trackE1");
auto htrackE2 = good_tracks_energy.Define("trackE1", get_trackE, {"trackX2", "trackY2", "trackZ2","hitGX","hitGY","hitGZ","hitE","good_track_mask"}).Histo1D("trackE1");

I.e. reuse the logic, just change the column names.

dudouet · April 12, 2021, 4:23pm

Indeed, for this specific case what you propose is really better. But is there a way to give to a function, or to a method of the class, some args that are not column names ?

eguiraud · April 12, 2021, 4:59pm

Yes, to pass some constant it’s totally fine to use lambda captures as you did (but one lambda <-> one value of the captured variable).

To do something along the lines of your last example above, you would use functor classes:

struct MultiplyBy {
  double _coeff;
  MultiplyBy(double coeff) : _coeff(coeff) {}
  float operator()(double x) { return x*by; }
};

and

df.Define("xby3", MultiplyBy(3), {"x"}).Define("xby4", MultiplyBy(4), {"x"});

EDIT:
with C++14 you can also use “lambda factories”, which saves characters but looks a bit more exoteric:

auto MultiplyBy(double coeff) {
  // return a lambda that multiplies its argument by `coeff`
  return [coeff](double x) { return x*coeff; }
}

df.Define("xby3", MultiplyBy(4), {"x"});

dudouet · April 13, 2021, 1:53pm

Hi,

me again

Is it possible to create a ROOT::RVec<a_struct> , defined using something like this:

struct track_entry {
    TVector3 pos{};
    float E=0.;
    track_entry(){ ; }
    track_entry(const float &x1, const float &y1, const float &z1, const float &e) :
        pos(x1,y1,z1),
        E(e){ ; }
};

My idea is to create such a vector with the input of my tree, to then easily plays with my different arrays. I wanted to do something like this:

auto extract_tracked_gammas = [](const RVecF &_trackX1, const RVecF &_trackY1, const RVecF &_trackZ1, const RVecF &_trackE, const RVecI &_trackType) {
        RVecTrack vec;

        for(size_t itrack = 0 ; itrack<_trackX1.size() ; itrack++) {
            // check track type and energy gate
            if( _trackType.at(itrack) != 2 || abs(_trackE.at(itrack) - 661) > 5 ) continue;

            vec.push_back(track_entry(_trackX1.at(itrack), _trackY1.at(itrack), _trackZ1.at(itrack),_trackE.at(itrack)));
        }

        return vec;
    };

auto tracked_gammas = fdata_frame.Define("tracked_gammas",extract_tracked_gammas,{"trackX1", "trackY1"  , "trackZ1","trackE" , "trackType"});

This seems to work, but then I don’t know how to use these new RVecTrack to fill some hists. I can still do something like:

auto func = [] (const RVecTrack &vec) {
        RVecF res;
        for (auto &i : vec)
            res.push_back(i.E);
        return res;
    };
    tracked_gammas.Display("tracked_gammas");
    htrackE = tracked_gammas.Define("_trackE",func,{"tracked_gammas"}).Histo1D("_trackE");

but this is quite boring. Is there a way to access to the variables of structure using the strings directly ? Or do my idea is just a bad idea… ?

eguiraud · April 14, 2021, 8:11am

I don’t think there is a way around doing the unpacking before filling.
You can express both the packing and the unpacking more concisely, e.g.:

auto extract_tracked_gammas = [](const RVecF &_trackX1, const RVecF &_trackY1, const RVecF &_trackZ1, const RVecF &_trackE, const RVecI &_trackType) {
   auto mask =  _trackType == 2 && abs(_trackE - 661) <= 5;
   return Construct<track_entry>(_trackX1[mask], _trackY1[mask], _trackZ1[mask], _trackE[mask]);
};

and

auto func = [] (const RVecTrack &vec) {
  return Map(vec, [] (const track_entry &t) { return t.i; });
}

You can also just write

Define("to_fill", "Map(tracked_gammas, [](auto &t) { return t.i; }")

if you let ROOT’s C++ interpreter cling know about your track_entry type, e.g. by building dictionaries for it.

Cheers,
Enrico

dudouet · April 15, 2021, 8:45am

Hi,

I am starting to have something similar in the results to what I was obtaining with my standard TSelector analysis, but the processing is slower (without enabling multi-thread, for a fair comparison) with what I had using TSelector. I am pretty sure that the way I am doing is not optimized, I am doing many times similar things in different loops, but I don’t see how to do better with my current understanding of rdataframes.

I don’t want to waste your time, so no problem if you have no time for it, but do you think that if I send you the current code I am using, with enough comments for allowing you to understand what I am trying to do, you could have a look and help me to see where and how it could be optimized ? If yes, I will prepare the code to be a bit cleaner and commented.

Jérémie

eguiraud · April 15, 2021, 8:55am

Yes please, we don’t want to be slower than TSelector

Send it with data that I can run it on, so I can just check where time is spent.

dudouet · April 15, 2021, 8:58am

Great thank you. Let me some time to clean and comment the and give you some data Thanks a lot.
If I give you directly a git link, is it ok for you ? Like this you can get exactly the same code than me, with the cmake compilation etc that I am using ?

eguiraud · April 15, 2021, 8:59am

Sure – I don’t know when I’ll be able to take a look, but I’ll take a look.

dudouet · April 15, 2021, 2:35pm

Hi, before sending you the code, a simple question. For what I have understood, there is no possibility to limit the range of data we want to read when we are in multi-thread. That’s a pity actually…

For tests, I would like to have the possibility to put an option for using or not a range, and if used, to not perform the program in multi thread, but I don’t see how to do that. I have tried something like this:

if(fentries_to_treat>0) fdata_frame.Range(0,fentries_to_treat);

but it will not work because the effect is only produced on the returning value of this command. And as I cannot apply the range command when MT is on, I need that my next operations, that are applied on the fdata_frame variable are well applied whereas I am or not in MT. Do you see my point ?

And something like this cannot compile:

auto _rangedframe = fdata_frame;
if(fentries_to_treat>0) _rangedframe = _rangedframe.Range(0,fentries_to_treat);

Any hints ?

dudouet · April 15, 2021, 2:41pm

For the code I was talking about before, you can download a data-set at this link.

For the code, you can get it from the gitlab repository: https://gitlab.in2p3.fr/dudouet/gray-imager.git

I’m sure you know perfectly how to compile it, but if needed, some documentation is written in the readme.

The core of the code is done in the file: gi_reader.cpp. I have written as many information in comment that I could to help you to understand what are the input data, and what I want to do with it. Let me know if something is missing or not understable.

For information, I have done exactly the same analysis using a TSelector, if needed I can send it to you also. Without threads, the TSelector runs in around 20 minutes, while the rdataframe code takes almost one hour… hopefully, with the 16 threads of my laptop it’s much faster.

Big thanks for your help

Jérémie

eguiraud · April 15, 2021, 3:47pm

On one hand:

ROOT::RDF::RNode rangedDF = df;
if (fentries_to_treat > 0)
  rangedDF = df.Range(0,fentries_to_treat);
...
// use rangedDF as usual

on the other hand, you can construct a TTree or a TChain with a TEntryList to limit the range of entries read and then pass that TTree/TChain to RDF’s constructor.