RDataFrame, plotting 2D vector into 1D histogram



ROOT Version: 6.32.02
Platform: macosx64
Compiler: not sure (homebrew build)


I am using RDataFrame with PyROOT. In my tree, each event has some vector* branches. I want to draw this two-dimensional data (events × vector elements) into a 1D histogram.
For example:

event 1  -> [ 1, 2, 3 ] 
event 2  -> [ 10, 20, 30 ]
event N -> [ .. .. .. ]

I want a 1D histogram for this data set: [ 1, 2, 3, 10, 20, 30, ... ]

I think I have to "Define" a new branch with RDataFrame in order to use the "Histo1D" function for this purpose, but I cannot work out how to define a new branch for this 2D to 1D conversion.

Hi Gokhan,

Thanks for the post and welcome to the ROOT Community!

If I understand correctly (please correct me if I am wrong!), your dataset has columns which hold collections (std::vector<T>), and you want to fill a one-dimensional histogram with all elements of such collections, for all events.

To achieve that goal, no particular operation needs to be performed. It is enough to invoke the Histo1D action, and all loops will be handled transparently internally. See for example the tutorial tutorials/dataframe/df002_dataModel.py, and in particular the line trPts = augmented_d.Histo1D("tracks_pts"), where the column tracks_pts is a collection of double precision floating point numbers.
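
To make the semantics concrete, here is a small pure-Python sketch of what filling a 1D histogram from a collection column amounts to: every element of every event contributes one entry. It is only an illustration and not how ROOT implements it; the helper names flatten_events and fill_histogram are made up for this example, and unlike ROOT (which has underflow and overflow bins) out-of-range values are simply skipped.

```python
from itertools import chain


def flatten_events(events):
    """Concatenate the per-event collections into one flat list of values."""
    return list(chain.from_iterable(events))


def fill_histogram(values, nbins, low, high):
    """Count values into nbins equal-width bins on [low, high).

    Simplification: values outside the range are skipped.
    """
    counts = [0] * nbins
    width = (high - low) / nbins
    for v in values:
        if low <= v < high:
            # min() guards against floating-point rounding at the upper edge.
            index = min(int((v - low) / width), nbins - 1)
            counts[index] += 1
    return counts


# The two example events from the question.
events = [[1, 2, 3], [10, 20, 30]]
flat = flatten_events(events)
counts = fill_histogram(flat, nbins=4, low=0, high=40)
print(flat)
print(counts)
```

With RDataFrame none of this is written by hand: booking Histo1D on the vector column is enough.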

I hope this helps.

Cheers,
D

Hi Danilo,

Thank you for your fast response. Indeed, RDataFrame handles the vector branch as you said. It works like butter now.

I have a similar question for another data type. For a branch which holds vector<TVector3>* values, I want to draw a separate 1D histogram for each of the vectors' x, y, and z components, as in my previous question.

For this purpose, I have created three new vector branches for the x, y, and z components, using a function like the following one. It works and the output is correct.

ROOT.gInterpreter.Declare('''
std::vector<float> getVectorComponent(const ROOT::VecOps::RVec<TVector3>& vecIn) {
    std::vector<float> out;
    out.reserve(vecIn.size());
    for (const auto& v : vecIn) {
        out.push_back(v.X()); 
    }
    return out;
}
''')

_tmp = DF.Define("sampleVector_x", "getVectorComponent(sampleVector)")

Do you think this is a proper way to do it? It works, but it is really slow.

Hello,

I am surprised you say it’s really slow. How slow? Can you quantify?

Your code is great. You can even use it without interpreting it on the fly, for example:

std::vector<float> getVectorComponent(const ROOT::VecOps::RVec<TVector3>& vecIn) {
    std::vector<float> out;
    out.reserve(vecIn.size());
    for (const auto& v : vecIn) {
        out.push_back(v.X()); 
    }
    return out;
}

auto _tmp = DF.Define("sampleVector_x", getVectorComponent, {"sampleVector"});
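
If the code is driven from Python, one possible way to avoid writing three nearly identical functions is to generate the Define expressions in a loop. The sketch below is an assumption-laden example, not a recommendation on performance: the helper names component_expression and define_components are hypothetical, the expression relies on ROOT::VecOps::Map with a C++ lambda being accepted in a jitted Define string, and the resulting columns hold doubles (the return type of TVector3::X()) rather than floats.

```python
def component_expression(column, component):
    """Build a C++ expression string that extracts one component (X, Y or Z)
    from every TVector3 of a collection column."""
    if component not in ("X", "Y", "Z"):
        raise ValueError(f"unknown TVector3 component: {component!r}")
    return (
        f"ROOT::VecOps::Map({column}, "
        f"[](const TVector3& v) {{ return v.{component}(); }})"
    )


def define_components(df, column):
    """Chain three Define calls on an RDataFrame-like object, creating
    <column>_x, <column>_y and <column>_z, and return the new node."""
    for component in ("X", "Y", "Z"):
        new_name = f"{column}_{component.lower()}"
        df = df.Define(new_name, component_expression(column, component))
    return df


# Hypothetical usage with a real RDataFrame named DF:
#   DF_xyz = define_components(DF, "sampleVector")
#   h_x = DF_xyz.Histo1D("sampleVector_x")
print(component_expression("sampleVector", "X"))
```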

I hope this helps.

Cheers,
D

Maybe I should say relatively slow.

What I am doing is creating multiple histograms from a ROOT data file. My sample data file is about 5 GB, and I create a total of 526 histograms from 221 branches, which takes about 6 min 47 s.

real	6m47,808s
user	6m24,254s
sys	0m13,556s

But when I include just these two additional branches in my program, the total run time is over 8 minutes. This step only adds two new branches and a total of 12 new output PNG files, so the increase in processing time (1 min 25 s) seems much larger than I expected.

real	8m12,671s
user	7m38,632s
sys	0m18,018s

I don't know, maybe my expectations are off.

I cannot do that in PyROOT, I guess. Can I?

Hi,

I apologise: I did not notice your code was Python!
For now, we do not fully support that case, but we are working on it.

We are doing more things, so it’s kind of expected that the time increases. The delta seems a bit high though, as you pointed out. Are you sure you are running the event loop just once?
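
On the "event loop just once" point: RDataFrame actions such as Histo1D are lazy, and the loop over the data starts the first time any result is accessed. If one histogram is drawn before the next one is booked, each access triggers a new loop. A minimal sketch of the book-everything-first pattern follows; the function names book_histograms and consume_all are made up for this example, and df stands for any RDataFrame node.

```python
def book_histograms(df, columns):
    """Book one Histo1D per column. Nothing is computed here: each call
    only returns a lazy result handle."""
    return {column: df.Histo1D(column) for column in columns}


def consume_all(booked, handle):
    """Access the results only after everything has been booked, so that
    the first GetValue() call fills all booked histograms in one loop."""
    for column, result in booked.items():
        handle(column, result.GetValue())


# Hypothetical usage with a real RDataFrame named DF:
#   booked = book_histograms(DF, ["sampleVector_x", "sampleVector_y"])
#   consume_all(booked, lambda name, hist: print(name, hist.GetEntries()))
```

The handle callback is where each histogram would be drawn on a canvas and saved to a file.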

To improve your experience, have you tried running multi-threaded with ROOT::EnableImplicitMT()?

Cheers,
D

I tried that, but the overall RAM consumption gets high with parallel runs.

Yes, indeed, I run the event loop only once.