RDF Divide entries (or events) by the run number

Dear experts,

In a recent project using RDataFrame I want to divide the entries (or events) according to their run number (a branch). What I’m using now is a for loop and a filter inside the for loop, as shown in the code below.

// Example on how I divided entries by run numbers
for (int i = 0; i < runNum; i++) {
  auto df3 = df2.Filter([i](const int run){return (bool)(run==i);}, {"run"});
  resultVector.emplace_back(df3.Histo1D("Branch_A"));
}

However, in this way for each event I’m comparing its run number with all possible run numbers, which is probably unnecessary and scales badly with the total run numbers. For example, in a plain event loop I can probably use std::map for this and that would be more efficient.

So is there a better way to do this in RDF? Thanks!

Best,
Kevin


ROOT Version: 6.16.00
Platform: CentOS Linux 7 (lxplus)
Compiler: gcc 4.8.5


Maybe @dpiparo can give some hints

Hi,

let me try to rephrase this: you would like to create categories of your entries according to the value of the run they belong to. In order to do that, you can use the run number stored in a branch of your tree, let’s say it’s called runNumber. Then you want to process those categories, say, to create a histogram for each one of them.

This is a very nice feature which RDF does not provide yet in its full glory (a “multifilter/categoriser” @StephanH ) but can be emulated with filters. Are the values of the run number varying arbitrarily or you have the first n_0 entries in run r_0, the second bunch of n_1 entries in run r_1 and so on?

Cheers,
D

Hi,
What about using a TH2D where the second dimension is the run number?

df.Histo2D("Branch_A", "run")

Then you can take slices/projections/profiles of it to retrieve a TH1 per run number.

Alternatively, you can create a simple custom action class that you register with RDataFrame through the Book action. This is a bit more complicated but we have a tutorial that shows how to create a thread-safe custom action here: https://root.cern.ch/doc/master/df018__customActions_8C.html

Cheers,
Enrico

Hi,

Thank all of you for your replies!

@eguiraud These seem to be nice workarounds! Probably Histo2D can be a simple and good way out of this for now. Your link for Book is also very helpful. It seems to be a fill by slot + merge.

@dpiparo Yes you precisely rephrased what I want!

Run number here is more like an example. In my projects I pretty much always find myself in need of categorizing things by the run number, pT intervals, eta intervals, centrality bins and so on.

Sometimes their values vary arbitrarily and sometimes they tend to come one by one (“have the first n_0 entries in run r_0, the second bunch of n_1 entries in run r_1 and so on”), depending on the case.

May I ask how you would emulate it with filters?

====================================

So far I’ve been using for loop + filter (as posted) or sometimes higher dimensional histograms as workarounds, however they are a bit “cumbersome”. (RDF usually seems fairly clear and straightforward.)

Since this situation happens very often (at least for me) and seems a little not straightforward to deal with (again at least for me), I’m wondering if there’ll be a more “native” “multifilter/categoriser”, as @dpiparo mentioned, available in the future. (And just curious is O(log n) probably too crazy? :wink:)

Sorry for the long post and thank you!

Best,
Kevin

I think I had a similar issue. My current way of using Filters to create categories in one variables looks like this.

std::vector<ROOT::RDF::RNode> slice(ROOT::RDF::RNode node, 
    std::vector<double> binedges, std::string varname, 
    std::vector<ROOT::RDF::RNode> &parents,
    std::string cutname = ""){

    parents.push_back(node);

    std::vector<double> left;
    std::vector<double> right;

    if(binedges.empty()){
        std::vector<ROOT::RDF::RNode> empty;
        empty.push_back(node);
        return empty;
    }

    int i_slice = binedges.size() / 2;
    double sliceval = binedges[i_slice];

    std::string cutleft = Form("%s <= %f", varname.c_str(), sliceval);
    std::string cutright = Form("%s > %f", varname.c_str(), sliceval);

    ROOT::RDF::RNode f_left = node.Filter(
                [sliceval](double slicevar){return (slicevar <= sliceval);},
                {varname.c_str()},
                (cutname+" "+cutleft).c_str()
                );

    ROOT::RDF::RNode f_right = node.Filter(
                [sliceval](double slicevar){return (slicevar > sliceval);},
                {varname.c_str()},
                (cutname+" "+cutright).c_str()
                );

    if(binedges.size() == 1){
        std::vector<ROOT::RDF::RNode> result;
        result.push_back(f_left);
        result.push_back(f_right);
        return result;
    }

    left.insert(left.end(), binedges.begin(), binedges.begin()+i_slice);
    right.insert(right.end(), binedges.begin()+i_slice+1, binedges.end());

    auto fs_left = slice(f_left, left, varname, parents, cutname+" "+cutleft);
    auto fs_right = slice(f_right, right, varname, parents, cutname+" "+cutright);
    fs_left.insert(fs_left.end(), fs_right.begin(), fs_right.end());
    return fs_left;
}

I did not really check if this is more performant than creating n-bin Filters. The number of Filters needed is higher, but most Filters are calculated only on a subset of events. I think for a uniformly distributed variable the number of filter calculations should be something like 2 n ln(k), with n number of events and k number of categories. For using k individual filters executed on all events there are n*k calculations.

I would also appreciate a nice multifilter RDF feature.

Best,
Christian

Hi @cwiel, @Kevin1,
thank you for your feedback!

How do you imagine a “categorizer” should look like? Would

RResultPtr<vector<TH1D>> histos = df.Categorize(/*category_edges=*/{0, 0.5, 1},
                                                /*columns=*/{"value", "category"})

fit the bill? Would it cover all of your usecases?

Note that, when viable, using a TH2D as a workaround to fill N “categorized” TH1D is actually the fastest way (and, most probably, smaller or equal runtime than the “categorizer” we could implement in RDF) to produce such results in ROOT.

Cheers,
Enrico

I would prefer something like a vector of Nodes as return type to be able to slice/categorize in multiple variables one after the other.

I see, uhm…the naiive way to implement that internally would be equivalent to what @Kevin1 showed in his original post (N filters called at each entry), and Categorize would just be syntactic sugar on top. But that’s slow.

Implementing a performant Categorize that returns generic RNodes is quite a bit more complicated (because of how things work internally). But it’s a requirement if we want users to be able to “categorize” multiple variables without checking the category multiple times.

Hi,

@cwiel thank you for your code snippet! It seems to avoid many unnecessary comparisons.

@eguiraud thank you for your advice on TH2D!

@cwiel I’m not sure if I understand you correctly, but I think I would also like a return type of RNodes in a container for more flexibility.

I’m thinking about something like this:

std::vector<int> runNumberCategory{367170, 367233, 367273};

std::map<int, ROOT::RDF::RNode> nodePerRunNumber = df
    .Categorize("branchRunNumber", runNumberCategory);

auto h = nodePerRunNumber.at(367170).Histo1D("energy");

So the given categories would be the key and the RNodes would be the value and the return type is a std::map of them.

Comparison with Filter:
Filter: “1” input branch --> “1” comparison --> 1 output RNode
N-Categorize: “1” input branch --> < N comparisons --> N output RNode

(Comparisons in N-Categorize can be internally done via for example, red-black tree like key comparison in std::map.)

And in my prototype example categories are int and comparison is trivial ==. They can probably be generalized later, for example float and user-defined Compare comp.

@eguiraud But of course these are just my naive thinking. I don’t really know how RDataFrame works internally. This is just an example with some more details to show what I would look forward to. Would something like this still be crazy to implement? Thanks!

Best,
Kevin

Ah yes, categories could be ranges or also single values. Right.

It’s not crazy at all, but it’s not trivial to provide a performant (competitive with the TH2D workaround) implementation that returns RNodes rather than TH1Ds – because nodes are very generic objects the code that navigates them during the event loop is not as optimized as choosing what cell of a TH2D to fill.

On the other hand, it’s easy to implement an unoptimized Categorize function that does what you want – so easy that I would be inclined to just leave it as an helper function (maybe in RDFHelpers.hxx) with a big disclaimer that says that it’s just syntactic sugar over a bunch of Filters:

template <typename T>
std::map<int, RNode>
CategorizeOnValues(RNode df, const std::string &colName, const std::vector<T> &categories) 
{
  std::map<T, RNode> nodes;
  for (auto c : categories) {
    nodes[c] = df.Filter([c](int val){return val == c;}, colName);
  return nodes;
}

to be used as

std::vector<int> runNumberCategory{367170, 367233, 367273};
auto nodePerRunNumber = Categorize(df, "runNumber", runNumberCategory);

And similarly you could implement CategorizeOnRanges. You can use these function in your code already.

Thanks!

So when a higher dimensional histogram is available I’ll just use that.

When the performance isn’t such a key factor I would probably use this syntactic sugar.

Best,
Kevin

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.