DataFrame: define a column from an "input" histogram

Dear experts,

I made a first attempt at “filling” a weight branch by reading a histogram and a couple of variables from my ntuple…
The code I came up with, and got running, is something like this:

    TH1D * _histo = new TH1D("dummy", "dummy", 100, 0., 100);
    auto GetHistogramVal = [&_histo](double _variable) {
        double _val = std::numeric_limits<float>::min();
        int _bin = _histo->FindBin(_variable);
        if (_histo->IsBinUnderflow(_bin) || _histo->IsBinOverflow(_bin)) {
            // Move the value just inside the axis range so that the nearest edge bin is used
            if (_variable <= _histo->GetXaxis()->GetXmin()) _variable = _histo->GetXaxis()->GetXmin() + _histo->GetXaxis()->GetBinWidth(1) / 100.;
            if (_variable >= _histo->GetXaxis()->GetXmax()) _variable = _histo->GetXaxis()->GetXmax() - _histo->GetXaxis()->GetBinWidth(_histo->GetNbinsX()) / 100.;
            _bin = _histo->FindBin(_variable);
        }
        _val = _histo->GetBinContent(_bin);
        return _val;
    };
    // ...
    df.Define("FROMHIST", GetHistogramVal, {"pt"});

Everything runs without breaking anything, and apparently works fine in MT mode, which I would not have expected…

My concern with this kind of workflow is that the only way I have to get the histogram into the lambda function is by capturing it. This means the histogram must exist before I declare the lambda itself, and capturing like this does not guarantee that the histogram remains immutable in all the threads run by the DataFrame.

My worries are:

  1. In my setup, the histogram may or may not exist depending on the input I have, so I add this extra “branch” or column, or skip it, depending on the type of input. For this purpose I have to use pointers and dereference them in calls where I otherwise enforce no-pointer semantics. So I don’t really like capturing the pointer to the histogram in the Define; I would rather capture the histogram object itself, ideally const-qualified.

  2. I find it a bit strange that the only arguments I can pass to the Define call have to be columns of the DataFrame. Since I use a lambda function, I would have expected
    Define( & h1, "pt") to work, although I understand there is something big I am missing here.

The only reason why I would prefer to have Define("ALIAS", function, {myHisto,"pt"})
is that in my function I can enforce const TH1D & myHisto, which implies that whatever I do internally, the method remains thread safe. It also means I can use a single function to “read” an input histogram and apply it for “any” ALIAS I want to add, whatever histogram the content comes from and whatever axes that histogram is defined on.

I don’t know if what I am saying makes sense at all, but I really would like to be able to add a new column to the DataFrame by reading an external “histogram” that ends up shared by all threads, and to be sure this always works. I think this is what people typically do to attach “corrections” to the simulation ntuples, reading them from some data-driven correction histogram.

Thanks in advance
Renato




Hi Renato,
I’m not sure what your question is, exactly. I’ll try to clarify a few things.

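On the histogram having to exist before the lambda is declared: yes, with a lambda this seems like a sane requirement. You don’t have to use a lambda though (see below).

On keeping the histogram immutable: you can make the captured type a pointer-to-const or a const reference to prevent the lambda from modifying the variable. Example snippets are below.

On capturing the histogram object itself, ideally as const: in principle you can absolutely do that.

On passing the histogram as an extra argument, as in Define( & h1, "pt"): I’m not sure how that would work in C++. Even if we could make it work, I doubt the resulting syntax and semantics would be simpler and saner than lambda captures.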

With lambdas:

const TH1D *_histo = new TH1D(...);
auto GetHistogramVal = [&_histo] (double _variable) { ..._histo->GetXaxis()...};

with references:

const TH1D &_historef = *_histo;
auto GetHistogramVal = [&_historef] (double _variable) { ..._historef.GetXaxis()...};

with functor (not tested, but you get the idea):

class ConstHistoGetter {
public:
    ConstHistoGetter(const TH1D& h) : _h(h) {}
    ConstHistoGetter(const ConstHistoGetter& other) : _h(other._h) {}
    // const: the call only reads from the histogram
    double operator()(double _variable) const { ... _h.GetBinContent(...) ... }

private:
    const TH1D& _h; // the histogram must outlive the functor
};

ConstHistoGetter functor(*_histo);
df.Define("value", functor, {"pt"});
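
For readers without ROOT at hand, here is a minimal sketch of the same const-reference functor idea using only the standard library. UniformHisto and ValueAt are hypothetical stand-ins made up for illustration (they are not ROOT classes or methods), and the edge clamping only imitates what the lambda in the first post does with under/overflow bins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a 1D histogram with uniform bins (not a ROOT class).
class UniformHisto {
public:
    UniformHisto(double xmin, double xmax, std::vector<double> contents)
        : xmin_(xmin), xmax_(xmax), contents_(std::move(contents)) {
        assert(xmax_ > xmin_ && !contents_.empty());
    }

    // Read-only lookup; values outside the axis range use the nearest edge bin.
    double ValueAt(double x) const {
        const double width = (xmax_ - xmin_) / static_cast<double>(contents_.size());
        const double pos = (x - xmin_) / width;
        const double last = static_cast<double>(contents_.size() - 1);
        const double clamped = std::min(std::max(pos, 0.0), last);
        return contents_[static_cast<std::size_t>(clamped)];
    }

private:
    double xmin_, xmax_;
    std::vector<double> contents_;
};

// Functor holding a reference-to-const: operator() cannot modify the histogram.
class ConstHistoGetter {
public:
    explicit ConstHistoGetter(const UniformHisto& h) : h_(h) {}
    double operator()(double x) const { return h_.ValueAt(x); }

private:
    const UniformHisto& h_;  // the histogram must outlive the functor and its copies
};
```

The implicitly generated copy constructor is enough here, so copies of the functor all refer to the same read-only histogram. With a real TH1D held by const reference, only const member functions can be called; to my knowledge TH1::FindFixBin is const while TH1::FindBin is not, but please check the TH1 reference for your ROOT version.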

Hope this helps,
Enrico

Hi @eguiraud, I think the procedure I want to follow goes in the direction of using a functor.
The case is rather simple: say that in an analysis one has efficiency corrections as a function of some observables, and one wants to reweight the dataframe entry by entry. You want a single functor able to take any histogram as input (maybe one functor for a TH1D, one for a TH2D, one for a TH2Poly). That way you can declare your input histogram at any point in the code and add all the columns in sequence, just picking the correct input variable used as the axis of the reweighting histogram.
With lambda capture, you need to define one lambda per type of correction, as you can capture only what already exists and has been defined, and I would like to avoid that. I will give the functor a try and eventually post the snippet here so it can serve as an example for others. I guess the functor is the only way to achieve what I have in mind.
Thanks a lot,
Renato
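
A side note on the lambda route: in plain C++, a function that takes the histogram and returns a lambda also gives one reusable piece of code for many corrections, since each call captures a different histogram. Below is a minimal sketch of that pattern, assuming C++14 or later. Correction and MakeWeightGetter are hypothetical names invented for this illustration (not ROOT types), and the sketch has not been tested against RDataFrame.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a correction histogram (not a ROOT class).
struct Correction {
    double xmin, xmax;
    std::vector<double> weights;  // one weight per uniform bin
};

// Builds the per-entry function for any correction. The lambda holds a shared
// pointer-to-const: it cannot modify the histogram, and it keeps it alive.
inline auto MakeWeightGetter(std::shared_ptr<const Correction> corr) {
    assert(corr && !corr->weights.empty() && corr->xmax > corr->xmin);
    return [corr](double x) -> double {
        const std::size_t n = corr->weights.size();
        if (x <= corr->xmin) return corr->weights.front();  // below range: first bin
        if (x >= corr->xmax) return corr->weights.back();   // above range: last bin
        const double width = (corr->xmax - corr->xmin) / static_cast<double>(n);
        const auto bin = static_cast<std::size_t>((x - corr->xmin) / width);
        return corr->weights[std::min(bin, n - 1)];
    };
}
```

Each call such as MakeWeightGetter(ptCorrection) or MakeWeightGetter(etaCorrection) yields an independent callable of the same shape as GetHistogramVal, so the histogram only needs to exist at the point where the column is added.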

Here is a minimal example I managed to get working:


void fill_tree(const TString treeName, const TString fileName, int nEntries)
{
   ROOT::DisableImplicitMT(); // With implicit MT enabled, entries get duplicated at the start...
   ROOT::RDataFrame d(nEntries);
   int i(0);
   d.Define("i", [&i]() { return (double)i; })
    .Define("j",
            [&i]() {
               auto j = i * i;
               ++i;
               return j;
            })
    .Snapshot(treeName.Data(), fileName.Data());
}

class TH1DValueGetter {
public:
    TH1DValueGetter( TH1D& h) : _histo(h) {}
    TH1DValueGetter( TH1DValueGetter&& other) : _histo(other._histo) {}
    TH1DValueGetter( TH1DValueGetter& other) : _histo(other._histo) {}

    double operator()(double _variable) { 
      double _val = 1;
      double _var = _variable;
      int _bin = _histo.FindBin(_var);
      if (_histo.IsBinUnderflow(_bin) || _histo.IsBinOverflow(_bin)) {
          if (_var <= _histo.GetXaxis()->GetXmin()) _var = _histo.GetXaxis()->GetXmin() + _histo.GetXaxis()->GetBinWidth(1) / 100.;
          if (_var >= _histo.GetXaxis()->GetXmax()) _var = _histo.GetXaxis()->GetXmax() - _histo.GetXaxis()->GetBinWidth(_histo.GetNbinsX()) / 100.;
          _bin = _histo.FindBin(_var);
      }     
      _val = _histo.GetBinContent(_bin);
      return _val;
    }
private:
    TH1D _histo;
};

void miniTest(){
  fill_tree( "test", "test.root", 1000);
  TH1D * h1 = new TH1D("hgaus", "hgaus", 500, 0, 1000);
  //a random TH1D gaussian distributed histo
  TRandom3 rnd;
  double mean = 500;
  double sigma = 50;
  for( int i =0; i < 100000; ++i){
    h1->Fill(rnd.Gaus(mean,sigma));
  }
  h1->Draw();
  h1->Scale( 1./h1->GetEntries());
  ROOT::EnableImplicitMT();
  auto df =  ROOT::RDataFrame("test", "test.root");
  TH1DValueGetter valueGetter(*h1);
  df.Define("weight", valueGetter, {"i"})
    .Snapshot("testW", "testW.root");
  ROOT::DisableImplicitMT();
}

Unfortunately I cannot make the TH1D const, as some of the methods I call are not marked const.
Also, I had to add a move constructor to the functor, otherwise the code was breaking.
A side comment: I noticed that when filling the initial TTree with MT enabled, I get duplicated entry values, while I don’t when it is disabled.

On the non-const methods: if a method is not marked const, it’s probably not thread-safe in general. Be careful! 🙂

On having to add a move constructor: I think it’s because your copy-constructor is wrong: it should take a const reference, not a reference. The way it is, it can’t bind to temporaries. This works, for example:

root [0] ROOT::RDataFrame df(10)
(ROOT::RDataFrame &) An empty data frame that will create 10 entries
root [1] struct Foo { int operator()() { return 42; } };
root [2] *df.Define("x", Foo()).Take<int>("x")
(std::vector<int, std::allocator<int> > &) { 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 }
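
The binding rule can also be checked at compile time without ROOT. This is a minimal standard-library sketch; NonConstCopy and ConstCopy are hypothetical types invented for the illustration, with the first one copying the constructor signature used in TH1DValueGetter.

```cpp
#include <cassert>
#include <type_traits>

// Copy constructor taking a non-const reference, as in TH1DValueGetter.
struct NonConstCopy {
    NonConstCopy() = default;
    NonConstCopy(NonConstCopy&) {}
    int operator()() { return 42; }
};

// Copy constructor taking a const reference (the usual signature).
struct ConstCopy {
    ConstCopy() = default;
    ConstCopy(const ConstCopy&) {}
    int operator()() { return 42; }
};

// A non-const reference binds only to non-const lvalues.
static_assert(std::is_constructible<NonConstCopy, NonConstCopy&>::value,
              "copy from a non-const lvalue works");
static_assert(!std::is_constructible<NonConstCopy, const NonConstCopy&>::value,
              "copy from a const object does not");
static_assert(!std::is_constructible<NonConstCopy, NonConstCopy>::value,
              "construction from an rvalue does not either");

// A const reference binds to all three.
static_assert(std::is_constructible<ConstCopy, ConstCopy&>::value, "non-const lvalue");
static_assert(std::is_constructible<ConstCopy, const ConstCopy&>::value, "const lvalue");
static_assert(std::is_constructible<ConstCopy, ConstCopy>::value, "rvalue");

// Runtime check: copying from a const object and calling the copy.
inline int CopyFromConst() {
    const ConstCopy original{};
    ConstCopy copy(original);
    assert(copy() == 42);
    return copy();
}
```

Declaring any copy constructor also suppresses the implicit move constructor, which is consistent with the functor needing a hand-written one before the signature is fixed.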

On the duplicated entries with MT enabled: your Define lambdas for “i” and “j” in fill_tree are not thread-safe (they share and modify the captured counter i), so that could very well happen.
A thread-safe version:

d.Define("i", [](ULong64_t e) { return double(e); }, {"rdfentry_"})
 .Define("j", [](double i) { return i*i; }, {"i"});
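
To illustrate why the entry-based version is robust, here is a minimal sketch using only the standard library. It does not reproduce the data race itself and uses no threads: it only shows that a column computed from a shared counter depends on the order in which entries are visited, while a column computed from the entry number does not. CounterColumn, EntryColumn and Evaluate are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stateful: the returned value depends on how many calls came before.
struct CounterColumn {
    int counter = 0;
    double operator()(std::uint64_t /*entry*/) { return static_cast<double>(counter++); }
};

// Stateless: the returned value depends only on the entry number.
struct EntryColumn {
    double operator()(std::uint64_t entry) const { return static_cast<double>(entry); }
};

// Evaluates the column once per entry, visiting entries in the given order,
// and stores each result at the position of its entry.
template <typename Column>
std::vector<double> Evaluate(Column column, const std::vector<std::uint64_t>& order) {
    std::vector<double> values(order.size());
    for (std::uint64_t entry : order) {
        assert(entry < values.size());
        values[entry] = column(entry);
    }
    return values;
}
```

Visiting entries as 0, 1, 2, 3 or as 2, 3, 0, 1 gives the same values with EntryColumn but different ones with CounterColumn. With several threads, the processing order is not under your control, and on top of that the unsynchronized counter is a data race.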

Cheers,
Enrico
