Use RDataFrame with THnSparse

Are there plans to allow the filling of a THnSparse with RDataFrame, a la Histo1D?


ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hi, does the Fill action fit the bill? (https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html#a0cac4d08297c23d16de81ff25545440a)

Maybe! It looks like it’s not yet implemented for THnSparse, or perhaps I just don’t understand how to call it; I get a NotImplementedError in the following script:

from array import array

import ROOT

df = ROOT.RDataFrame(100).Define('br1', 'rdfentry_').Define('br2', 'pow(rdfentry_, 2)')

nbins = array('i', (10, 10))
xmin = array('d', (0, 0))
xmax = array('d', (100, 10000))
h = ROOT.THnSparseD('h', 'h;br1;br2', 2, nbins, xmin, xmax)

collist = ROOT.std.vector('std::string')()
for c in df.GetColumnNames():
    collist.push_back(c)

h2 = df.Fill(h, collist)

returns:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-18-bc42dad28816> in <module>()
----> 1 h2 = df.Fill(h, collist)

NotImplementedError: ROOT::RDF::RResultPtr<THnSparseT<TArrayD> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Fill<THnSparseT<TArrayD> >(THnSparseT<TArrayD>&& model, const vector<string>& bl) =>
    could not convert argument 1 (this method can not (yet) be called)

Moreover, it’s not clear from that documentation how I would Fill with weights.

Hi,
uhm not sure why it doesn’t work, I’ll have to take a look.

It’s a generic C++ template method to which you can pass any object with a Fill method, which will be called with the values of the specified columns as arguments. As such it is pretty generic and you can also use it to fill histograms with weights if the signature of the Fill method allows it.

I might be that it doesn’t work with THnSparse because it doesn’t implement an Add method, which is probably also required now that I think about it.

In the meanwhile, one way I can think of to get what you want is to use custom actions – see the tutorial here and the docs here, although to make this work in python you would need to define a C++ function that adds the custom action to a dataframe and then call it from python, which is a bit tedious.

Cheers,
Enrico

Could it be a problem that THnSparse requires an array as the argument?

Indeed, I cannot get this to work in pyROOT, even with the tutorial as-written. Details below, but in summary: I declare the class as in the tutorial but cannot find a way to call the constructor in pyROOT in a way the class recognizes–it doesn’t seem to like any of the list-building classes I’ve tried.


Declare the class (copy/paste from the tutorial):

 ROOT.gROOT.ProcessLine('''
 template <typename T, unsigned int NDIM>
 class THnHelper : public ROOT::Detail::RDF::RActionImpl<THnHelper<T, NDIM>> {
 public:
    /// This is a handy, expressive shortcut.
    using THn_t = THnT<T>;
    /// This type is a requirement for every helper.
    using Result_t = THn_t;
 
 private:
    std::vector<std::shared_ptr<THn_t>> fHistos; // one per data processing slot
 
 public:
    /// This constructor takes all the parameters necessary to build the THnTs. In addition, it requires the names of
    /// the columns which will be used.
    THnHelper(std::string_view name, std::string_view title, std::array<int, NDIM> nbins, std::array<double, NDIM> xmins,
              std::array<double, NDIM> xmax)
    {
       const auto nSlots = ROOT::IsImplicitMTEnabled() ? ROOT::GetImplicitMTPoolSize() : 1;
       for (auto i : ROOT::TSeqU(nSlots)) {
          fHistos.emplace_back(std::make_shared<THn_t>(std::string(name).c_str(), std::string(title).c_str(),
                                                       NDIM, nbins.data(), xmins.data(), xmax.data()));
          (void)i;
       }
    }
    THnHelper(THnHelper &&) = default;
    THnHelper(const THnHelper &) = delete;
    std::shared_ptr<THn_t> GetResultPtr() const { return fHistos[0]; }
    void Initialize() {}
    void InitTask(TTreeReader *, unsigned int) {}
    /// This is a method executed at every entry
    template <typename... ColumnTypes>
    void Exec(unsigned int slot, ColumnTypes... values)
    {
       // Since THnT<T>::Fill expects a double*, we build it passing through a std::array.
       std::array<double, sizeof...(ColumnTypes)> valuesArr{static_cast<double>(values)...};
       fHistos[slot]->Fill(valuesArr.data());
    }
    /// This method is called at the end of the event loop. It is used to merge all the internal THnTs which
    /// were used in each of the data processing slots.
    void Finalize()
    {
       auto &res = fHistos[0];
       for (auto slot : ROOT::TSeqU(1, fHistos.size())) {
          res->Add(fHistos[slot].get());
       }
    }
 
    std::string GetActionName(){
       return "THnHelper";
    }
 };
 '''
 )

I cannot actually construct it, however. I have tried using array.array, ROOT.std.vector, and ROOT.std.list. All give the same result:

nbins = array('i', (10, 10))
xmin = array('d', (0, 0))
xmax = array('d', (5, 5))
h = ROOT.THnHelper(float, 2)('h', 'h', nbins, xmin, xmax)
nbins = ROOT.std.vector('int')()
nbins.push_back(10)
nbins.push_back(10)
xmin = ROOT.std.vector('double')()
xmin.push_back(0)
xmin.push_back(0)
xmax = ROOT.std.vector('double')()
xmax.push_back(5)
xmax.push_back(5)
h = ROOT.THnHelper(float, 2)('h', 'h', nbins, xmin, xmax)
 nbins = ROOT.std.list('int')()
 nbins.push_back(10)
 nbins.push_back(10)
 xmin = ROOT.std.list('double')()
 xmin.push_back(0)
 xmin.push_back(0)
 xmax = ROOT.std.list('double')()
 xmax.push_back(5)
 xmax.push_back(5)
 h = ROOT.THnHelper(float, 2)('h', 'h', nbins, xmin, xmax)

all give:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-29c927b1bf48> in <module>()
      2 xmin = array('d', (0, 0))
      3 xmax = array('d', (5, 5))
----> 4 h = ROOT.THnHelper(float, 2)('h', 'h', nbins, xmin, xmax)

TypeError: none of the 3 overloaded methods succeeded. Full details:
  THnHelper<float,2>::THnHelper<float,2>(basic_string_view<char,char_traits<char> > name, basic_string_view<char,char_traits<char> > title, array<int,2> nbins, array<double,2> xmins, array<double,2> xmax) =>
    could not convert argument 3
  THnHelper<float,2>::THnHelper<float,2>(THnHelper<float,2>&&) =>
    takes at most 1 arguments (5 given)
  THnHelper<float,2>::THnHelper<float,2>(const THnHelper<float,2>&) =>
    takes at most 1 arguments (5 given)

Hi,
that’s interesting! Support for template methods is indeed suboptimal in current PyROOT. Things should get better when experimental PyROOT becomes the default, but still these RDF methods might require pythonizations to work properly. So I’m not sure there is a proper solution at the moment.

My actual suggestion was to do all of the RDF stuff on the C++ side, and only get the result in PyROOT:

# PyROOT
df = ROOT.RDataFrame(...)
hnsparse = ROOT.CppFunctionThatCallsFill(df)

where CppFunctionThatCallsFill is something like:

// C++
ROOT::RDF::RResultPtr<THnSparse> CppFunctionThatCallsFill(ROOT::RDF::RNode df)
{
   THnSparse ...;
   return df.Book(...);
}

But of course this is…awkward.

Okay, I’ve found a solution. Thank you for your advice and for pointing me to the tutorial. My code follows below for anyone interested down the line.


def declare():
    'declares class SparseHelper to enable RDataFrame to fill a THnSparse'
    'it is based on $ROOTSYS/tutorials/dataframe/df018_customActions.C'
    import ROOT
    
    ROOT.gROOT.ProcessLine(r'''
        // This is a custom action which respects a well defined interface. It supports parallelism,
        // in the sense that it behaves correctly if implicit multi threading is enabled.
        // We template it on:
        // - The type of the internal THnSparseT(s)
        // - The dimension of the internal THnSparseT(s)
        // Note the plural: in presence of a MT execution, internally more than a single THnSparseT is created.
        // Note also that when booking, the last column will be understood as the weight
        template <typename T, unsigned int NDIM>
        class SparseHelper : public ROOT::Detail::RDF::RActionImpl<SparseHelper<T, NDIM>> {
        public:
            /// This is a handy, expressive shortcut.
            using THn_t = THnSparseT<T>;
            /// This type is a requirement for every helper.
            using Result_t = THn_t;
        
        private:
            std::vector<std::shared_ptr<THn_t>> fHistos; // one per data processing slot
        
        public:
            /// This constructor takes all the parameters necessary to build the THnSparseTs. In addition, it requires the names of
            /// the columns which will be used.
            SparseHelper(std::string_view name, std::string_view title, std::array<int, NDIM> nbins, std::array<double, NDIM> xmins,
                      std::array<double, NDIM> xmax)
            {
                const auto nSlots = ROOT::IsImplicitMTEnabled() ? ROOT::GetImplicitMTPoolSize() : 1;
                for (auto i : ROOT::TSeqU(nSlots)) {
                    fHistos.emplace_back(std::make_shared<THn_t>(std::string(name).c_str(), std::string(title).c_str(),
                                                                 NDIM, nbins.data(), xmins.data(), xmax.data()));
                    (void)i;
               }
            }
            SparseHelper(SparseHelper &&) = default;
            SparseHelper(const SparseHelper &) = delete;
            std::shared_ptr<THn_t> GetResultPtr() const { return fHistos[0]; }
            void Initialize() {}
            void InitTask(TTreeReader *, unsigned int) {}
            /// This is a method executed at every entry
            template <typename... ColumnTypes>
            void Exec(unsigned int slot, ColumnTypes... values)
            {
                // Since THnSparseT<T>::Fill expects a double*, we build it passing through a std::array.
                std::array<double, sizeof...(ColumnTypes)> _valuesArr{static_cast<double>(values)...};
                // We understand the last item in the list as a weight
                std::array<double, sizeof(_valuesArr) - 1> valuesArr;
                for(int i=0; i<sizeof(_valuesArr) - 1; i++){
                    valuesArr[i] = _valuesArr[i];
                }
                auto w = _valuesArr.back();
                fHistos[slot]->Fill(valuesArr.data(), w);
            }
            /// This method is called at the end of the event loop. It is used to merge all the internal THnSparseTs which
            /// were used in each of the data processing slots.
            void Finalize()
            {
                auto &res = fHistos[0];
                for (auto slot : ROOT::TSeqU(1, fHistos.size())) {
                    res->Add(fHistos[slot].get());
                }
            }
            
            std::string GetActionName(){
                return "SparseHelper";
            }
        };
        ''')


def fill(df, hnm, htit, brlist, weightbr):
    'create THnSparse(hnm, htit, ...) wrapped in RResultPtr'
    
    import ROOT
    
    br_and_weight_list = brlist + [weightbr]
    
    ROOT.gROOT.ProcessLine('''
        ROOT::RDF::RResultPtr<THnSparseT<TArrayD> > CppFunctionThatCallsFillFor_{hnm}(ROOT::RDF::{dfclassnm} df)
        {{
            // Our Helper type: templated on the internal THnSparseT type, the size, the types of the columns
            // we'll use to fill.
            using Helper_t = SparseHelper<TArrayD, {ndims}>;
            
            Helper_t helper{{"{hnm}",
                            "{htit}",
                            {{{nbins}}},
                            {{{xmins}}},
                            {{{xmaxs}}}}};
            
            // We book the action: it will be treated during the event loop.
            auto myTHnSparseT = df.Book<{dattyps}>(std::move(helper), {{{br_and_weight_nms}}});
            
            return myTHnSparseT;
        }}
    '''.format(
        ndims=len(brlist),
        dfclassnm=df.__class__.__name__,
        hnm=hnm,
        htit=htit,
        nbins=','.join(str(x.nBins) for x in brlist),
        xmins=','.join(str(x.loBin) for x in brlist),
        xmaxs=','.join(str(x.hiBin) for x in brlist),
        dattyps=','.join(x.datatype for x in br_and_weight_list),
        br_and_weight_nms=','.join('"{}"'.format(x.branch) for x in br_and_weight_list))
    )
    return getattr(ROOT, 'CppFunctionThatCallsFillFor_' + hnm)(df)


def test():
    import ROOT
    
    ROOT.gROOT.SetBatch(True)
    ROOT.ROOT.EnableImplicitMT()
    declare()
    df = ROOT.RDataFrame(10000).Define('x0', 'gRandom->Uniform(-5, 5)').Define('x1', '(float)gRandom->Uniform(0, 10)').Define('x2', '(int)gRandom->Uniform(-100, 100)').Define('weight', '(double) 1.')
    
    class mybranch():
        def __init__(self, branch, nBins=0, loBin=0, hiBin=0, datatype=None):
            self.branch = branch
            self.nBins = nBins
            self.loBin = loBin
            self.hiBin = hiBin
            self.datatype = datatype
    
    brlist = [
        mybranch('x0', 10, -5, 5, datatype='double'),
        mybranch('x1', 10, 0, 10, datatype='float'),
        mybranch('x2', 10, -100, 100, datatype='int'),
    ]
    wbr = mybranch('weight', datatype='double')
    myhist1 = fill(df, 'myhist1', 'myhist1', brlist[:1], wbr)
    myhist2 = fill(df, 'myhist2', 'myhist2', brlist[:2], wbr)
    myhist3 = fill(df, 'myhist3', 'myhist3', brlist[:3], wbr)
    c = ROOT.TCanvas()
    hs = ROOT.THStack()
    hs.Add(myhist1.Projection(0))
    hs.Add(myhist2.Projection(1))
    hs.Add(myhist3.Projection(2))
    hs.Draw('nostack plc pmc')
    c.BuildLegend()
    c.SaveAs('test.png')


if __name__ == '__main__':
    test()
1 Like

Wow, well done! Hopefully in the future we’ll have a better answer…

I’m afraid I still have a problem. How can I get OnPartialResultSlot to work? Right now my action doesn’t ever get called, even if I add

/// This is a method that allows updates on the partial result
 std::shared_ptr<THn_t> &PartialUpdate(unsigned int slot){
     return fHistos[slot];
}

to the SparseHelper declaration…

Hi,
Can you produce a self-contained reproducer that I can play with? I’m not sure I understand the situation.

Cheers,
Enrico

Sure. See below. It contains the same code above, plus a PartialUpdate class member, and adds a OnPartialResultSlot action to be called every event that prints “I HAVE BEEN CALLED” using cout. When I run this script, I see no such output, even though I should see “I HAVE BEEN CALLED” printed 10,000 times…


def declare():
    'declares class SparseHelper to enable RDataFrame to fill a THnSparse'
    'it is based on $ROOTSYS/tutorials/dataframe/df018_customActions.C'
    import ROOT
    
    ROOT.gROOT.ProcessLine(r'''
        // This is a custom action which respects a well defined interface. It supports parallelism,
        // in the sense that it behaves correctly if implicit multi threading is enabled.
        // We template it on:
        // - The type of the internal THnSparseT(s)
        // - The dimension of the internal THnSparseT(s)
        // Note the plural: in presence of a MT execution, internally more than a single THnSparseT is created.
        // Note also that when booking, the last column will be understood as the weight
        template <typename T, unsigned int NDIM>
        class SparseHelper : public ROOT::Detail::RDF::RActionImpl<SparseHelper<T, NDIM>> {
        public:
            /// This is a handy, expressive shortcut.
            using THn_t = THnSparseT<T>;
            /// This type is a requirement for every helper.
            using Result_t = THn_t;
        
        private:
            std::vector<std::shared_ptr<THn_t>> fHistos; // one per data processing slot
        
        public:
            /// This constructor takes all the parameters necessary to build the THnSparseTs. In addition, it requires the names of
            /// the columns which will be used.
            SparseHelper(std::string_view name, std::string_view title, std::array<int, NDIM> nbins, std::array<double, NDIM> xmins,
                      std::array<double, NDIM> xmax)
            {
                const auto nSlots = ROOT::IsImplicitMTEnabled() ? ROOT::GetImplicitMTPoolSize() : 1;
                for (auto i : ROOT::TSeqU(nSlots)) {
                    fHistos.emplace_back(std::make_shared<THn_t>(std::string(name).c_str(), std::string(title).c_str(),
                                                                 NDIM, nbins.data(), xmins.data(), xmax.data()));
                    (void)i;
               }
            }
            SparseHelper(SparseHelper &&) = default;
            SparseHelper(const SparseHelper &) = delete;
            std::shared_ptr<THn_t> GetResultPtr() const { return fHistos[0]; }
            void Initialize() {}
            void InitTask(TTreeReader *, unsigned int) {}
            /// This is a method executed at every entry
            template <typename... ColumnTypes>
            void Exec(unsigned int slot, ColumnTypes... values)
            {
                // Since THnSparseT<T>::Fill expects a double*, we build it passing through a std::array.
                std::array<double, sizeof...(ColumnTypes)> _valuesArr{static_cast<double>(values)...};
                // We understand the last item in the list as a weight
                std::array<double, sizeof(_valuesArr) - 1> valuesArr;
                for(int i=0; i<sizeof(_valuesArr) - 1; i++){
                    valuesArr[i] = _valuesArr[i];
                }
                auto w = _valuesArr.back();
                fHistos[slot]->Fill(valuesArr.data(), w);
            }
            /// This is a method that allows updates on the partial result
             std::shared_ptr<THn_t> &PartialUpdate(unsigned int slot){
                 return fHistos[slot];
            }
            /// This method is called at the end of the event loop. It is used to merge all the internal THnSparseTs which
            /// were used in each of the data processing slots.
            void Finalize()
            {
                auto &res = fHistos[0];
                for (auto slot : ROOT::TSeqU(1, fHistos.size())) {
                    res->Add(fHistos[slot].get());
                }
            }
            
            std::string GetActionName(){
                return "SparseHelper";
            }
        };
        ''')


def fill(df, hnm, htit, brlist, weightbr):
    'create THnSparse(hnm, htit, ...) wrapped in RResultPtr'
    
    import ROOT
    
    br_and_weight_list = brlist + [weightbr]
    
    ROOT.gROOT.ProcessLine('''
        ROOT::RDF::RResultPtr<THnSparseT<TArrayD> > CppFunctionThatCallsFillFor_{hnm}(ROOT::RDF::{dfclassnm} df)
        {{
            // Our Helper type: templated on the internal THnSparseT type, the size, the types of the columns
            // we'll use to fill.
            using Helper_t = SparseHelper<TArrayD, {ndims}>;
            
            Helper_t helper{{"{hnm}",
                            "{htit}",
                            {{{nbins}}},
                            {{{xmins}}},
                            {{{xmaxs}}}}};
            
            // We book the action: it will be treated during the event loop.
            auto myTHnSparseT = df.Book<{dattyps}>(std::move(helper), {{{br_and_weight_nms}}});
            
            return myTHnSparseT;
        }}
    '''.format(
        ndims=len(brlist),
        dfclassnm=df.__class__.__name__,
        hnm=hnm,
        htit=htit,
        nbins=','.join(str(x.nBins) for x in brlist),
        xmins=','.join(str(x.loBin) for x in brlist),
        xmaxs=','.join(str(x.hiBin) for x in brlist),
        dattyps=','.join(x.datatype for x in br_and_weight_list),
        br_and_weight_nms=','.join('"{}"'.format(x.branch) for x in br_and_weight_list))
    )
    return getattr(ROOT, 'CppFunctionThatCallsFillFor_' + hnm)(df)


def test():
    import ROOT
    
    ROOT.gROOT.SetBatch(True)
    ROOT.ROOT.EnableImplicitMT()
    declare()
    df = ROOT.RDataFrame(10000).Define('x0', 'gRandom->Uniform(-5, 5)').Define('x1', '(float)gRandom->Uniform(0, 10)').Define('x2', '(int)gRandom->Uniform(-100, 100)').Define('weight', '(double) 1.')
    
    class mybranch():
        def __init__(self, branch, nBins=0, loBin=0, hiBin=0, datatype=None):
            self.branch = branch
            self.nBins = nBins
            self.loBin = loBin
            self.hiBin = hiBin
            self.datatype = datatype
    
    brlist = [
        mybranch('x0', 10, -5, 5, datatype='double'),
        mybranch('x1', 10, 0, 10, datatype='float'),
        mybranch('x2', 10, -100, 100, datatype='int'),
    ]
    wbr = mybranch('weight', datatype='double')
    myhist1 = fill(df, 'myhist1', 'myhist1', brlist[:1], wbr)
    myhist2 = fill(df, 'myhist2', 'myhist2', brlist[:2], wbr)
    myhist3 = fill(df, 'myhist3', 'myhist3', brlist[:3], wbr)
    global tokenh  # so ROOT::TPython can get it
    tokenh = myhist1
    ROOT.gInterpreter.ProcessLine(
        'std::mutex bar_mutex_{unm};'
        'ROOT::RDF::RResultPtr<THnSparseT<TArrayD> > * cpph_{unm} = (ROOT::RDF::RResultPtr<THnSparseT<TArrayD> > * )TPython::Eval("tokenh");'
        'cpph_{unm}->OnPartialResultSlot({freq}, [&bar_mutex_{unm}](unsigned int, THnSparseT<TArrayD> &){{'
        'std::lock_guard<std::mutex> l(bar_mutex_{unm});'
        'cout<<"I HAVE BEEN CALLED"<<endl;'
        '}});'
        ''.format(unm=tokenh.GetName(), freq=1)  # ensure unique definitions
    )
    myhist1.Sumw2()
    c = ROOT.TCanvas()
    hs = ROOT.THStack()
    hs.Add(myhist1.Projection(0))
    hs.Add(myhist2.Projection(1))
    hs.Add(myhist3.Projection(2))
    hs.Draw('nostack plc pmc')
    c.BuildLegend()
    c.SaveAs('test.png')


if __name__ == '__main__':
    test()

Ah, this one’s tricky: .format(unm=tokenh.GetName(), freq=1) triggers the event loop before the string is interpreted by ProcessLine changing tokenh.GetName() to e.g. "foo" gets you the printouts.

Very tricky. Maybe we need an opt-in non-lazy mode…?

1 Like

Thank you much! I did not realize GetName triggered the event loop. Thanks again for all your help on this.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.