Filling objects in pyROOT

vcroft · June 19, 2020, 1:28pm

Howdy, I’m wondering if there is an option/solution to the [fillAnyObject](https://root.cern/doc/master/df005__fillAnyObject_8C.html) in pyROOT yet? I’m trying to fill [RooUnfoldResponse](https://statisticalmethods.web.cern.ch/StatisticalMethods/unfolding/RooUnfold_01-Methods_PY/)Matrices for unfolding.

Currently the only way I can see is to make all the 1D and 2D histograms and then evaluate each of them with a:

for hist in hist_definitions:
   histA = Usual rdf.Histo1D(procedure)
   histB = Usual rdf.Histo1D(procedure)
   histC = Usual rdf.Histo2D(procedure)
   
   response = ROOT.RooUnfoldResponse(histA.GetVal(),histB.GetVal(),histC.GetVal)

which seems horribly inefficient as it will have to evaluate the whole framework for every response I want to make (there will be a couple). Any tips? If there is a way to do that, can I also wrap the Miss() function too (for distributions that have x but not y)?

There’s a lot of code involved currently, so moving everything to C++ isn’t an option, but a wrapper function might be OK, no?


ROOT Version: 6.20
Platform: Any/all
Compiler: Python3

vcroft · June 19, 2020, 1:35pm

That’s GetValue(), and I also use it for evaluating my legends in canvases… is this as inefficient as I’m scared it is?

eguiraud · June 19, 2020, 2:28pm

Hi Vince,
yes, I’m afraid that runs an event loop for every hist in hist_definitions, because of the GetValue calls.

I’d suggest to make two loops (and push them to C via list comprehensions if they are large):

histos = [(rdf.Histo1D(...), rdf.Histo1D(...), rdf.Histo1D(...)) for hist in hist_definitions]
responses = [ROOT.RooUnfoldResponse(*map(lambda h: h.GetValue(), hs)) for hs in histos]

Unrelated: if you are running large RDF computation graphs from PyROOT, switch to 6.22 as soon as it’s out (conda-forge already has it), there are major speed improvements in RDF just-in-time compilation.

Cheers,
Enrico

vcroft · June 19, 2020, 2:44pm

Okey dokey, and about the Fill method? is that still only for the C++ purists?

Thanks for the heads up about the 6.22 being on conda forge. I’ve had about 4 hours worth of discussions over a cluster-wide installation of 6.20 just this week.

eguiraud · June 19, 2020, 2:59pm

Fill is not straightforward to call from PyROOT, and it has the undocumented requirement (i.e. the bug) that it only supports objects that inherit from TH1.

I don’t understand where the generic filling comes into play in your snippet above though?

vcroft · June 19, 2020, 3:07pm

in the use case above:

The object in question is a RooUnfoldResponse which has a Fill method attached to an internal TH2F. Normally the code to populate it is like this:


def smear(xt):
  xeff = 0.3 + (1.0-0.3)/20*(xt+10.0)  #  efficiency
  x = ROOT.gRandom.Rndm()
  if x>xeff: return None
  xsmear = ROOT.gRandom.Gaus(-2.5,0.2)     #  bias and smear
  return xt + xsmear

response = ROOT.RooUnfoldResponse (40, -10.0, 10.0)

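for i in range(100000):  # range rather than xrange, since this runs under Python 3
  xt = ROOT.gRandom.BreitWigner(0.3, 2.5)
  f0.Fill(xt)  # f0: presumably a histogram booked elsewhere (not shown here)
  x = smear(xt)
  if x is not None:
    response.Fill(x, xt)  # event passed the efficiency cut: fill (measured, truth)
  else:
    response.Miss(xt)  # event lost to inefficiency: record the truth value only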

This can be bypassed by passing the TH1 and TH2s directly like before, but then I need to call the loop like a bajillion times.

eguiraud · June 19, 2020, 3:20pm

but then I need to call the loop like a bajillion times.

Note that my two-liner above produces all histograms in one event loop (by delaying the calls to GetValue until after all Histo1D and Histo2D calls have been made).

If RooUnfoldResponse does not inherit from TH1, you can work around the bug that requires it by defining a little helper object that wraps a RooUnfoldResponse and does inherit from TH1, like in this post.

Hope this helps!
Enrico

vcroft · June 19, 2020, 4:04pm

Okey dokey I’ll give that a go and let you know!

vcroft · July 3, 2020, 11:29am

Hello again, I’ve been trying to follow the suggestion:

eguiraud:

I’d suggest to make two loops (and push them to C via list comprehensions if they are large):
histos = [(rdf.Histo1D(...), rdf.Histo1D(...), rdf.Histo1D(...)) for hist in hist_definitions]
responses = [ROOT.RooUnfoldResponse(*map(lambda h: h.GetValue(), hs)) for hs

and am struggling a little.

I have a dictionary of variables each of which has a dictionary of histograms attached:

variables = {'pt':{'nominal':rdf_nom.Histo1D(...),
                 'up_var':rdf_up.Histo1D(...),
                 'down_var':rdf_down.Histo1D(...)},
             'njets':{'nominal':rdf_nom.Histo1D(...),
                 'up_var':rdf_up.Histo1D(...),
                 'down_var':rdf_down.Histo1D(...)}}

and am looking for a way to do an efficient .GetValue() on the histograms in this dictionary.

Is there some way to just evaluate all histograms defined in a RDF?

currently I have a:

histograms = {}
for v in variables:
    histograms[v] = {}
    for h in variables[v]:
        histograms[v][h] = variables[v][h].GetValue()

or similar enough, and it is very very slow.

eguiraud · July 3, 2020, 11:44am

The first GetValue called triggers the event loop that fills all histograms, see e.g. the “Executing multiple actions in the same event loop” section of the docs.

Subsequent GetValue calls will just return the already filled histogram; they will not trigger another event loop.
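
To make the booking-versus-running distinction concrete, here is a toy sketch in plain Python. It does not use ROOT at all: `ToyFrame` and `ToyResult` are hypothetical stand-ins that only mimic the behaviour described above (book several results, run one loop on the first GetValue), and the real RDataFrame internals are certainly different.

```python
class ToyResult:
    """Hypothetical stand-in for a lazily filled result handle."""

    def __init__(self, frame, column):
        self._frame = frame
        self.column = column
        self.entries = []
        self.filled = False

    def GetValue(self):
        # Only a result that has not been filled yet triggers a loop.
        if not self.filled:
            self._frame.run()
        return self.entries


class ToyFrame:
    """Hypothetical stand-in for a data frame that books actions lazily."""

    def __init__(self, rows):
        self._rows = rows
        self._booked = []
        self.n_loops = 0  # how many times the data were traversed

    def Histo1D(self, column):
        # Booking only registers the request; no data are read here.
        result = ToyResult(self, column)
        self._booked.append(result)
        return result

    def run(self):
        # One pass over the data fills every result booked so far.
        pending = [r for r in self._booked if not r.filled]
        self.n_loops += 1
        for row in self._rows:
            for r in pending:
                r.entries.append(row[r.column])
        for r in pending:
            r.filled = True


rows = [{"pt": 10.0, "njets": 2}, {"pt": 25.0, "njets": 3}]
frame = ToyFrame(rows)

# Step 1: book everything, keeping the dictionary layout.
booked = {name: frame.Histo1D(name) for name in ("pt", "njets")}

# Step 2: only now ask for values; the first GetValue runs the single loop.
values = {name: res.GetValue() for name, res in booked.items()}
print(frame.n_loops)
```

The point of the sketch is the ordering: as long as every booking call happens before the first GetValue, the number of passes over the data stays at one regardless of how many results were booked.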

Can you tell where time is spent in your application exactly? And in case you have many such histograms and just-in-time compilation of the event loop code is a bottleneck, can you try v6.22 to see if there is an improvement?

Cheers,
Enrico

vcroft · July 3, 2020, 12:02pm

Gotcha, ok the slow down is likely because I’m reading about a hundred trees into rdfs and applying the same computational graph on them, whereas before it was only calling the graph on the few trees I was plotting.

Ultimately this step comes before building a large RooFit model from them all so it’s good to check that this is only executing everything once.

ROOT 6.22 does indeed seem to be faster (but the docker images need updating as they’re still in ROOT 6.20 so I had to do a manual update which took a long time)

Is there an example of what you mean by “push them to C via list comprehensions if they are large”?

eguiraud · July 3, 2020, 12:38pm

True! v6.22 was officially announced yesterday, we’ll update the docker images in the coming days.

It’s a python thing: when building lists of things, list comprehensions are faster than for loops:

In [1]: l = []

In [2]: %timeit for i in range(100000): l.append(i)
7.58 ms ± 64.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit l = [ i for i in range(100000) ]
3.41 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It’s not actually C, but it’s less Python bytecode.
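
The same idea carries over to nested dictionaries via dict comprehensions. A minimal sketch, assuming only that every leaf object exposes a GetValue() method; `FakeResult` is a hypothetical stub used here so the snippet runs without ROOT, and the string payloads stand in for histograms.

```python
class FakeResult:
    """Hypothetical stub: anything with a GetValue() method would do."""

    def __init__(self, value):
        self._value = value

    def GetValue(self):
        return self._value


# Same layout as the variables dictionary above: variable -> variation -> result.
variables = {
    "pt": {"nominal": FakeResult("pt_nom"), "up_var": FakeResult("pt_up")},
    "njets": {"nominal": FakeResult("nj_nom"), "up_var": FakeResult("nj_up")},
}

# Nested dict comprehension: keeps the keys, replaces each result by its value.
histograms = {
    var: {name: result.GetValue() for name, result in by_name.items()}
    for var, by_name in variables.items()
}
```

This keeps the dictionary structure intact and avoids the accidental overwrite that can happen when the inner dictionary is reassigned inside a for loop.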

Now that I know your use-case a little bit better though (a hundred medium-sized separate computation graphs) I am pretty sure that’s not a bottleneck.

Cheers,
Enrico

vcroft · July 3, 2020, 12:51pm

Yeah I’ve been trying that with the code all morning playing with lambdas and filters and the like. Problem is that I need them as dictionaries, so I either start merging rdfs, python filters, numpy named arrays or I just spend that extra few milliseconds on enjoying life. I’ve got it down to three lines and only one for loop and that will have to do.

Yeah, real non-problems here. My computation of several hundred variables from a few GB of data from a hundred different trees takes me two minutes on my laptop! I want it faster, damn it! And I don’t want to run it on a bigger machine either.