Commands on RResultPtr<TH1D> object stuck

Hello,

I am calling Histo1D() on an RDataFrame, which returns an RResultPtr<TH1D> object. Calling any TH1D method on it causes my program to lock up for a very long time. For example, after histo = rdf.Histo1D(), a call such as histo.Draw() or histo.Clone("name") gets stuck. I have run other scripts that draw an RResultPtr<TH1D> and they worked properly.

Note that I am using RDataFrame from Python. I am running over a TChain, and for testing purposes I use Range(10).

Here is a more exact snippet of the code:

first = ""
for cur_slice in ["361024", "361025", "361026", "361027"]:
    print("Cur slice is: " + cur_slice)
    slice_branch = "hcand_boosted_pt_lead_" + cur_slice
    slice_weight = "total_weight_" + cur_slice
    # Book per-slice columns and the weighted histogram (lazy: no event loop yet)
    rdf_mod = (rdf_full.Filter("hcand_boosted_pt.size()>0")
                       .Filter("mcChannelNumber==" + cur_slice)
                       .Define(slice_branch, "hcand_boosted_pt.at(0)")
                       .Define(slice_weight, "mcEventWeight*weight_pileup*" + str(alg.ret_slice_weight(cur_slice))))
    pt = rdf_mod.Histo1D(slice_branch, slice_weight)
    # Draw() is the first access to the result, so it triggers the event loop
    pt.Draw(first)
    first = "same"

ROOT Version: 6.14
Python: 3.6
Platform: Manjaro
Compiler: Not Provided


Hi @dabbott, how long is long? The first time you call a method on an RResultPtr, the event loop is run to produce the desired histogram result.

It's true that Range(10) should make the event loop very short; at most you could have a few seconds of just-in-time compilation driving the runtime up.
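
For reference, a minimal sketch of that lazy behaviour (the tree, file and column names here are made up for illustration):

import ROOT

# Hypothetical tree/file/column names, just to illustrate when work happens.
rdf = ROOT.RDataFrame("mytree", "myfile.root")

# Booking is instantaneous: nothing is read yet, histo is an RResultPtr<TH1D>.
histo = rdf.Range(10).Histo1D("some_column")

# The first access to the result (Draw, GetValue, GetEntries, ...) triggers
# the event loop, plus any just-in-time compilation of Filter/Define strings.
histo.Draw()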

I don’t see anything out of place in the snippet you posted either.

It would be nice to have a minimal code reproducer with some data to check what’s going on.

Cheers,
Enrico

Hi Enrico,

Thank you for the response. In that loop, for example, it took anywhere from 30 minutes to an hour. I was very skeptical at first that it was simply a server-side problem, but running another Python script with a ROOT Draw() (no RDF) finished in about 2 minutes.

Is there anything else I can provide you that might help?

Thanks,
Dale

Hi @dabbott,
I see, that's definitely something worth looking into.
As it is, I have no way to investigate what's wrong: ideally I'd need access to (a small fraction of) the data and a minimal snippet of code that reproduces the issue, so I can reproduce the problem and debug it!

Cheers,
Enrico

Hi Enrico,

I believe I have narrowed it down to the use of a TChain. Right now I am using 31 similar TTrees. I ran some tests, playing around with the TChain size, to get an idea of what is happening here. Here is what I see as far as run time goes (local, single node):

1 file   ~40s
2 files  ~1m30s
9 files  ~6m9s
16 files ~7m9s
23 files ~20m40s
31 files ~21m20s

Obviously this is not exact science, but it is revealing. Note that this is much faster than the last two days, implying that at least part of the slowdown is just due to my server. That said, is there a nice way to run my script over the TChain on the local batch? This could improve my run time as well.
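
A rough sketch of the kind of timing loop behind the numbers above (the tree name and file names are placeholders):

import time
import ROOT

# Placeholder tree name and file list; in practice these are the 31 local ntuples.
files = ["ntuple_%d.root" % i for i in range(31)]

chain = ROOT.TChain("nominal")
for f in files[:9]:              # vary this slice to change the chain size
    chain.Add(f)

rdf = ROOT.RDataFrame(chain)
histo = rdf.Histo1D("hcand_boosted_pt_lead")   # booking only, no work yet

start = time.time()
histo.GetValue()                 # first access triggers the event loop
print("Event loop over %d files took %.1f s" % (chain.GetNtrees(), time.time() - start))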

Dale

Hi,

"is there a nice way to run my script on the tchain on the local batch"

I'm not 100% sure what you mean, but if you want to avoid accessing data over the network, you have to copy it to the machine where you want to process it and then run on the local files.
You can use xrdcp or just scp to copy the files.

Assuming the files are all of similar size, and that 40 seconds is a reasonable runtime for one of them, we can estimate that it will take 40*31 = 1240 seconds to process 31 files, which is about 21 minutes. So the runtimes you measured look pretty reasonable…?

Cheers,
Enrico

P.S.
I don’t understand your second post anymore: TTree::Draw took 2 minutes running on what exactly? The 31 files over the network?

Hi Enrico,

Regarding the TTree::Draw taking 2 minutes, I was referring to another, older script I had run in the past, which called Draw() on an RResultPtr<TH1D>; I just ran it again as a comparison. I believe the difference here is the TChain. These files are local, so accessing the data should not be a problem. Thus I now believe the problem is calling functions on the RResultPtr<TH1D> with a number of files in the TChain (i.e. time spent running ~ number of files).

When I said "run on the local batch" I meant that I want to run on my institution's batch system (i.e. using qsub to submit a job). I know from past experience that this was a simple process for TSelectors with the help of PROOF (see https://root.cern.ch/using-tselector-proof).

Dale

Hi,
uhm, if a script that runs on local files and does less work takes 2 minutes, that says nothing about how long a different script that runs on many remote files should take :slight_smile:

So at this point I see nothing wrong with your script taking 20 minutes to run (and the longer runtimes you experienced might very well have been caused by a slow network).

Now, if you are asking how to speed up execution when you are limited by bandwidth, one way is certainly to run on data on a local disk rather than accessing it over the network, especially if you know you are going to run over the same dataset many times (you pay the cost of the network access only once).

PROOF is a facility to distribute execution of ROOT jobs on multiple machines.
Running on several machines at once is in fact another way to divide the runtime cost of accessing the network.
RDataFrame does not support distributed execution yet (there is work in progress).

If you don't mind running on a single machine, then you can certainly submit an RDF analysis to your institution's job queue: an RDF analysis is a program like any other, so if ROOT is installed on the batch system and your data is reachable from it, the analysis should just run.
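
For illustration, a self-contained script along these lines (the tree name, file paths and column names are placeholders) can be submitted with qsub like any other program, provided ROOT is available and the data is reachable from the worker node:

#!/usr/bin/env python
# Minimal, self-contained RDF job suitable for batch submission.
import ROOT

chain = ROOT.TChain("nominal")              # placeholder tree name
chain.Add("/local/data/ntuple_1.root")      # placeholder file paths
chain.Add("/local/data/ntuple_2.root")

rdf = ROOT.RDataFrame(chain)
histo = (rdf.Filter("hcand_boosted_pt.size()>0")
            .Define("pt_lead", "hcand_boosted_pt.at(0)")
            .Histo1D("pt_lead"))

# Accessing the result runs the event loop; write it out so the
# batch job leaves an output file to collect.
out = ROOT.TFile("output.root", "RECREATE")
histo.GetValue().Write()
out.Close()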

Hope this clarifies things a bit.
Cheers,
Enrico
