EnableImplicitMT() prevents reading nested TTree from XrootD file with RDataFrame

mwilkins · March 23, 2022, 5:15pm

Continuing the discussion from EnableImplicitMT() prevents reading XrootD file with RDataFrame:

This issue was only partially resolved. It still occurs for a TTree inside a TDirectory. Reproducer:

import ROOT as r

r.ROOT.EnableImplicitMT()

fpath = "root://path/to/test.root"
r.RDataFrame(10).Define("e", "rdfentry_").Snapshot("testd/testt", fpath)
opts = r.RDF.RSnapshotOptions()
opts.fMode = "UPDATE"
r.RDataFrame(10).Define("e", "rdfentry_").Snapshot("testt", fpath, "", opts)

f = r.TFile.Open(fpath)
t = f.Get("testt")
td = f.Get("testd/testt")
rdft = r.RDataFrame(t)
rdftd = r.RDataFrame(td)
ht = rdft.Histo1D("e")
htd = rdftd.Histo1D("e")

ht.GetMean()  # works fine
htd.GetMean()  # produces error

Output:

---------------------------------------------------------------------------
runtime_error                             Traceback (most recent call last)
Input In [3], in <module>
     15 htd = rdftd.Histo1D("e")
     16 ht.GetMean()  # works fine
---> 17 htd.GetMean()

runtime_error: TH1D& ROOT::RDF::RResultPtr<TH1D>::operator*() =>
    runtime_error: TTreeProcessorMT::Process: an error occurred while getting tree "//path/to/test.root:/testd/testt" from file "root://path/to/test.root"

ROOT Version: 6.24/06
Platform: Not Provided
Compiler: Not Provided

danj1011 · March 23, 2022, 5:22pm

RDataFrame with a tree inside a directory in files used to populate a TChain seems to be working for me, but all I’ve tried so far is to load the RDataFrame and then run a ForEachSlot with a custom lambda.

But I don’t see a huge amount of MT activity so I wonder if the TChain source for the RDataFrame is limiting the MT aspect…

eguiraud · March 23, 2022, 6:19pm

Hi @mwilkins ,
thank you for the report, I was not aware of this problem. This is now tracked as issue #10216 in root-project/root on GitHub ("[DF] EnableImplicitMT() prevents reading TTree in sub-directory from XrootD file"). I will try to work on a fix in time for 6.26.02, which should be out in O(week).

Cheers,
Enrico

eguiraud · March 23, 2022, 6:26pm

Thank you @danj1011 , you really need the combination of root://, EnableImplicitMT() and tree in subdirectory for this problem to occur, maybe you are missing the filename starting with root://?

No, TChain won’t limit the multi-thread scaling. It might be a few things; three that I can think of off the top of my head:

- RDataFrame parallelizes over “clusters” of entries (visible e.g. with tree->Print("clusters")); a cluster is the unit of compression/decompression in TTree. If, for example, the TChain is composed of 10 trees and each tree is small and only has 2 clusters (for a total of 20 units of parallelization), you won’t see good scaling above a few cores (2-4). Generally we try to give each core 4 to 10 clusters to work with to have good workload balance.
- I/O bandwidth acts as a bottleneck, so all threads spend a lot of their time waiting for data to arrive.
- If you are using many threads (>100) and the total runtime of your application is on the order of seconds, up until v6.26.00 you might spend most of your time in some initialization overhead. We have now removed that overhead in master and will try to backport those performance improvements to v6.26.02 as well.
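The cluster arithmetic in the first point can be sketched in Python. This is only an illustration: `count_clusters` and `useful_thread_range` are hypothetical helper names, the first one assumes a single TTree (not a whole TChain) that exposes `GetEntries()` and `GetClusterIterator()`, and the 4 to 10 clusters per thread are the rule of thumb quoted above, not a hard limit.

```python
def count_clusters(tree):
    """Count the entry clusters of a single TTree by walking its cluster iterator."""
    n_entries = tree.GetEntries()
    cluster_iter = tree.GetClusterIterator(0)
    n_clusters = 0
    # Next() returns the first entry of the next cluster; the loop stops
    # once that start entry reaches the end of the tree.
    while cluster_iter.Next() < n_entries:
        n_clusters += 1
    return n_clusters


def useful_thread_range(n_clusters, min_per_thread=4, max_per_thread=10):
    """Return (fewest, most) thread counts that leave each thread
    roughly min_per_thread to max_per_thread clusters (assumed rule of thumb)."""
    fewest = max(1, n_clusters // max_per_thread)
    most = max(1, n_clusters // min_per_thread)
    return fewest, most
```

For the example of 20 clusters in total, `useful_thread_range(20)` gives `(2, 5)`, which is in line with the "few cores" estimate above.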

Let me know (in another thread) if you think you should get better CPU usage than you are getting – if you have a reproducer we can take a look on our side.

Cheers,
Enrico

eguiraud · March 23, 2022, 6:54pm

As a workaround you can pass treename and filename directly, e.g.:

auto rdftd = ROOT::RDataFrame("testd/testt", "root://eosuser.cern.ch//eos/user/e/eguiraud/scratch/test.root");

This should work.
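For the Python reproducer above, the same workaround might look like the sketch below. `rdf_from_names` is a hypothetical helper name introduced only for illustration; the point is that the tree name and file name are passed as strings instead of a TTree object fetched with `f.Get(...)`.

```python
def rdf_from_names(treename, filename):
    """Build an RDataFrame from a tree name and a file name (both strings)."""
    # Imported lazily so the helper can be defined even where ROOT is absent.
    import ROOT

    # The tree name may include the directory, e.g. "testd/testt".
    return ROOT.RDataFrame(treename, filename)
```

In the reproducer this would replace the `td = f.Get("testd/testt")` and `rdftd = r.RDataFrame(td)` lines with `rdftd = rdf_from_names("testd/testt", fpath)`.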

eguiraud · March 23, 2022, 7:49pm

This patch should fix the problem, please try it out if you can (or try a ROOT nightly build in a couple of days) and let me know if you see further problems.

Cheers,
Enrico

mwilkins · March 23, 2022, 8:09pm

Unfortunately, I don’t have a good way to build ROOT from source right now (and the link to the nightlies in your post does not seem to be working).

eguiraud · March 23, 2022, 8:25pm

(fixed the link)

danj1011 · March 24, 2022, 4:04pm

Thanks @eguiraud ! I had root://, EnableImplicitMT() and a tree in a subdirectory, but I’m using 6.24.06 so I don’t know why I don’t see any problems.

Great that the TChain shouldn’t interfere. Indeed, the bottleneck was xrootd access to remote sites. I did see a huge improvement going from 100 threads TO 1 thread (!) [i.e. 6 minutes → 2 seconds] so perhaps this initialization overhead is manifested here. I’ll be pleased to try out 6.26.00 when it’s available through the mainstream LCG views.

eguiraud · March 24, 2022, 4:37pm

That’s surprising, that case was really broken.

On one hand, if one thread takes 2 seconds to process your data, you really shouldn’t throw 100 threads at it: they will not have enough “meat” to divide between themselves. On the other hand, that slowdown is terrible. You should see some overhead, not that. I hope the situation will be a lot better with v6.26.02, a patch release that is coming out in O(week) and will include some scaling improvements, but v6.26.00 should already behave better than v6.24.06. If not, please let us know.

Cheers,
Enrico

danj1011 · March 24, 2022, 4:49pm

Yes, I’m puzzled!

As for the threads, I agree, but I came at it from the opposite direction (100 threads, then tried 1). When the LCG view is released I’ll give it a go. Thanks!

eguiraud · March 24, 2022, 4:57pm

I would like to understand this better, but let’s stop hijacking this thread (sorry @mwilkins !). I’ll send you a few questions in a private message on the forum.

danj1011 · March 24, 2022, 5:09pm

Yes, sorry @mwilkins (though, in my defence, @mwilkins did ask me offline to chime in here)

system · April 7, 2022, 5:09pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.