EnableImplicitMT() prevents reading nested TTree from XrootD file with RDataFrame

Continuing the discussion from EnableImplicitMT() prevents reading XrootD file with RDataFrame:

This issue was only partially resolved. It still occurs for a TTree inside a TDirectory. Reproducer:

import ROOT as r

r.EnableImplicitMT()  # required to trigger the problem (the error comes from TTreeProcessorMT)
fpath = "root://path/to/test.root"
r.RDataFrame(10).Define("e", "rdfentry_").Snapshot("testd/testt", fpath)
opts = r.RDF.RSnapshotOptions()
opts.fMode = "UPDATE"
r.RDataFrame(10).Define("e", "rdfentry_").Snapshot("testt", fpath, "", opts)

f = r.TFile.Open(fpath)
t = f.Get("testt")
td = f.Get("testd/testt")
rdft = r.RDataFrame(t)
rdftd = r.RDataFrame(td)
ht = rdft.Histo1D("e")
htd = rdftd.Histo1D("e")

ht.GetMean()  # works fine
htd.GetMean()  # produces error


runtime_error                             Traceback (most recent call last)
Input In [3], in <module>
     15 htd = rdftd.Histo1D("e")
     16 ht.GetMean()  # works fine
---> 17 htd.GetMean()

runtime_error: TH1D& ROOT::RDF::RResultPtr<TH1D>::operator*() =>
    runtime_error: TTreeProcessorMT::Process: an error occurred while getting tree "//path/to/test.root:/testd/testt" from file "root://path/to/test.root"

ROOT Version: 6.24/06
Platform: Not Provided
Compiler: Not Provided

RDataFrame with a tree inside a directory, in files used to populate a TChain, seems to work for me, although all I’ve tried so far is loading the RDataFrame and running a ForeachSlot with a custom lambda.

But I don’t see a huge amount of MT activity, so I wonder whether using a TChain as the RDataFrame source limits the MT aspect…

Hi @mwilkins ,
thank you for the report, I was not aware of this problem. This is now tracked as [DF] EnableImplicitMT() prevents reading TTree in sub-directory from XrootD file · Issue #10216 · root-project/root · GitHub. I will try to work on a fix in time for 6.26.02, which should be out in O(week).


Thank you @danj1011. You really need the combination of root://, EnableImplicitMT(), and a tree in a subdirectory for this problem to occur; perhaps your filename does not start with root://?

No, TChain won’t limit the multi-thread scaling. It might be a few things; three that come to mind:

  • RDataFrame parallelizes over “clusters” of entries (visible e.g. with tree->Print("clusters")), the unit of compression/decompression in TTree. If, for example, the TChain is composed of 10 trees and each tree is small and only has 2 clusters (for a total of 20 units of parallelization), you won’t see good scaling above a few cores (2-4). Generally we try to give each core 4 to 10 clusters to work with, for good workload balance.
  • I/O bandwidth acts as a bottleneck, so all threads spend most of their time waiting for data to arrive.
  • If you are using many threads (>100) and the total runtime of your application is on the order of seconds, up until v6.26.00 you might spend most of your time in initialization overhead. We have now removed that overhead in master and will try to backport those performance improvements to v6.26.02 as well.
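The first point above can be illustrated with a little arithmetic: the total number of parallel work units is the cluster count summed over the chain, and the rule of thumb of roughly 4-10 clusters per core bounds the useful thread count. A plain-Python sketch (the helper and the numbers are hypothetical, not a ROOT API; in ROOT the real cluster layout is shown by tree->Print("clusters")):

```python
# Rough sketch of the cluster-based parallelism budget described above.
# Figures are illustrative; query the real cluster layout from the TTree.

def useful_threads(clusters_per_tree, n_trees, clusters_per_core=4):
    """Upper bound on threads that still get ~clusters_per_core work units each."""
    total_units = clusters_per_tree * n_trees  # total units of parallelization
    return max(1, total_units // clusters_per_core)

# A TChain of 10 small trees with 2 clusters each -> 20 work units,
# so only a handful of threads can be kept busy.
print(useful_threads(clusters_per_tree=2, n_trees=10))  # -> 5
```

With only 20 work units, running 100 threads just adds scheduling overhead; most threads would sit idle.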

Let me know (in another thread) if you think you should get better CPU usage than you are getting – if you have a reproducer we can take a look on our side.


As a workaround you can pass treename and filename directly, e.g.:

auto rdftd = ROOT::RDataFrame("testd/testt", "root://eosuser.cern.ch//eos/user/e/eguiraud/scratch/test.root");

This should work.

This patch should fix the problem. Please try it out if you can (or try a ROOT nightly build in a couple of days) and let me know if you see further problems.


Unfortunately, I don’t have a good way to build ROOT from source right now (and the link to the nightlies in your post does not seem to be working).

(fixed the link)

Thanks @eguiraud ! I had root://, EnableImplicitMT() and a tree in a subdirectory, but I’m using 6.24.06 so I don’t know why I don’t see any problems.

Great that the TChain shouldn’t interfere. Indeed, the bottleneck was xrootd access to remote sites. I did see a huge improvement going from 100 threads to 1 thread (!) [i.e. 6 minutes → 2 seconds], so perhaps the initialization overhead manifested here. I’ll be pleased to try out 6.26.00 when it’s available through the mainstream LCG views.

That’s surprising, that case was really broken :sweat_smile:

On the one hand, if one thread takes 2 seconds to process your data, you really shouldn’t throw 100 threads at it; they will not have enough “meat” to divide among themselves. On the other hand, that slowdown is terrible: you should see some overhead, but not that much. I hope the situation will be a lot better with v6.26.02 (a patch release coming out in O(week)), which will include some scaling improvements; v6.26.00 should already behave better than v6.24.06. If not, please let us know :slight_smile:


Yes, I’m puzzled!

As for the threads, I agree, but I came at it from the opposite direction (100 threads, then tried 1 :wink: ). When the LCG view is released I’ll give it a go. Thanks!

I would like to understand this better, but let’s stop hijacking this thread (sorry @mwilkins !). I’ll send you a few questions in a private message on the forum.

:laughing: Yes, sorry @mwilkins (though, in my defence, @mwilkins did ask me offline to chime in here :wink: )

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.