Multithreading and xrootd

Dear experts,

I’m trying to write a script that processes a TChain or TTree using multiple threads, and writes out a new trimmed tree. I’m aware that there are several ways of doing this with ROOT, but pointers to better solutions than the following prototype are very welcome.

Currently, I have a script (attached) that trims some branches from the input chain/tree using

ROOT::TProcessExecutor pex(n_workers);          // pool of n_workers processes
pex.Map(loop_and_fill, ROOT::TSeqU(n_workers)); // run loop_and_fill once per worker

Within the loop_and_fill lambda, a new TChain is created. If the chain reads local files, everything works fine, but the execution hangs (in a single thread at 100% CPU) as soon as I try to read from EOS with xrootd.
What is the correct way of opening the file(s) with xrootd?
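For reference, the worker task does roughly the following (a simplified sketch, not the attached code itself; the tree name, file URL and return value are placeholders):

auto loop_and_fill = [&](unsigned int worker) {
  // each worker process builds its own chain over the input files
  TChain chain("DecayTree");
  chain.Add("root://eoslhcb.cern.ch//eos/.../file.root"); // this is where it seems to hang
  // ... trim branches, CloneTree(0), fill this worker's slice of entries ...
  return 0u;
};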

Many thanks in advance,
Marian

tree_trimmerMT.cpp (5.0 KB)


ROOT Version: 6.12.06
Platform: x86_64-slc6-gcc62-opt
Compiler: g++ version 6.2.0


Hi Marian,
TProcessExecutor spawns multiple processes. For multi-threaded reading of ROOT files, your best options are either TDataFrame or, at a lower level, TTreeProcessorMT (which is what TDataFrame uses internally when running multi-threaded event loops).
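For illustration, a minimal TTreeProcessorMT skeleton looks roughly like this (file, tree and branch names are placeholders):

ROOT::EnableThreadSafety(); // enable thread-safe handling of ROOT globals
ROOT::TTreeProcessorMT tp("f.root", "t");
tp.Process([](TTreeReader &reader) {
   TTreeReaderValue<float> x(reader, "x"); // one reader value per branch you need
   while (reader.Next()) {
      // process *x; each thread works on its own range of entries
   }
});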

Both come with tutorials, in the “Data Frame” and “Multicore” sections here.

Hope this helps!
Enrico

EDIT: as for why your current implementation hangs, one easy way to find out is to attach a debugger to the process, press Ctrl-C and check the stack trace.
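For example, assuming you know the PID of the hanging process:

$ gdb -p <pid>   # attach gdb to the running process
(gdb) bt         # print the stack trace at the point where it is stuck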

Hi Enrico!
Thanks for the quick reply!
I’m not familiar with the technical details of multi-threading/-processing, so I’m agnostic and happy with anything that speeds up my ntuple processing :slight_smile:

I see that TTreeProcessorMT needs a TTreeReader object. I have avoided this up to now, since the code should be independent of the actual types of the branches. For example, in my current use case the inputs are mostly flat ntuples (up to some C-style arrays) with up to 1k branches. For further analysis, I want to apply some cuts and add a couple of new branches, but also keep a few hundred branches which should just remain as they are.
In my current implementation this is conveniently solved by

wchain->SetBranchStatus("*", 0);                    // deactivate all branches
for (const auto& var : *trim_vars)                  // loop over strings from a config file
  wchain->SetBranchStatus(var.first.data(), true);  // reactivate the branches to keep
...
auto otree = wchain->CloneTree(0);                  // clone only the active branches, with no entries

Is there a way to achieve this with TTreeReader to be able to use TTreeProcessorMT?

For now I’m trying to fix my implementation by having a closer look at ROOT::TThreadedObject<T>, which appears to be essential for TTreeProcessorMT.
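If I understand the docs correctly, it is used roughly like this (a sketch, with a histogram as example):

ROOT::TThreadedObject<TH1F> h("h", "h", 64, 0., 128.);
// within each thread, Get() returns that thread's own copy
h.Get()->Fill(42.);
// after the parallel part, merge the thread-local copies
auto hMerged = h.Merge();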

For completeness, here is the backtrace from gdb of my current implementation:

#0  0x00007ffff15413b3 in select () from /lib64/libc.so.6
#1  0x00007ffff7a5dc32 in TUnixSystem::UnixSelect(int, TFdSet*, TFdSet*, long) () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_93/ROOT/6.12.06/x86_64-slc6-gcc62-opt/lib/libCore.so
#2  0x00007ffff7a5e5b5 in TUnixSystem::DispatchOneEvent(bool) () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_93/ROOT/6.12.06/x86_64-slc6-gcc62-opt/lib/libCore.so
#3  0x00007ffff798f774 in TSystem::InnerLoop() () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_93/ROOT/6.12.06/x86_64-slc6-gcc62-opt/lib/libCore.so
#4  0x00007ffff6df3263 in TMonitor::Select() () from /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_93/ROOT/6.12.06/x86_64-slc6-gcc62-opt/lib/libNet.so
#5  0x000000000041740f in void ROOT::TProcessExecutor::Collect<unsigned int>(std::vector<unsigned int, std::allocator<unsigned int> >&) ()
#6  0x00000000004203f9 in main ()

Cheers,
Marian

EDIT: Using ROOT::TThreadedObject<T> (attached) does not solve the issue. In fact, even processing the local files raises problems (see below), but it runs through.

Error in <TBranch::GetBasket>: File: /auto/data/mstahl/DfromBtuples/XcPi_2011MagUp.root at byte:16380099745, branch:Xb_IPCHI2_OWNPV, entry:348190, badread=0, nerrors=9, basketnumber=93
 file probably overwritten: stopping reporting error messages
===>File is more than 2 Gigabytes
 file probably overwritten: stopping reporting error messages
===>File is more than 2 Gigabytes
selector_rootMT      INFO:   [=================                                 ] (34% - ETA: 35s - 4 threads)R__unzipLZMA: error 9 in lzma_code
Error in <TBasket::ReadBasketBuffers>: fNbytes = 12967, fKeylen = 86, fObjlen = 31912, noutot = 0, nout=0, nin=12881, nbuf=31912
Error in <TBranch::GetBasket>: File: /auto/data/mstahl/DfromBtuples/XcPi_2011MagUp.root at byte:16952478718, branch:Xc_K2_M, entry:235661, badread=1, nerrors=1, basketnumber=63

tree_trimmerThreadedObj.cpp (5.1 KB)

Hi @marian,
you are describing exactly the kind of processing TDataFrame (now renamed RDataFrame in master) is made for:

import ROOT
ROOT.ROOT.EnableImplicitMT() # enables implicit multi-threading

# open tree 't' in file 'f.root' for processing
df = ROOT.RDataFrame('t', 'f.root')
# apply some cuts
cut_df = df.Filter('x > 2 && x < 11').Filter('x % 2 == 0')
# add some new branches
df_with_defines = cut_df.Define('y', 'x * x').Define('z', 'y * x')
# write out the required branches
good_branches = ROOT.std.vector('string')()
for name in ('x', 'y', 'z'):
    good_branches.push_back(name)
df_with_defines.Snapshot('t', 'newfile.root', good_branches)

You can check out the documentation here.

Hope this helps,
Enrico

Hi Enrico,
Unfortunately I experienced several issues with TDataFrame. I used it like

ROOT::Experimental::TDataFrame d(*chain.get(), default_branches); // default_branches is a vector of strings
d.Snapshot(configtree.get("outtreename", itn).data(), ofn.data(), default_branches);

dataframetest.cpp (3.0 KB)

First, the resident memory went up to 1.5 GB; it is usually ~200 MB for local input data with the first prototype. Apart from that, several errors occurred:

Error in <TBranch::TBranch>: Illegal leaf: Xb_Cons_M/Xb_Cons_M[Xb_Cons_nPV]/F
...
Error in <TNetXNGFile::TNetXNGFile>: The remote file is not open
Error in <TObject::GetBasket>: File: root://eoslhcb.cern.ch//eos/lhcb/wg/BandQ/EXOTICS/Lb2LcD0K/DfromBBDTs/ntuples/Xb2XcPi/XcPi_2011MagUp.root at byte:0, branch:TObject, entry:120132, badread=1, nerrors=1, basketnumber=25
...
Info in <TObject::SaveAs>: C++ Macro file: Xc_K1_TRACK_nFTHits has been generated
 *** Break *** segmentation violation
...
Error in <TNetXNGFile::Close>: [ERROR] Invalid operation
...
Segmentation fault (core dumped)

Could it be that all errors are just due to the first one, i.e. the problem with an array-type branch?

Thanks,
Marian

Hi @marian,

  1. in your current multi-process implementation, the gdb stack trace you posted shows that the main process is stuck waiting for results from the worker processes (ROOT::TProcessExecutor::Collect and select() are polling functions). It’s possible that the workers crashed without ever returning a result. htop or similar tools can show whether this is the case, or you can simply put printouts in the task executed by the workers
  2. yes, TThreadedObjects are not mandatory but help a lot when using TTreeProcessorMT: they make it easy to handle thread-local objects
  3. it would probably be worth it to understand why you get those errors in sequential executions: parallel processing is tricky and will probably not work correctly if the unit of computation already prints warnings/error messages when you run it sequentially
  4. TDataFrame uses more memory because each thread opens its own files and processes its own data. To a degree this is expected; it’s the usual space-time tradeoff of parallel applications. In particular, when using Snapshot each thread fills a queue with processed buffers to be written to the output file, and if your disk does not keep up these buffers can accumulate in memory (see also ROOT-9133). The simplest solution, in case the memory usage is excessive, is to reduce the number of threads you use: you are bandwidth-limited anyway, and the extra threads might not be buying you much in terms of runtime
  5. regarding the error messages you get when running your TDataFrame application, they seem to be independent of one another: one is TBranch complaining about a malformed leaf, another is TNetXNGFile having problems with the remote file (which in turn probably causes the error in TObject::GetBasket). I cannot run dataframetest.cpp as I’m missing the data and some headers you use. It would be extremely useful if you could reduce this to a small reproducer that only uses ROOT components (e.g. no IOjuggler) and accesses public files, so I can run it and try to understand what’s wrong

Cheers,
Enrico

Hi Enrico,
Thanks a lot for your detailed answer! After breaking dataframetest.cpp down to the bare minimum, I managed to find a solution that works. The problem was indeed only due to the illegal leaf, which needed to be transformed. Sorry for the noise concerning this.

There are still some problems with my current solution [dataframetest.cpp (3.9 KB)] concerning local/xrootd files in terms of speed and memory usage. top displayed:

xrootd:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                           
17327 mstahl    20   0 5163m 2.5g  90m S 46.3  1.0   2:59.69 dataframetest
local:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                           
17677 mstahl    20   0 5253m 2.7g  88m R 931.3  1.1   5:04.88 dataframetest 

I don’t understand the timing numbers there, but using remote input took several minutes, while using local input took ~30s in real time.
Probably the timing can be improved by not using my cumbersome implementation of Define actions to transform the variables.
Do you have any suggestions on how to do this properly? I’ll try to use a single Foreach or ForeachSlot action, which presumably also makes it easier to debug the memory usage.
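I.e. something roughly like this, where x stands in for one of my branches:

df.ForeachSlot([](unsigned int slot, float x) {
  // slot identifies the worker thread, so it can be used
  // to index thread-local output objects
}, {"x"});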

Some comments concerning your points:

  1. top only showed a single process with 100% CPU usage. When I run on local files, it happily runs with n_workers processes.
  2. Could this be due to memory resident trees? Is there a suggested workaround?
  3. Sorry for not providing the full package. I didn’t mean to hide it. Can you access https://gitlab.cern.ch/mstahl/ntuple-gizmo/tree/MTtt ? I will post another version later which can be used as a ROOT-compilable script.

Cheers,
Marian

Hi Marian,
glad that you could get it working.

I don’t understand the timing numbers there

top reports the cumulative CPU time of all threads, not the wall-clock time! You can tell that the analysis is running in parallel since the CPU usage is at 931%.

using remote input took several minutes, while using local input took ~30s in real time.

What is the expected runtime for remote input? Is ~30s a good runtime for local input compared to other methods?

Probably the timing can be improved by not using my cumbersome implementation of Define actions to transform the variables.

How many Defines are we talking about? It’s true that in v6.12 Defines and Filters with string arguments are slow; the problem is resolved in master and the upcoming v6.14 release.
New quantities are defined with Define; there is no way around that.
What do you mean by “debug memory usage”? It seems OK from your top copy-paste.

  1. top only showed a single process with 100% CPU usage

exactly! where are the worker processes? :slight_smile:

  1. Could this be due to memory resident trees? Is there a suggested workaround?

As I said, my guess is that you are hit by ROOT-9133, caused by the accumulation of buffers waiting to be written out to disk in a queue. The workaround is to use fewer worker threads: the accumulation of buffers means that you are limited by disk bandwidth anyway, and throwing more threads at the problem won’t help.

  1. I will post another version later which can be used as root compilable script.

Yes please! We need a minimal reproducer to get a clear picture of the problem, disentangled from other libraries. I must say, though, I am not sure what the problem is now.

Cheers,
Enrico

Hi Enrico,
The 931% CPU usage means that TDataFrame figures out on its own how many threads it uses? Does it also mean that the input file (for this test it’s only one) is read by about 10 threads? I think this is where my problem is: with xrootd, this parallel reading does not seem to be possible.

In a single thread (with or without xrootd), the test-task is executed in about 3 minutes. Using all available resources with the initial multi-processing script on local input, this goes down to ~7 seconds.
Using the ACLiC-compiled version of the script tree_trimmerTDF.C (4.9 KB) XiczPi.txt (4.2 KB)
takes about 2 minutes with local input (the 30s were too optimistic) and 15 minutes with xrootd input. I executed these scripts in the following way:

root [0] gROOT->ProcessLine(".include /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_93/Boost/1.66.0/x86_64-slc6-gcc62-opt/include")
root [1] .x tree_trimmerTDF.C++("root://eoslhcb.cern.ch://eos/lhcb/wg/BandQ/EXOTICS/Lb2LcD0K/DfromBBDTs/ntuples/Xb2XcPi/XcPi_2011MagUp.root","Xibm2XiczPiDTTuple/DecayTree","test.root","XiczPi.txt")

Currently there are only 2, but there might be up to several hundred in future applications. I assume that putting them into a single lambda, as was done in the multiprocessing script, is the better way?

They dissolved without complaining. IIRC they are spawned, but disappear almost immediately and the script just freezes.

I think the problem is still the initial one: reading a remote file with xrootd in multiple threads/processes. It seems as if TDataFrame can also only read with one thread.

I’m currently leaning towards the initial multi-processing script, and will provide an ACLiC-compilable version of this as well.

Cheers,
Marian

TDataFrame figures out on its own how many threads it uses

No, by default it runs single-threaded. If you call ROOT::EnableImplicitMT() it runs multi-threaded, using as many threads as ROOT::GetImplicitMTPoolSize() (which should equal the number of available cores), but you can specify the size of the thread pool as ROOT::EnableImplicitMT(nThreads) if you need to.
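Concretely (a minimal sketch):

ROOT::EnableImplicitMT();      // thread pool with one thread per available core
// ROOT::EnableImplicitMT(4);  // or cap the pool at 4 threads
std::cout << ROOT::GetImplicitMTPoolSize() << std::endl; // query the pool size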

Does it also mean that the input file (for this test it’s only one) is read by about 10 threads?

Yes, each thread opens the file and processes a part of the entries.

I think this is where my problem is: with xrootd, this parallel reading does not seem to be possible.

OK, I must say I’m not overly familiar with xrootd. Why do you think so?

Using the ACLiC compiled versions of the script tree_trimmerTDF.C (4.9 KB) XiczPi.txt (4.2 KB)
takes about 2 minutes with local input (the 30s were too optimistic) and 15 minutes with xrootd input

We (ROOT team) need to look into this. I think the data-frame implementation in current ROOT master (and upcoming v6.14) fixed a couple of issues that might affect your runtimes.
In order to do so:

  • could you please provide a small reproducer that does not depend on IOJuggler?
  • is the data you are reading publicly available? (I can try to reproduce the issue with some other dataset but it would be nice to reproduce in the same conditions as you)

Currently there are only 2, but there might be up to several hundred in future applications.

2 Defines cannot be an issue. Hundreds of them will be slow to create (not to run, just to create) in v6.12, but that’s fixed in master and v6.14.

I assume that putting them into a single lambda like it was done in the multiprocessing script is the better way?

It shouldn’t be noticeably faster.

I think they crashed, since the main process hangs waiting for their results.

I think the problem is still the initial one: reading a remote file with xrootd in multiple threads/processes.

Ok, I think I’m missing a step: why do you think this is the problem, i.e. where is the smoking gun?
The multi-process script, if I understand correctly, hangs indefinitely waiting for results from worker processes which have already exited. The TDataFrame script is very slow with remote files but is also quite slow on local files, so there might be another issue.

In the case of TDataFrame, could you try calling SetCacheSize() on the TChain before you pass it to the TDataFrame constructor? That should force the usage of a TTreeCache, which is disabled by default because of a (recently fixed) bug. This should greatly reduce the number of remote reads, possibly speeding up the analysis.
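Something like this (a sketch; adapt the chain construction and file URL to your code):

TChain chain("DecayTree");
chain.Add("root://eoslhcb.cern.ch//eos/.../XcPi_2011MagUp.root");
chain.SetCacheSize(); // enable the TTreeCache with its default size
ROOT::Experimental::TDataFrame df(chain);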

It seems as if TDataFrame can also only read with one thread.

Why do you say so?

I’m currently leaning towards the initial multi-processing script, and will provide an ACLiC-compilable version of this as well.

It would help a great deal if you could provide scripts that only depend on ROOT, not on other libraries that I would have to install/compile to try out the reproducers.

I’m currently leaning towards the initial multi-processing script, and will provide an ACLiC-compilable version of this as well.

We’ll make it work then :slight_smile: As I suggested before, you could put some printouts in the worker tasks to try to figure out when the workers exit and why.

Cheers,
Enrico

Hi Enrico,

Thanks for the explanation. I took a closer look at the CPU usage, and it seems that TDataFrame has a phase where it fills memory and runs at 100% CPU at most, before it starts using more threads. Using TStopwatch, the TProcessExecutor implementation reaches Real time 0:00:27, CP time 0.520, while
TDataFrame scores Real time 0:00:58, CP time 287.770.

When reading a remote file, the CPU usage did not exceed 100%. Most of the time it was even below 50%.

In summary, I measured these numbers using 10 cores:

Class              local   eos with xrootd
TDataFrame         58s     11m
TProcessExecutor   27s     freezes

Ah, and to be fair: the TProcessExecutor version doesn’t need to transform the 2 variables, and doesn’t merge the results.

Here are the two ACLiC-compilable scripts, including the config files which I used to get the numbers in the table:
tree_trimmerTPE.C (5.8 KB) tree_trimmerTDF.C (5.0 KB)
XiczPi_TPE.txt (4.2 KB) XiczPi.txt (4.2 KB)
You can run the files in the ROOT prompt (I had to include the corresponding Boost version to make it work), e.g.

gROOT->ProcessLine(".include /cvmfs/lhcb.cern.ch/lib/lcg/releases/LCG_93/Boost/1.66.0/x86_64-slc6-gcc62-opt/include")
.x tree_trimmerTDF.C++(<input_file>,"Xibm2XiczPiDTTuple/DecayTree","test.root","XiczPi_TDF.txt")

I will send you a link by mail.

Calling SetCacheSize() did not change the timing. I used it for both entries in the table.

Thanks a lot for taking the time looking into this! I really appreciate your support!

I put printouts before and after the point where the workers crash (i.e. when adding the remote file to the chain).

Cheers,
Marian

Hi Marian,
thanks for the reproducers.

I will definitely take a look at why TDataFrame is so slow when reading over the network (beginning of next week at this point).

I will also check if I can find out why the worker processes of TProcessExecutor exit without returning results, but until proven otherwise this is a bug in tree_trimmerTPE.C so I would appreciate it if you could also take a look there :slight_smile:

Cheers,
Enrico

Hi Enrico,

Indeed, this is a likely scenario.
From some quick studies, I saw that adding the remote file to the chain on the workers is still OK, and the file is not even nullptr. But when trying to print it, only the very first file (the one opened outside of the lambda) actually prints something useful; the rest don’t print anything.
I tried to play some tricks and let the threads sleep, so that the files the worker chains try to read are opened sequentially, but this didn’t help. I didn’t even manage to “detach/delete” the file on the master, so that at least the first worker could read the file.
Maybe the files are not meant to be opened in this way, and something prevents the same file from being opened twice over xrootd.

Have a nice weekend,
Marian

Hi Marian,
regarding the TDataFrame version of your code:
with ROOT master, it takes 30 seconds on my CERN workstation when reading from AFS.
Indeed a few problems with network reading were fixed recently, so it would be interesting to see whether you also see an improvement in master w.r.t. v6.12.
The first phase of execution, with a single core working and memory usage increasing, is to be expected: it’s cling just-in-time compiling your call to Snapshot with 217 template parameters!

Would you be able to try your tree_trimmerTDF.C on ROOT’s current master branch (you can get it from cvmfs)?

Here is a version of tree_trimmerTDF.C that compiles with current master: tree_trimmerTDF_master.C (5.4 KB)

Cheers,
Enrico

Hi Enrico,

I’ll try to run the new script as soon as I find some time.

Can you point me to it? I only get old versions skimming the directories documented in: https://root.cern.ch/how-setup-root-externals-afscvmfs

To get a better feeling for this: do you know how this compilation step is done when the script is (pre-?)compiled with g++, i.e. the way I use it in my usual environment?
Can this become problematic at some point? This script only provides the trimming part so far; I want to include more features later, where it can grow to 1k or more parameters.

Cheers,
Marian

Hi Marian,

for example, to get a nightly build of root master on lxplus7, I would do:

$ ssh eguiraud@lxplus7.cern.ch
[eguiraud@lxplus089 ~]$ source /cvmfs/sft.cern.ch/lcg/views/dev3/latest/x86_64-centos7-gcc7-opt/setup.sh 
[eguiraud@lxplus089 ~]$ root-config --version # verify ROOT version
6.15/01


Regarding Snapshot:
when you write df.Snapshot(outtree, outfile, default_branches_out) this tells TDataFrame to save to disk all of the branches present in the list of names default_branches_out. In your case this is a list of 217 column names. Under the hood, this generates code equivalent to the following, which cling compiles:

df.Snapshot<T1,T2,T3,T4,...,T217>(outtree, outfile, default_branches_out)

where T1...T217 are the types of the 217 columns. It takes several seconds for cling to digest that large template function call. It would take g++ a similarly long time to digest it if you explicitly wrote the template parameters yourself in tree_trimmerTDF.C, with the difference that the time would then be spent at compile time, not at runtime. This is what is going on during that phase with a single core running and RAM usage increasing: a large template function call is being just-in-time compiled.
We will of course look into reducing this overhead (thanks for digging up this corner case; we never benchmarked snapshotting this many columns with TDF), but keep in mind that it is a ~15s constant overhead that does not depend on the amount of data you have, only on the number of columns you want to snapshot.
Hope this makes it clearer.

Cheers,
Enrico

Hi Enrico,

Sorry for being slow. There is good news: with the master branch I see that xrootd files are now read from multiple threads. I ran the scripts again (this time on a larger machine) and saw that they took 44 seconds reading from local files and 87 seconds with xrootd. This could well be down to the network transfer rate. I’ll mark the issue as fixed then and will happily use RDataFrame from now on :slight_smile:

I still have to test whether I also see the same behaviour with the g++-compiled script, but it would be magic if the compiler could know this ahead of time, since the number of branches is only fully determined at runtime by the config file (many of the things the compiler does are magic to me, including the JIT stuff, which I hadn’t heard of before).

In my case, almost all of the columns will just be doubles, floats and (unsigned) integers. Maybe this can help you in some way.

Thanks a lot,
Marian

