Multithread CopyTree with cut

Hi Everybody,

I would highly appreciate an explanation of the following issue, or at least to know what I am doing wrong.
I had been using the multi-threading capability of ROOT/TTree without trouble until I ran into the following issue: CopyTree with a cut (see attachment).
a) with an empty or simple cut, CopyTree runs in multi-threaded mode, very nice.
b) with complicated cuts I can only get one thread working (checked with several programs).
Has anyone experienced such behavior?

Thanks for any hint,
Faouzi
PS:
OS: Ubuntu 20.04 LTS
Desktop: Mate
Root: v6.24.00
CPU: AMD Ryzen Threadripper 24-Core Processor (48 threads)
RAM: Corsair 64 GB DDR4

thread-ex.cpp (2.5 KB)

@pcanal please correct me if I’m wrong: with a selection, there is a call to TTreeFormula::EvalInstance() for every event, which prevents effective multi-threading.

For better use of the cores, you can try using an RDataFrame for the selection and to write the result tree with RDataFrame::Snapshot(). Bear in mind though that because of multi-threading, the order of events in the output tree is likely to be different from the input tree.

Hi Jacob,

thanks for the prompt answer.

I tried without TTreeFormula as a cross-check: no change.

I will give your idea a try and come back to the forum.

But still what surprised me is the effect the Cut has!

Cheers,

Faouzi

Hi Jacob,

I looked carefully into your solution, but unfortunately I cannot use it because I will have to drop the TTree version in the whole code for that. Still, it is good to see another way of handling trees. Do you get full multi-threaded use of the machine with RDF?

Thanks again,
Faouzi

cause I will have to drop the TTree version in the whole code for that

Not sure what you mean. Indeed you’d have to rewrite your cut, but it would be way faster and multi-threaded.

To do what you try to do in a multi-threaded way, you’d have to open the file separately in each thread, get the tree separately, apply the cuts separately, and write output separately per thread.
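The per-thread scheme described above can be sketched in plain C++. This is only an illustration of the structure, not ROOT code: entry indices stand in for tree entries, and the names `splitEntries` and `selectParallel` are hypothetical. In a real ROOT program each worker would open its own TFile and get its own TTree at the point marked in the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Split [0, nEntries) into at most nChunks contiguous [begin, end) ranges.
std::vector<std::pair<std::size_t, std::size_t>> splitEntries(std::size_t nEntries,
                                                              std::size_t nChunks) {
    assert(nChunks > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t step = (nEntries + nChunks - 1) / nChunks;  // ceiling division
    for (std::size_t begin = 0; begin < nEntries; begin += step)
        ranges.emplace_back(begin, std::min(begin + step, nEntries));
    return ranges;
}

// Apply `passes(entry)` to every entry, one worker thread per range.
// Each worker fills only its own output vector, so no locks are needed.
// `passes` must be safe to call from several threads at once.
std::vector<std::size_t> selectParallel(std::size_t nEntries, std::size_t nThreads,
                                        const std::function<bool(std::size_t)>& passes) {
    const auto ranges = splitEntries(nEntries, nThreads);
    std::vector<std::vector<std::size_t>> perThread(ranges.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < ranges.size(); ++t) {
        workers.emplace_back([&, t] {
            // A ROOT worker would open its own TFile / TTree here (assumption: not shown).
            for (std::size_t i = ranges[t].first; i < ranges[t].second; ++i)
                if (passes(i)) perThread[t].push_back(i);
        });
    }
    for (auto& w : workers) w.join();

    // Concatenate the per-thread results in range order, which preserves entry order.
    std::vector<std::size_t> selected;
    for (const auto& part : perThread)
        selected.insert(selected.end(), part.begin(), part.end());
    return selected;
}
```

For example, `selectParallel(nEntries, 48, cut)` would run a thread-safe predicate `cut` over 48 independent entry ranges. Writing one output per thread and merging afterwards is what keeps the workers from sharing any mutable state.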

Hi Axel,

I don't think what you describe will work for my problem. I owe you a better description:

a) I have a TTree ROOT file containing the trajectories (x-y coordinates as a function of time) of thousands of objects; let's call it mainTree.

b) A given object M (Master) is selected and its trajectory (which may contain 1440 points) is extracted. To evaluate primary interactions, we have to select the objects S (Slaves) whose trajectories come close to the M trajectory, say within 1 Ångström. So the Master trajectory gets a 1 Ångström envelope (built from its 1440 points) that is used as a cut (envCUT) to reduce mainTree to a smaller tree, call it selTree, using:

selTree = mainTree->CopyTree(envCUT).

c) Now Events will be extracted from selTree (looping over selTree->GetEntry(i)) for further analysis.
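For illustration, the envelope selection in (b) can also be expressed as an ordinary C++ predicate rather than a formula string with one operator per envelope point. This is a minimal sketch under stated assumptions: the envelope is treated as purely geometric (the time coordinate is ignored), and `Point` and `insideEnvelope` are hypothetical names, not taken from thread-ex.cpp.

```cpp
#include <cassert>
#include <vector>

// Hypothetical point type: one (x, y) sample of a trajectory.
struct Point {
    double x;
    double y;
};

// Returns true if (x, y) lies within `radius` of at least one point of the
// master trajectory. Comparing squared distances avoids a sqrt per point.
bool insideEnvelope(const std::vector<Point>& master, double x, double y, double radius) {
    assert(radius >= 0.0);
    const double r2 = radius * radius;
    for (const Point& p : master) {
        const double dx = x - p.x;
        const double dy = y - p.y;
        if (dx * dx + dy * dy <= r2) return true;  // close enough to this master point
    }
    return false;  // no master point within the envelope radius
}
```

A predicate of this shape could be called inside an explicit GetEntry loop, or passed as a callable to RDataFrame's Filter, so the cut no longer depends on the number of operators a formula string can hold.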

The issue:

  1. With a simple cut (only a couple of operators), multi-threading performs very well in (b) and (c). I can see all 48 threads running nicely in parallel.

  2. The 1440 trajectory-envelope points mean 1440 operators, which TTree/TCut cannot digest.

Solution: TTreeFormula::SetMaxima().

Results: the multithreading stops in (b) but keeps running in (c).

  3. Reducing the number of operators just enough to avoid TTreeFormula::SetMaxima(), with the cut still containing about 45 operators, does not help: single-threaded in (b) and multi-threaded in (c).

With my post I was just hoping that somebody had run into such an issue.

In any case I will dig further till I get a satisfying answer/solution.

Thanks a lot.

Faouzi

Hi,
I’m afraid the scope of the question is too broad and relates more to the application discussed than to ROOT features we can help with.

In case you (@kolinc or @Attallah ) need support with a specific ROOT problem don’t hesitate to open a new thread.

In general, as Axel mentioned, in order to perform multi-thread reading of TTrees you will need to open a different copy of the TTree from a different TFile object in each thread (TFiles and TTrees are not thread-safe). RDataFrame is a high-level data processing interface that simplifies much of this.

Cheers,
Enrico

Hi Enrico,

I already answered Axel in detail. Multi-threading works fully with simple cuts but stops when the cuts get “complicated”; that's the issue.

Because of my teaching load (till mid-July) I have not had the time to dig in this direction, but I will, even if I have to change the whole code to accommodate RDataFrame.

If possible please leave the issue/thread open for the coming weeks.

Many thanks in advance,

Faouzi

How are you doing multi-thread reading exactly, and how does it stop working exactly? Degraded performance, crashes…?

Cheers,
Enrico

Hi

In my first post I included an example with different cuts.

I have an AMD Threadripper (48 threads) based PC:

a) with full multi-threading, I can see (cross-checked with different CPU load viewers) all 48 CPUs fully loaded: great!

b) when the cut gets “complicated”, I see just one single CPU fully loaded (100%, also cross-checked).

That is my dilemma, the cut-effect!

Cheers,

Faouzi

PS: I have been a “regular” user of CERN tools since 1989, and in all that time I have never encountered an unresolved issue!

Sorry, I meant to ask what your multi-threading scheme is: are you opening a different TTree from a different TFile in each thread? Do you have any locks or mutexes anywhere?

Hi,

Not that complicated: I am just using one TTree from one TFile. Once it is loaded, I run several analysis cuts in a loop: 1st cut, extract data for further analysis, then 2nd cut, … and that is where I noticed the performance loss. So I investigated until I found the “reason”, which I checked with an even simpler code (the one I posted): no locks, no mutexes.

Cheers,

Faouzi

Where does the multi-threading come into play?

Ah wait – if ROOT::EnableImplicitMT is the multi-threading you refer to, that only parallelizes some parts of the I/O. Everything else still runs single-threaded. If those parts of the I/O are not where most of the CPU time is spent, you won’t see much of an advantage.
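The point about only part of the work being parallelized can be quantified with Amdahl's law. The sketch below uses a hypothetical helper `amdahlSpeedup`, and the fractions in the usage note are purely illustrative, not measurements of the posted code.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the overall speedup when only
// `parallelFraction` of the run time is spread over `nThreads`
// and the remainder stays serial.
double amdahlSpeedup(double parallelFraction, unsigned nThreads) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0 && nThreads > 0);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / nThreads);
}
```

If, say, only 10% of the run time were in the parallelized I/O, `amdahlSpeedup(0.1, 48)` gives roughly 1.1, so a system monitor would show essentially one busy core, whereas a job that is almost entirely parallelizable approaches a factor of 48.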

You can check where the program spends time with tools such as perf on Linux, or vtune.

Cheers,
Enrico

All right, I will check that.

Thanks,

Faouzi