Head-scratching issues with RDataFrame

Dear experts,
It is going to be hard to share code on this topic, so I just have a simple question before jumping into a debugging session.
Is there a specific reason why, in all the tutorials, an RDataFrame is never constructed with

auto df = new RDataFrame(tree);   // tree being a TTree

?

We are loading ntuples with a TChain using the xrootd protocol (a local cluster loading tuples from EOS).

When 2 jobs are running in parallel on the Condor system, each having called

EnableImplicitMT(10)

before launching the application, we print TChain::GetEntries(), and in both jobs the value is the same, so apparently all entries are loaded in the TChain object passed to construct the RDataFrame.

In one job the value returned from a filter is X, and in the other it is Y.
I wonder:

  1. Is there an issue with using pointers to DataFrames?
  2. Should the problem be sought in network instabilities and in how MT works when TChained TTrees are used in a multi-threaded application?

Thanks.
Renato



ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hi,

Is there an issue with using pointers to DataFrames?

No, there shouldn’t be.

apparently all entries are loaded in the TChain object

Just a note: nothing is loaded – the number of entries is readable just by checking the headers (the first bytes) of each TTree.

Should the problem be sought in network instabilities and in how MT works when TChained TTrees are used in a multi-threaded application?

MT should work just fine when reading data through xrootd. What exactly are you comparing? Note that when running with multiple threads “entry number 2” for one event loop might not be the same as “entry number 2” for another run.
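For example (an illustrative sketch, assuming your TChain is called chain and the special rdfentry_ column available in recent ROOT versions):

ROOT::EnableImplicitMT();
ROOT::RDataFrame df(chain);
// Take collects values in processing order, which is not deterministic
// across multi-threaded runs, so this vector can differ from run to run:
auto entries = df.Take<ULong64_t>("rdfentry_");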

Cheers,
Enrico

So,
a PhD student ran jobs over the weekend.
What we found is that 2 jobs which should have produced the same number out of:

df.Filter(cut).Define("weight","1.").Sum<double>("weight")

give different results, even though they both use the same cut.

The original DF is created using a list of files. We are hunting for the problem: whether it comes from the 2 jobs having ended up using 9 out of 10 files, or whether something nastier is going on.

How big are the files, how big are the results of the Sum and how big is the difference between the results? I.e. could it be a floating point truncation issue?
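As a quick cross-check that sidesteps floating point entirely (a sketch based on the snippet you posted): Count() returns an integer, so two jobs reading the same data should agree exactly:

// ULong64_t count of passing entries; no floating point summation involved
auto n = df.Filter(cut).Count();
std::cout << *n << std::endl;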

If you get a Report from the filter, doesn’t it report the same number of entries evaluated and passed in the two cases?
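For example (a sketch; the filter name "my_cut" is arbitrary, and only named filters appear in the report):

auto filtered = df.Filter(cut, "my_cut");
auto report = df.Report();
report->Print();  // prints entries processed and passed for each named filter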

If you reduce this to something we can run ourselves, feel free to open a Jira ticket.

Cheers,
Enrico

The files are huge: O(4677695) entries, or even more.

I am trying to reduce the problem and make it reproducible.
What I noticed in those jobs is that they sometimes consume a lot of memory, going into swap.

For example, 4 jobs that should have the same result value:

versions = ['0612_uptoL0_forDTFfit_preMVA', '0712_uptoL0_forDTFfit_postMVA', '0612_uptoL0_forNODTFfit_preMVA', '0712_uptoL0_forNODTFfit_postMVA']
results    = [4678966.0, 4655927.0, 4654611.0, 4678966.0]

Hmm, weird, the numbers do not seem large enough for a double to lose that much precision… The Report should give interesting information (remember to give a name to the Filter for it to appear in the report).

Also interesting: are the results stable for a given dataset? If two different runs on the same dataset yield two different results, this might hint at a problem in RDF.

EDIT: it seems the filter is cutting basically no entries, isn't it?

Cheers,
Enrico

Hi @eguiraud, we have re-run the test on lxplus and things are all consistent now.
The jobs showing this behaviour were executed on a cluster outside CERN, loading files from eos/lhcb.
The network was not perfect, so I am asking again whether there might be some problem in doing what we do.

In a few words:

We have our ntuples spread in various folders.

0/Tuple.root
1/Tuple.root
.....

Tuple.root typically contains:

DecayTree
MCDecayTree;1
MCDecayTree;2

What we do in the application is to build the TChain objects, simultaneously loading "DecayTree" and "MCDecayTree" (2 dataframes existing at the same time).

The .root files are added in a loop to both TChains.

Then we bookkeep operations on both, and we trigger the event loop first on DecayTree and then on MCDecayTree.
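Schematically (a sketch of the pattern, with made-up names; filePaths stands for the list read from the text file):

TChain decayChain("DecayTree");
TChain mcChain("MCDecayTree");  // ROOT picks the highest cycle (;2) by default
for (const auto &path : filePaths) {
    decayChain.Add(path.c_str());
    mcChain.Add(path.c_str());
}
ROOT::RDataFrame dfDecay(decayChain);
ROOT::RDataFrame dfMC(mcChain);
// book operations on both, then trigger the two event loops one after the other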

My suspicions are:

  1. If files are loaded via the xrootd protocol and 2 RDataFrames are in a way linked to the same TFiles, should we expect some instabilities?
  2. If the network has some issues, does RDataFrame check for that while processing, so that files and entries do not get missed?
  3. Is the fact that we have multiple cycles with the same key, plus xrootd, a potential source of issues?

Unfortunately I am not able to reproduce the problem, as I do not have access to the external cluster, and on lxplus all seems to work as expected.
Thanks

What do you mean? Calling TChain::Add in a loop? That does not load any data into memory – and it's fine to do.

  1. Each RDataFrame will use its own TFiles; they will not step on each other's TFiles.
  2. No. RDataFrame (or more precisely TThreadExecutor) basically creates a TTreeReader per task and loops over the events of that TTreeReader object (see the sketch after this list). Data sanity checks should come from TFile, TTree or TTreeReader. If no warnings/errors are printed to screen, you should assume there were no errors in reading the data.
  3. I don't know about this one. Maybe @pcanal or @Axel can help.
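Schematically, each task does something like this (an illustrative sketch of the mechanism, not RDF's actual code; begin and end stand for the entry range assigned to the task):

void processRange(TChain &chain, Long64_t begin, Long64_t end) {
    TTreeReader reader(&chain);
    reader.SetEntriesRange(begin, end);  // restrict this reader to the task's range
    while (reader.Next()) {
        // evaluate the booked filters/defines/actions on the current entry
    }
}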

Cheers,
Enrico

No, this is normal/usual (multiple cycles of the same tree in a file are expected).

Note that calling GetEntries beforehand is 'good' for debugging, but it is usually a source of performance loss, because it needs to open all the files and read the TTree header of each of them to acquire this information.

As a side note, do you still see the inconsistency without that call?

Also, did you try running:
a) locally, multiple times in a row (do you get consistent results?)
b) without multi-threading
c) with fewer threads
For b) and c), see the sketch below.
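For example (the 4 is just an arbitrary smaller thread count):

ROOT::DisableImplicitMT();   // b) run single-threaded
// ... run the job ...
ROOT::EnableImplicitMT(4);   // c) run with fewer threads
// ... run the job again and compare the results ...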

Cheers,
Philippe.

One more question to @RENATO_QUAGLIANI.

How do you build the chains? Do you ls / stat / shell-glob the files and hook them into the chain, or do you ship a list of files that have to be there? In the first case you could miss a file because of network issues, so it never gets processed; in the second case you should see errors.
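For the second approach, one way to fail fast on a missing file is to try opening each one before adding it (a sketch; as far as I know, TChain::Add with an explicit filename does not itself verify that the file is readable, so errors would otherwise only surface at processing time):

for (const auto &path : filePaths) {
    std::unique_ptr<TFile> f{TFile::Open(path.c_str())};
    if (!f || f->IsZombie())
        throw std::runtime_error("cannot open " + path);
    chain.Add(path.c_str());
}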

Also a question to @pcanal and maybe to @RENATO_QUAGLIANI for testing:
A long time ago I suffered from a bug (no idea if it's still there): when asking for the number of events in a TChain before processing (either with GetEntries() or with GetEntriesFast(), you might try both), you would get a number N. The process of retrieving that number would, however, leave a number n in some buffer (I think the number of events in the last file), and reading the chain would then silently stop after n events. That should have been fixed a long time ago, as it would bite a lot of users, but does that ring any bell?

Hi,
So what we do is:

/eos/..../eos.list is a text file we xrdcp locally before loading the tuples, with a call like:

system("xrdcp ...");

This text file contains the file paths, for example:

/eos/...../file{1,2,3,4}.root

We go through the local file to load the list of names and then we call:

chain.Add("root://eoslhcb.cern.ch//eos/.../file{1,2,3,4}.root");
Finally, the TChain is passed to the DataFrame:

DataFrame df(chain);

and sometimes we do:

if (chain != nullptr) {
    dataFrame = new RDataFrame(*chain);
}

This is when things are loaded from remote.
On lxplus we use direct access, without the
root://eoslhcb.cern.ch prefix prepended to the filename.

I tried with/without MT locally on lxplus, without the xrootd access (no diffs). On the remote cluster I am finding a way to do that.

Thanks for the reply. I'll try to update the thread as soon as I have news.

I do have another question, which might be a dumb one: we found some time ago that we had to set

export OPENBLAS_MAIN_FREE=1

to allow multi-threading to find the number of cores on the machine the jobs were running on.
In some cases ROOT::EnableImplicitMT() was setting ncpu==1 (some other times ncpus==10).
I wonder if something can come from there as well.
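For reference, one way to check what the pool actually got (a sketch; ROOT::GetThreadPoolSize() exists in recent ROOT versions, while older ones had ROOT::GetImplicitMTPoolSize()):

ROOT::EnableImplicitMT();
// print how many worker threads ROOT actually configured:
std::cout << "pool size: " << ROOT::GetThreadPoolSize() << std::endl;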

  1. In a C++ executable, is it safe to switch multi-threading on/off?
    I.e.:

for (const auto &sample : mysamples) {
    ROOT::EnableImplicitMT();
    /*
      Do stuff on sample [load TChain, make RDataFrame, process, report output]
    */
    ROOT::DisableImplicitMT();
    /*
      do some other stuff
    */
}

Hi Renato,

Did you mean something like:

RDataFrame* df;
df = new RDataFrame(*chain);
...
if (chain != nullptr) {
    // Note that here you have a memory leak:
    df = new RDataFrame(*chain);
}

? In your example, you are assigning a pointer to an object on the stack. That should give an error, but if it compiles for some reason, this might be a source of problems.
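If a pointer is really needed (e.g. to construct the dataframe conditionally), a leak-free sketch, assuming the type in question is ROOT::RDataFrame, could look like:

std::unique_ptr<ROOT::RDataFrame> df;
if (chain != nullptr) {
    // make_unique avoids the manual delete, and thus the leak
    df = std::make_unique<ROOT::RDataFrame>(*chain);
}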

Some more questions:

  1. Does the code abort if one of the files is not found? I am not sure what happens if one of the files fails to copy.
  2. Why is copying necessary? ROOT should be able to read directly from EOS. It looks like the copies are not read when the chain is constructed, or did I miss something about the paths on EOS?

Yes, this can make a difference. ROOT uses as many threads as it finds logical cores on the machine. If you are summing a lot of floating point numbers, the order in which you sum can make a big difference. In particular, summing in 10 threads vs. 1 thread will reorder the summing operations. However, as Enrico pointed out here:

The differences between the numbers are quite large. That only happens when you have to sum numbers of very different magnitudes (e.g. 1. + 1.E-16): the small addend is simply absorbed by rounding, so you will not see the 1.E-16 part. I could imagine that in very specific circumstances this could make an observable difference. That should be easy to test by explicitly specifying the number of threads, ROOT::EnableImplicitMT(N), and running N=1 and N=10 etc. against each other on lxplus.
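To illustrate the absorption effect, a standalone snippet (not from the thread):

#include <iostream>

int main() {
    double big = 1.0;
    double small = 1.e-16;
    // 1e-16 is below the rounding precision of double around 1.0 (~2.2e-16),
    // so adding it to 1.0 changes nothing:
    std::cout << std::boolalpha << (big + small == big) << '\n';  // true
    // accumulating many small terms first keeps them from being lost:
    double sum = 0.;
    for (int i = 0; i < 1000; ++i) sum += small;
    std::cout << (big + sum == big) << '\n';  // false: 1000*small survives
}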

I guess you can do that, but the dataframe must be done running its event loop before you toggle multi-threading.
