Read Data from RDataFrame with Implicit MultiThreading is Scrambled

Giorgio_Del_Castello · October 27, 2023, 11:00am

Hi,
I am developing a high-level readout system using RDataFrame in Python for the files produced by our analysis framework (C++ and ROOT based). I build the RDataFrame from a TChain containing one or more TFiles, and while playing with it I noticed some weird behavior.

Mainly, when I run the AsNumpy method on the branches I want to read, I notice inconsistencies in the result. In particular, if I run the command twice, the results differ unless sorted, i.e. the set of numbers returned is the same but the numbers appear in random order. Also, if I read multiple branches, the values of the different branches appear scrambled across events (meaning that the branches are not returned in the same event order, so values from different events get mixed).

I noticed that all of this disappears if I don’t call ROOT.EnableImplicitMT() (or if I explicitly call ROOT.DisableImplicitMT()).

Is there a way to use multithreading without the data being scrambled?

Thanks,
Giorgio

ROOT Version: ‘6.28/05’
Platform: Debian GNU/Linux 11 (bullseye)
Compiler: g++ (Debian 10.2.1-6) 10.2.1

mczurylo · October 27, 2023, 11:25am

Hi @Giorgio_Del_Castello,

welcome back to the forum and thank you for your question. Could you share your code (a minimal reproducer of your problem; if you prefer to send it privately, you can also drop me an email) so that it’s easier for us to look into the problem?

Cheers,
Marta

Giorgio_Del_Castello · October 27, 2023, 1:32pm

Hi,
Yes, I can share it. The ROOT file is filled with custom classes, but I managed to reproduce the problem using only native ROOT classes when reading the file.

I have now noticed that it only happens when I add more than one file to the TChain (but in principle I would like my software to handle as many files as I want, since the data is divided into smaller partial files, each with a TTree structure, for simpler I/O).

You can find the code reproducing the problem and a couple of example datafiles at: https://we.tl/t-776MStfwQ0 (the files are a bit too big to share directly on the forum).

Thanks

Danilo · October 28, 2023, 7:21am

Hi Giorgio,

It is expected that any implicit ordering is not honoured in a multithreaded environment, e.g. the order in which events are processed and exposed through the various getters of RDataFrame or Snapshot.

At the risk of being pedantic, a concrete example. If you have N clusters, i.e. “groups of events compressed together”, in your files, you have N! possible outputs, which are all correct!
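
To make the counting concrete, here is a minimal sketch in plain Python (not ROOT code). It assumes a hypothetical dataset of three clusters whose entries stay in order inside each cluster, while the clusters themselves may be delivered in any order. The cluster contents are made up for illustration.

```python
from itertools import permutations
from math import factorial

# Hypothetical dataset: three clusters of entries (contents are illustrative).
clusters = [[0, 1], [2, 3], [4, 5]]

# Every order in which the clusters could be delivered gives a different,
# equally valid output sequence.
possible_outputs = [
    [entry for cluster in order for entry in cluster]
    for order in permutations(clusters)
]

for output in possible_outputs:
    print(output)

# N clusters give N! distinct orderings, all containing the same entries.
assert len(possible_outputs) == factorial(len(clusters))
assert all(sorted(output) == [0, 1, 2, 3, 4, 5] for output in possible_outputs)
```

Each printed list holds the same entries, so any result that does not depend on event order is unaffected.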

I hope I understood well the symptoms you are describing, if not please let us know.

Cheers,
Danilo

Giorgio_Del_Castello · October 28, 2023, 7:45am

Hi Danilo,
If I understood your point correctly, that is not what is happening in my case. I don’t care about the absolute ordering of the events read, but there is no consistency in the ordering between the branches I need to read (in the same AsNumpy() command).

Let’s assume I have a root file built in the following way:
Ev| A | B
1 | a1 | b1
2 | a2 | b2
3 | a3 | b3
4 | a4 | b4

When I use multithreading, what I might read is, for example:

Ev| A | B
1 | a2 | b4
2 | a4 | b3
3 | a1 | b2
4 | a3 | b1

So basically what I get are independent permutations of the various columns I read, and the permutations appear to be different every time I read the RDataFrame.
This means that each single column always contains the same set of values (i.e. a histogram would be fine), but the ordering does not match between columns (i.e. a scatter plot of A vs B changes every time I do the reading and has no meaning).
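
The difference between a harmless reordering and the symptom described here can be sketched in plain Python (no ROOT involved). The values are the ones from the two tables above; the "consistent" case is an extra illustration in which whole rows are permuted together.

```python
# Columns as stored in the file (from the first table above).
file_a = ["a1", "a2", "a3", "a4"]
file_b = ["b1", "b2", "b3", "b4"]
original_pairs = set(zip(file_a, file_b))

# Columns as read with MT in the second table: each column has its own order.
read_a = ["a2", "a4", "a1", "a3"]
read_b = ["b4", "b3", "b2", "b1"]

# Per-column content is unchanged, so a histogram of A or of B is unaffected.
same_values = sorted(read_a) == sorted(file_a) and sorted(read_b) == sorted(file_b)

# The (A, B) pairs are no longer the ones stored in the file,
# which is what breaks a scatter plot of A vs B.
broken_pairs = [pair for pair in zip(read_a, read_b) if pair not in original_pairs]

# For contrast: permuting whole rows together keeps every pair intact.
row_order = [1, 3, 0, 2]  # illustrative order in which events are delivered
consistent_pairs = [(file_a[i], file_b[i]) for i in row_order]
still_paired = all(pair in original_pairs for pair in consistent_pairs)

print("per-column values preserved:", same_values)
print("pairs not present in the file:", broken_pairs)
print("row-wise permutation keeps pairs:", still_paired)
```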

This behavior only happens when I use MT and I have more than one file in the TChain.

I hope this makes it a bit clearer.

Thanks,
Giorgio

eguiraud · October 28, 2023, 8:30pm

Hi @Giorgio_Del_Castello ,

each different loop over events will provide a different permutation, so the symptoms look like you are reading each column in an independent run over the data.

For example this would result in what you see:

a = df.AsNumpy(["a"], lazy=True).GetValue() # GetValue triggers the event loop
b = df.AsNumpy(["b"], lazy=True).GetValue()

but this should not:

ab_dict = df.AsNumpy(["a", "b"], lazy=True).GetValue()

You can check how many event loops you are running with df.GetNRuns() and you can activate RDF logging to have a better idea of what is happening when.
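
As a rough sketch of that check, the helper below reads all columns in one AsNumpy call and uses GetNRuns() before and after to report how many event loops the call triggered. The function names are made up for illustration. The logging call follows the form given in the RDataFrame documentation for ROOT 6.28 and may differ in other versions.

```python
def read_columns_together(df, columns):
    """Read all requested columns in one AsNumpy call.

    `df` is expected to be an RDataFrame (or any object offering AsNumpy
    and GetNRuns). Returns the dict of arrays and the number of event
    loops that the call triggered.
    """
    runs_before = df.GetNRuns()
    arrays = df.AsNumpy(columns)  # instant action: runs the event loop now
    loops_used = df.GetNRuns() - runs_before
    return arrays, loops_used


def enable_rdf_logging():
    """Turn on RDataFrame info-level logging (needs a ROOT installation).

    Keep the returned object alive for as long as logging should stay on.
    """
    import ROOT  # imported here so the helper above stays usable without ROOT

    return ROOT.Experimental.RLogScopedVerbosity(
        ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo
    )
```

With a real RDataFrame, `arrays, loops = read_columns_together(df, ["a", "b"])` would be expected to give `loops == 1`; a larger number would suggest that the columns came from separate runs over the data.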

Cheers,
Enrico

Giorgio_Del_Castello · October 29, 2023, 12:35am

Hi Enrico,
I understand your point, but I am sure that it happens even if I give only one AsNumpy() command, as you suggested. I have now noticed that on a different machine the problem doesn’t happen 100% of the time, but it still happens pretty often, and it never happens if I disable multithreading.

It is also a bit puzzling to me. The worst message I see from the logger is:

Warning in <TTreeReader::SetEntryBase()>: The current tree in the TChain qtree has changed (e.g. by TTree::Process) even though TTreeReader::SetEntry() was called, which switched the tree again. Did you mean to call TTreeReader::SetLocalEntry()?

But this warning appears regardless of whether the MT read succeeds.

When I use GetNRuns() I always get 1.

Thanks,
Giorgio

mczurylo · November 7, 2023, 8:14am

Hi @Giorgio_Del_Castello,

I’m sorry to be getting back to you this late. I have some time now again to look into your problem deeper but unfortunately the file transfer has expired, could you please share the reproducer file once more? I’m very sorry for the trouble.

Cheers,
Marta

Giorgio_Del_Castello · November 7, 2023, 10:11am

Hi @mczurylo ,
No worries. I built the following two rough Python scripts:

tree_generator.py makes 2 .root files with 3 branches containing numbers from 0 to 19999 and 20000 to 39999 respectively (written in ascending order).
reconstructor.py reads the ROOT files created and compares them. First it reads them twice using MT and compares the two datasets read (once by sorting them with respect to the values of the first column, and a second time by sorting each column separately before doing the comparison). Then it repeats the same process with MT disabled.

By running reconstructor.py several times (usually between 5 and 10), at some point I manage to produce the error in the first comparison (MT + sorting only with respect to the first column), where the columns read don’t match in order. When I sort each of them separately they always match, so it’s not a problem with the values read.
Also, the error seems to appear only when I use the data generated with my custom classes (the ROOT files starting with “Production”). If I use the data generated by tree_generator.py, which is significantly simpler, it seems to always behave correctly (the ROOT files starting with “Moc”).
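
The two comparisons described above can be sketched in plain Python, without ROOT. This is not the code from reconstructor.py: the function names and the toy data are assumptions made for illustration, and each read is represented as a dict mapping a column name to a list of values.

```python
def sort_rows_by(read, key):
    """Reorder every column using the order that sorts the `key` column.

    Assumes the values in the `key` column are unique.
    """
    order = sorted(range(len(read[key])), key=lambda i: read[key][i])
    return {name: [values[i] for i in order] for name, values in read.items()}


def sort_columns_independently(read):
    """Sort each column on its own, discarding any row pairing."""
    return {name: sorted(values) for name, values in read.items()}


def count_equal(read_1, read_2):
    """Count, per column, the positions where the two reads agree."""
    return {
        name: sum(x == y for x, y in zip(read_1[name], read_2[name]))
        for name in read_1
    }


# Toy data: in the second read, column "b" was permuted differently from
# column "a", which mimics the reported symptom.
read_1 = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
read_2 = {"a": [3, 1, 4, 2], "b": [40, 10, 30, 20]}

by_row = count_equal(sort_rows_by(read_1, "a"), sort_rows_by(read_2, "a"))
by_column = count_equal(
    sort_columns_independently(read_1), sort_columns_independently(read_2)
)

print("row-aligned comparison:", by_row)
print("independent sorting:   ", by_column)
```

A mismatch that shows up only in the row-aligned comparison points to broken pairing between columns rather than to wrong values.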

Anyhow here is the link for the download: https://we.tl/t-vDbO0uTXCi

And here is the error I am referring to:

Is MT Enabled:  True
====================================
In the following lines only the differences in the two RDataFrames are shown.
====================================


============= WITH MT ===============
*******Comparison without sorting*******
Total Number of Events 40000 DataFrames lengths match: True 

Column:  PulseBasicParameters@Min.fValue
	 Number of equal elements:  9073
	 Not nan values in dataset 1:  40000
	 Not nan values in dataset 2:  40000
	 Total Discrepancy: 173057.97151445856434293091297149658203125000000000000000
Column:  PulseIntegral@Integral.fValue
	 Number of equal elements:  2287
	 Not nan values in dataset 1:  40000
	 Not nan values in dataset 2:  40000
	 Total Discrepancy: 4662.74653546508852741681039333343505859375000000000000
*******Comparison with sorting*******
Total Number of Events 40000 DataFrames lengths match: True 

=====================================
Is MT Enabled:  False
============= WITHOUT MT ===============
*******Comparison without sorting*******
Total Number of Events 40000 DataFrames lengths match: True 

*******Comparison with sorting*******
Total Number of Events 40000 DataFrames lengths match: True

mczurylo · November 7, 2023, 12:31pm

Hi @Giorgio_Del_Castello,

Thank you for your reproducer. I can only use the part of the code where you produce the mock files and I don’t have any problems there - all the comparisons work well. In order to try and see your issue with the production files, I would need some more input from your custom class as I get the following warnings and errors:

Warning in <TClass::Init>: no dictionary for class Diana::QObject is available
Warning in <TClass::Init>: no dictionary for class Diana::QBaseType<int> is available
Warning in <TClass::Init>: no dictionary for class QHeader is available
Warning in <TClass::Init>: no dictionary for class Diana::QTime is available
Warning in <TClass::Init>: no dictionary for class QPulse is available
Warning in <TClass::Init>: no dictionary for class Diana::QVectorI is available
Warning in <TClass::Init>: no dictionary for class QPulseFiller is available
Warning in <TClass::Init>: no dictionary for class Diana::QBaseType<Long64_t> is available
Warning in <TClass::Init>: no dictionary for class QPulseInfo is available
Warning in <TClass::Init>: no dictionary for class QSampleInfo is available
Warning in <TClass::Init>: no dictionary for class Diana::QBaseType<double> is available
Warning in <TClass::Init>: no dictionary for class QRunData is available
Warning in <TClass::Init>: no dictionary for class DetectorName is available
Warning in <TClass::Init>: no dictionary for class RunType is available
Warning in <TClass::Init>: no dictionary for class pair<int,QChannelRunData> is available
Warning in <TClass::Init>: no dictionary for class QChannelRunData is available
Warning in <TClass::Init>: no dictionary for class QTree is available
Warning in <TClass::Init>: no dictionary for class QBaseTree is available
Warning in <TClass::Init>: no dictionary for class QTreeInfo is available
Warning in <TClass::Init>: no dictionary for class Diana::QBool is available
Error in <TBufferFile::ReadClassBuffer>: Could not find the StreamerInfo for version 3 of the class TNamed, object skipped at offset 72
Error in <TBufferFile::CheckByteCount>: object of class TNamed read too few bytes: 2 instead of 63559
Error in <TBufferFile::CheckByteCount>: object of class TTree read too many bytes: 63643 instead of 63565
Warning in <TBufferFile::CheckByteCount>: TTree::Streamer() not in sync with data on file ./Production_02_0005_002_000002_T_p001.root, fix Streamer()
Error in <TBufferFile::ReadClassBuffer>: Could not find the StreamerInfo for version 3 of the class TNamed, object skipped at offset 72
Error in <TBufferFile::CheckByteCount>: object of class TNamed read too few bytes: 2 instead of 63559
Error in <TBufferFile::CheckByteCount>: object of class TTree read too many bytes: 63643 instead of 63565
Warning in <TBufferFile::CheckByteCount>: TTree::Streamer() not in sync with data on file ./Production_02_0005_002_000002_T_p002.root, fix Streamer()
libc++abi: terminating due to uncaught exception of type std::runtime_error: Column "PulseBasicParameters@Max.fValue" is not in a dataset and is not a custom column been defined.

Cheers,
Marta

Giorgio_Del_Castello · November 7, 2023, 12:44pm

Ah sorry, I must have had my environment variables set. I will gather what you need and send everything asap.

Giorgio_Del_Castello · November 17, 2023, 3:23pm

Sorry, it’s a bit complicated to gather everything, since everything is nested together. Is there a way to avoid selecting all the needed classes?

system · December 1, 2023, 3:24pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.