Interactively working with large ntuple/tree files: discovered a big speed difference between old and new ROOT versions (7 s vs 1 hour response time)

I discovered that if I try to open a large (in this case 6.5 GB) ntuple ROOT file with current, or even somewhat older, versions of ROOT and plot some of the variables, it takes a very, very long time: about an hour on a modern machine (Ryzen 5950X with 64 GB of RAM and an NVMe Samsung 980 Pro drive).

Meanwhile, it takes only ~40 seconds on a very old machine with an SSD drive and 8 GB of RAM, paired with an old Intel quad-core processor.

I thought something might be wrong with the hardware, but traced the difference to the older version of ROOT on the 'older' machine.

I noticed that with ROOT 6.12 the large file is very quick to plot from, while later versions show a huge slowdown. I can confirm that by version 6.18 it is already far too slow compared to 6.12, for example.
Going from 8 seconds to 1 hour is a staggering difference.

Some details:

ROOT version 6.12 vs 6.22: the difference in opening the file and then plotting one of the ntuple variables (interactively) is 7 s for ROOT 6.12 and slightly more than an hour for 6.22. I feel like such a difference must have been caused by some kind of change. In my mind, the file should be loaded into memory (if there is enough memory to allocate) and then the query should be very fast for an ntuple with 12 variables.
So this is a question for the ROOT experts: was something functionally changed in how such requests are handled as ROOT evolved?

OS: Ubuntu 20.04
kernel 6.0.12
gcc 9.4.0

file details:

size: 6.5 GB

root [1] ntp->Print()
******************************************************************************
*Tree    :ntp       : ntp                                                    *
*Entries : 121500222 : Total = 7129646342 bytes  File Size = 6905776898     *
*        :          : Tree compression factor = 1.00                        *
******************************************************************************

How to plot the variables?

I simply used the ntp->Draw("variablename") command.
*ntp is the ntuple's name.
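
For completeness, a minimal interactive session of this kind (the file name ntuple.root is a placeholder here; the actual file name is not shown in the thread) looks like:

root [0] TFile *f = TFile::Open("ntuple.root");   // the tree "ntp" in the file becomes accessible by name
root [1] ntp->Draw("variablename");               // reads the branch and draws a histogram of it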

I don't see any change that could explain such a large slowdown ;(
Could you share the file so we can investigate?

I uploaded a file to Google Drive, but the forum rules prohibit me from posting the link.
Let me know how I can share such a large file without posting the link.
Perhaps I can email you the link?

You can try to PM it to me.


I do not see an option to PM; perhaps it is due to my new user status…

You can email me at sher "at" triumf.ca and I will send you the link to the file. I've uploaded two ROOT files; one is a bit smaller, but still shows a slowdown of about 40x. In ROOT 6.12 I can plot the variable within a second; in 6.22 it takes 40 seconds.

Just to reproduce the issue as closely as possible: what is the exact v6.12 version you are testing with?

The version I am using is ROOT 6.12/06, compiled from source.

I received your email and replied; please let me know if you had trouble getting any of the files.

One of them, 300m.root, is the smaller file, which still showcases the problem.

With ROOT 6.12 I am able to plot a variable from this file in a matter of a second.

With ROOT 6.22 it takes 40 seconds.

The exact command I use:

ntp->Draw("WL")

WL is the wavelength variable; the output is from a Geant4 simulation with optical photons, just to give some background to the problem.

I was able to download the files (at least the small one) and can reproduce the problem. I will let you know what I find.


Thank you! I discovered it by chance while migrating from old hardware to new; I initially thought it was a hardware issue, but then traced it back to the ROOT version. Glad the problem can be reproduced. I will wait to hear from you or other experts on this subject.

I was wondering if there are any developments in the investigation. I can overcome the slowdown by using older ROOT versions, but I am very curious as to what is causing this.

The newer version of the TTreeCache is having problems with this particular file organization, most likely because of the very small baskets; there are on average only 500 bytes in each basket! You may want to increase (at write time) the basket size by increasing the cluster size (only 100 entries in your case), e.g. tree->SetAutoFlush(10000).
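
For illustration, a minimal standalone sketch of setting this at write time (the file name, branch, and fill loop here are placeholders; in the actual setup the tree is created inside Geant4):

{
   // Sketch: write a tree with a larger cluster size (placeholder setup).
   TFile *f = TFile::Open("output.root", "RECREATE");
   TTree *tree = new TTree("ntp", "ntp");
   float WL = 0;
   tree->Branch("WL", &WL, "WL/F");
   tree->SetAutoFlush(10000);   // start a new cluster every 10000 entries instead of every 100
   for (int i = 0; i < 1000000; ++i) {
      WL = 400 + (i % 300);     // dummy data
      tree->Fill();
   }
   f->Write();
   delete f;
}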

You can work around the problem at the moment by disabling the TTreeCache with

tree->SetCacheSize(0);
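
For example, in an interactive session (using the tree and variable names from earlier in this thread):

root [0] TFile *f = TFile::Open("300m.root");
root [1] ntp->SetCacheSize(0);   // disable the TTreeCache for this tree
root [2] ntp->Draw("WL");        // fast again, even on the affected versions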

In the meantime, I am continuing to investigate what changed in the TTreeCache code (i.e. what needs to be fixed) that led to this issue.

Cheers,
Philippe.


Thank you, it might be due to the fact that I am writing output during a Geant4 simulation, from the stepping function.
I wish I could help by providing more information, specifically in which version exactly this transition happened; however, I only tested 6.12 (no problem) and 6.18 (already has the problem). I did not try all the releases in between.
Thank you for looking into this!
I will try the SetCacheSize(0) trick to fix this for later ROOT versions.

kindest regards,

Aleksey

When producing the file, the AutoFlush setting was set to 100 entries. For your data set this means that a cluster (the unit of reading for the TTreeCache) is only 2636 (compressed) bytes; the typical value here is 32 MB!! For example, you can rewrite your file using the following script:

{
   // Rewrite the file with a sane cluster size; the cloned tree inherits
   // the AutoFlush setting from the input tree.
   TFile *inputFile = TFile::Open("300m.root");
   auto input = inputFile->Get<TTree>("ntp");
   input->SetCacheSize(0);            // avoid the slow cache while reading the old file
   input->SetAutoSave(0);
   auto outputFile = TFile::Open("300m-smaller.root", "RECREATE");
   input->SetAutoFlush(-32000000);    // flush roughly every 32 MB instead of every 100 entries
   auto output = input->CloneTree(-1, "");  // copies the entries, re-basketing with the new setting
   outputFile->Write();
   delete inputFile;
   delete outputFile;
}
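
(If saved as, say, rewrite.C, a file name chosen here for illustration, this unnamed macro can be run with: root -l -q rewrite.C)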

In my test this results in a file where the TTreeCache is back to the expected behavior (and is actually close to 10% faster than running without the TTreeCache).

And possibly just as importantly, the resulting file is actually compressed:

874043128 for the original 300m.root
237449042 for the newer 300m-smaller.root

I.e., almost a factor of 4 (!!!) smaller.

To solve the problem at its source, you need to make sure the proper AutoFlush setting is used when writing the file (i.e. from Geant4), by calling something equivalent to:

tree->SetAutoFlush(-32000000);

A negative value bases the flushing on the number of bytes written (here roughly every 32 MB) rather than on a fixed number of entries. If Geant4 does not let you customize this, then please open a Geant4 issue report asking to allow users to create efficient files.

Cheers,
Philippe.

PS. The original problem is still disappointing and should be fixed even when the file is very inefficiently produced (see the factor of 4 in size).

Thank you Philippe for the detailed explanation.
I can confirm on my end that doing ntp->SetCacheSize(0) for the problematic trees (ntp is just the name of the ntuple in the ROOT file) does solve the problem.
So there is a simple workaround. However, it seems like older versions of ROOT may have had this as the default, and it later changed.

I will run an experiment trying SetAutoFlush(-32000000) inside Geant4 to see if it is possible, and will update this thread with the results.
On the other hand, I hope you can trace the change across ROOT versions that broke the handling of such inefficiently created files, since older versions handled them with no problems. I hope the fix will come in the next release.

kindest regards,

Aleksey

P.S. After this discussion my user status has changed and I can now send messages to other members, which would have been helpful right away in this situation, but I think it was not a roadblock to the solution in the end.

This issue is resolved by the PR: Fix TTreeCache scaling with the number of clusters/baskets by pcanal · Pull Request #12650 · root-project/root · GitHub

This was due to an initialization of the cache being done inside a loop instead of outside it, which led to O(N^2) behavior instead of O(N) (improved to O(log N) in the PR), but in code that is usually fast compared to the rest of the I/O pipeline (decompression and unboxing).

The particulars of this file:

  • a cluster size of 100 entries for only 12 branches, each containing a single float;
  • consequently, a high number of clusters (already 150,000 in the small version of the file);
  • and very little (actually, apparently no) compression.

Due to the simplicity of the data model and the absence of compression, the I/O pipeline is very fast, making the TTreeCache code a larger fraction of the run time. And when that no-longer-so-small code is then run (uselessly) hundreds of thousands more times… it accounts for the huge increase in run time you saw (and explains why most others don't see it, at least not enough to report it).
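
To make the scaling concrete, here is a schematic stand-in (not ROOT's actual code) for this kind of bug: an initialization step that itself scans all N clusters is harmless when done once, but becomes quadratic when repeated inside the per-cluster loop.

#include <cstdio>
#include <vector>

// Stand-in for an initialization that scans all cluster boundaries: O(N) by itself.
static long initialize(const std::vector<long> &clusterStarts) {
   long sum = 0;
   for (long s : clusterStarts) sum += s;
   return sum;
}

int main() {
   // Scaled down for a quick run; the small file in this thread already has ~150,000 clusters.
   const long nClusters = 20000;
   std::vector<long> clusterStarts(nClusters);
   for (long i = 0; i < nClusters; ++i) clusterStarts[i] = i * 100;

   long total = 0;

   // Intended shape: initialize once, then one cheap step per cluster -> O(N).
   total += initialize(clusterStarts);
   for (long i = 0; i < nClusters; ++i) total += clusterStarts[i] & 1;

   // Buggy shape: the same initialization repeated inside the loop -> O(N^2),
   // i.e. nClusters * nClusters operations instead of nClusters; with a large
   // enough N this is the kind of difference that turns seconds into an hour.
   for (long i = 0; i < nClusters; ++i) {
      total += initialize(clusterStarts);
      total += clusterStarts[i] & 1;
   }

   std::printf("checksum: %ld\n", total);
   return 0;
}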

Note: the problem appears in v6.14 and will be fixed in v6.30/00 and v6.28/04 (and most likely in the patch branches of some of the older releases).

