Bad allocs from large input samples and different compression settings of in- and output

Dear experts,

With the script attached below I’m using RDataFrame to quickly trim large input chains/trees. This worked nicely for me in the past with moderately sized inputs. However, I was now trying to trim a ~40 GB tree with 24 threads and got a bad_alloc after the job used a little more than 13 GB of memory.

I then tried to manually set the compression settings to those of the input file (206). This improved the memory consumption (up to 5 GB), but produced a different error (running with IMT, 24 threads):

Fatal: !fStack.empty() && "Trying to pop a slot from an empty stack!" violated at line 33 of `/mnt/build/jenkins/workspace/build/projects/ROOT-HEAD/src/ROOT/HEAD/tree/dataframe/src/RSlotStack.cxx'

I hope that some sort of remote diagnosis is possible, since making a reproducer would be time-consuming.

Edit: The output tree has about 165 variables and is ~5 GB.

Thanks and a happy new year,
Marian

tree_trimmerRDF.cpp (6.2 KB)


ROOT Version: 6.17.01 (nightly build)
Platform: x86_64-slc6-gcc8-opt
Compiler: g++ 8.2.0


When back at work next week, @dpiparo or @eguiraud might be able to have a look.

Hi Marian,

I am working on this.

D

Hi Danilo, Hi Axel,
Thanks for your replies. Is there a JIRA task/issue related to this? If yes, we can close this thread.
Cheers,
Marian

Found it: https://sft.its.cern.ch/jira/browse/ROOT-9892
Is the large memory consumption when “transforming” compression settings directly related?

Cheers,
Marian

Hi Marian,
the usual reason for large memory consumption with RDataFrame is reading and writing out an input TTree with very bad clustering.
If you can confirm that running an event loop without a Snapshot does not show memory issues, then this is probably https://sft.its.cern.ch/jira/browse/ROOT-9587

Cheers,
Enrico

P.S. In any case, the large memory consumption is definitely not ROOT-9892.

Hi Enrico,
I can confirm your assumption. I produced a histogram in the event loop and the memory usage stayed constant at ~2.4 GB.
Is there anything I can do during the production of the input data? I’m usually adding them from xrootd locations with hadd -ff.
Cheers,
Marian

Hi @marian,

about remote diagnosis: what is the invocation line of your trimming tool? I am interested in particular in the number of workers you set when you encounter the problem and the total number of logical cores of the machine you are running on.

Cheers,
D

Hi Danilo,
I’m setting the number of threads at runtime from an external config file in lines 92ff. It is the minimum of requested and available workers. For my test I used all available 24 logical cores, which is also the number you get from ROOT::GetImplicitMTPoolSize().
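The logic is roughly the following (a sketch; the function name is mine, and the result would then be passed to ROOT::EnableImplicitMT):

```cpp
#include <algorithm>
#include <thread>

// Sketch of the worker-count logic described above (function name is hypothetical):
// take the minimum of the requested worker count (from the config file)
// and the available logical cores.
unsigned PickWorkerCount(unsigned requested) {
   const unsigned available = std::thread::hardware_concurrency(); // 24 on this machine
   return std::min(requested, available);
}
```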

Cheers,
Marian

CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          4

Is there anything I can do during the production of the input data? I’m usually adding them from xrootd locations with hadd -ff.

You can try with hadd -ff -O. If that doesn’t work @amadio might have a suggestion. It would also be interesting to see what tree->Print("clusters") says when tree is that hadd-ed input tree that causes problems with Snapshot.

Sorry for mixing posts with @dpiparo but we are really dealing with two independent issues.

Printing the “bad” tree gives the output pasted below. In the meantime I’m re-hadding part of the files, trying different flags, and will update the post as soon as I see something interesting.

root [2] DecayTree->Print("clusters")
******************************************************************************
*Tree    :DecayTree : DecayTree                                              *
*Entries : 19681983 : Total =    110399348311 bytes  File  Size = 44663711205 *
*        :          : Tree compression factor =   2.47                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                86190            15952
1                86191            107467           15948
2                107468           135450           15952
3                135451           177769           15948
...

Note: about 5% look like

391              19037669         19063100             0

the last one is

404              19660070         19681982         15952

Note: about 5% look like

 391              19037669         19063100             0

That could be your problem. Hopefully @pcanal can confirm, but I suspect these are hadd-ed empty clusters…?

How many files do you hadd together? Do any of the files contain empty TTrees?

I suspect these are hadd-ed empty clusters…?

Not quite. It is one (or more) files that were unclustered and contained (in total) around 25431 entries.

What does “unclustered” mean in terms of TTree cluster iterator? Would the cluster iterator behave as if there were 25431 clusters of 1 entry?

To follow up, I tried to hadd smaller files and saw nothing suspicious. Since then I’m hadd-ing larger files:

-rw-r--r-- 1 mstahl lhcb  18G Jan  7 16:43 test_ff.root
-rw-r--r-- 1 mstahl lhcb  26G Jan  7 18:50 test_fk404O.root
-rw-r--r-- 1 mstahl lhcb 4.1G Jan  7 19:47 test_ffO.root

The last one will probably take a few more hours. As the names indicate, the first one was produced with the -ff flag, the second with -fk404 -O, and the third with -ff -O.

Printing the trees results in (-ff):

******************************************************************************
*Tree    :DecayTree : DecayTree                                              *
*Entries :  9020405 : Total =     53278376347 bytes  File  Size = 19285366548 *
*        :          : Tree compression factor =   2.76                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                3198174          15952
1                3198175          3235549              0
2                3235550          9020404          15952

and (-fk404 -O):

******************************************************************************
*Tree    :DecayTree : DecayTree                                              *
*Entries :  9020405 : Total =     53248084972 bytes  File  Size = 27857835614 *
*        :          : Tree compression factor =   1.91                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                9020404          15952

During trimming I didn’t see a large difference in memory usage. However, IMT had a larger impact on the -ff file (faster in real time, but slower in CP time):

Real time 0:02:06, CP time 2238.090 vs. Real time 0:03:15, CP time 912.820

The output of both jobs looks the same up to a few bytes:

******************************************************************************
*Tree    :DsPiTuple : DsPiTuple                                              *
*Entries :  2323684 : Total =      3285018122 bytes  File  Size = 2262241595 * <- ff
*Entries :  2323684 : Total =      3284986667 bytes  File  Size = 2262200386 * <- fk404 O
*        :          : Tree compression factor =   1.45                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                2323683          -30000000

I hope this helps with the remote diagnosis :slight_smile:

Thanks,
Marian

@eguiraud Prior to v6.14/06, due to a deficiency, the ‘estimate’ was a series of clusters of size 1! In v6.14/06 and later, the cluster size estimate is fTree->GetEntries() * cacheSize / zipBytes, i.e. the average number of entries that fits in the current TTreeCache size.

Uhm, I see, thanks. Unfortunately this does not solve the mystery: Marian is on v6.17 and he sees Snapshot hogging memory, which should only happen when the TBufferMerger queue is filled with many buffers (which have a 10 MB minimum size even if they only contain a few KB of useful information). So my best guess is still that the clustering of the input file is such that each cluster contains much less than 10 MB worth of data.

@marian, do you see a large unexplained memory usage if you just do RDataFrame(treename, filename).Snapshot("t", "outfile.root");? If yes, I’m afraid we’ll need one of those files: would it be an option to share them?

Cheers,
Enrico

Yes, I still see large memory usage. I have now attempted to run with an updated version (see attachment) of my script, which allows calling Snapshot with an empty string as the third argument and just copies the tree. On the input mentioned in the initial post I get many

Error in <ROOT::Experimental::TBufferMergerFile::mkdir>: An object with name B2DsPiDTTuple exists already

messages.

You mentioned that I’m using ROOT 6.17. This is only true for the trimming script, I’m afraid. The input files were created some 2 years ago. Could this be the issue?

Cheers,
Marian

tree_trimmerRDF.cpp (7.1 KB)

EDIT
When running just the Snapshot from the interactive shell, I see a peak memory consumption of 13 GB, without errors from TBufferMergerFile, but another

Fatal: !fStack.empty() && "Trying to pop a slot from an empty stack!" violated at line 33 of `/mnt/build/jenkins/workspace/build/projects/ROOT-HEAD/src/ROOT/HEAD/tree/dataframe/src/RSlotStack.cxx'

another EDIT
In the edit above I was running with the “wrong” input file (in LHCb we usually divide the data by magnet polarity, and I got the wrong one). This accident could be interesting, since this “wrong” file only has one cluster:

******************************************************************************
*Tree    :DecayTree : DecayTree                                              *
*Entries : 19521683 : Total =    109614334936 bytes  File  Size = 44568205380 *
*        :          : Tree compression factor =   2.46                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                19521682         15948

When running on the large input tree with 405 clusters, I see a memory consumption of up to 13 GB as well, but this time it completed successfully. For what it’s worth, I’m also attaching some monitoring plots showing the processing of the two files (1 cluster first, then the one with 405 clusters):

Cluster Range #  Entry Start      Last Entry        Size
0                0                19521682         15948

Hi, just to clarify, the output of Print("clusters") can be a bit misleading at first: this line indicates you have many clusters, all of the same size (15948 entries).

For the "Trying to pop a slot from an empty stack!" error, please refer to the corresponding jira ticket (it’s being actively worked on).

For the large memory usage, there is one last thing you can try to confirm that the cause is the jira ticket I shared above (i.e. bad clustering):

auto it = tree->GetClusterIterator(0);
Long64_t clusterStart = 0;
Long64_t clusterEnd = 0;
const auto nEntries = tree->GetEntries();
while (clusterEnd != nEntries)
{
   clusterStart = clusterEnd;
   clusterEnd = it.Next(); // returns the first entry of the next cluster
   std::cout << clusterEnd - clusterStart << std::endl;
}

This should print the sizes of the clusters as seen by RDataFrame.
If there are a lot of small values, that’s a problem.

If this test does not show irregularities, we will need a file to diagnose why Snapshot hogs that much RAM.

Cheers,
Enrico

Hi Enrico,
I don’t see many small values. Even the file with many clusters of different sizes doesn’t look too bad. I will let you know where the file is.
Cheers,
Marian