RDataFrame Memory


ROOT Version: 6.14.04
Platform: centos7
Compiler: gcc 4.8


Hi,

I’m trying to implement a few cuts using RDataFrame and then save the tree resulting from these cuts in a new file. The problem is that both the Python and C++ versions keep increasing memory usage until they hit the cap and the code crashes with a segmentation violation.

The original tuple in question is about 500 MB.

Is there any bug in my code, or is this tool not supposed to be used like this?

I’m attaching the files that I wrote trying to do the same thing.

Thanks for the help!

dataReduce.cc (1.5 KB)
reduce_Data.py (1.1 KB)

Hi,

We are looking at your reproducer. Thanks for the report.

Cheers,
D

Hi,

I do not see obvious mistakes. Could you provide the input file so that I can try your code?

Cheers,
D

Hi Danilo,

You can reproduce it with this shortened version of tutorials/df007_snapshot.C .

-Eddy

modified vedf007_snapshot.C (981 Bytes)

Hi,
this kind of issue is usually due to a multi-thread Snapshot of a ROOT file with highly suboptimal clustering (e.g. each entry of the input TTree is zipped by itself).
What happens in those cases is that reading entries takes much less time than writing them, so (uselessly large) unwritten buffers of data start accumulating in memory. See ROOT-9133.

So let’s check if you are indeed seeing ROOT-9133.

@Eddy_Offermann do you see elevated RAM usage when executing the macro with root -l -b -q or only when executing with EnableImplicitMT (i.e. root -l -b -q -t)? (will be running it myself asap)

@vfranco can you confirm that removing EnableImplicitMT() from your macro “fixes” the issue?
If yes, you can either check the clustering of your file with

TTree *t = nullptr;
file.GetObject("treename", t);
auto it = t->GetClusterIterator(0);
// operator() advances to the next cluster and returns its first entry
for (auto entry = it(); entry < t->GetEntries(); entry = it())
  std::cout << entry << std::endl;

or something similar (I have not tested the code, but it should give you an idea),
or as @Danilo suggests you can share your file with us.

Cheers,
Enrico

Hi Enrico,

I see it with both root -l -b -q and root -l -b -q -t

-Eddy

Hi Eddy,
in your case I see roughly 10 MB/sec being allocated and never released.
It decreases to roughly 1 MB/sec if I change Snapshot(...) to Snapshot<int, float>(..., {"b1", "b2"}).
It’s the jitting: every dataframe computation graph that you allocate in that while(true) infinite loop just-in-time compiles a few things (the biggest of which is the call to Snapshot if you don’t pass the template parameters), and that memory is never released (which is expected).

If this is the amount of memory hogging that you see too, I don’t think this is a bug: in general it should not be a problem that instantiating an RDF computation graph with just-in-time-compiled components takes a few MB.

I cannot reproduce the amount of RAM hogging that @vfranco talks about (enough to use up all available RAM in a few seconds). If you don’t see it either, I still think that @vfranco is hitting ROOT-9133.

Let me know :smile:
Cheers,
Enrico

Hi,

Thanks for the help!

Unfortunately removing the EnableImplicitMT line hasn’t solved the problem.

I’ve put one example tuple file in /afs/cern.ch/work/v/vifranco/public/output_Data_tuple.root if you can access that.

But I didn’t quite follow why looping over the tuple would tell me if the tuple is badly compressed or not…

Thanks for the test data, we’ll check what’s going on asap.

The cluster iterator loops over TTree cluster boundaries. A cluster is a batch of entries that are compressed together. A normal cluster iteration jumps over many entries, so you should see something like 0 2140 5899 .... If you see the cluster iteration taking very small steps, e.g. 0 1 2 ..., it means that you have bad clustering (e.g. each entry is compressed by itself). That not only causes bad reading performance per se (no matter how you read the ROOT file), it also happens to trigger this issue of Snapshot where (abnormally large) buffers to be written to disk queue up in your RAM.

Cheers,
Enrico

Hi Enrico,

Thanks for the explanation.

So I had a look at the cluster sizes using this code:

TTree *t = nullptr;
file.GetObject("treename", t);
auto it = t->GetClusterIterator(0);
for (auto entry = it.GetStartEntry(); entry != t->GetEntries(); entry = it.Next())
  std::cout << it.GetStartEntry() << std::endl;

I guess this shows where every cluster begins? If that’s right, then it seems there are regular jumps of 20k-ish entries.

FYI, You can get the same info with just:

t->Print("clusters");

Ok,
it seems my guess was incorrect then :sweat_smile:
Thank you for providing a small reproducer and the data, I will try to reproduce and get back to you asap.

Cheers,
Enrico

@vfranco ok I know what the problem is :smile: Your ntuple has 564 branches, which means that your Snapshot(...) call is just-in-time compiled into a function call with 564 template parameters (think WriteBranches<type1, type2, type3, type4>(...)).

Now, that takes a long while and a lot of RAM to compile. You never even reach the event loop, you just spend all your time jitting code.

This is issue ROOT-9468, and luckily I have a PR open that mitigates this problem by a large factor.

With the patch, it takes 20 seconds and a few hundred megabytes to run your macro. Most of the time is still spent in just-in-time compilation, but the situation is much, much better than before.

I don’t know if anything else can be done for such large Snapshots.
Would this fix the problem for you?

Cheers,
Enrico

First of all my apologies to @vfranco for hijacking his topic !

Enrico, I want to come back to your observation that spelling out the template types to Snapshot significantly reduces the memory increase.
You do not seem to be worried by the fact that it is not reduced to zero. Why is that?

I observed in RDFActionHelpers.hxx, routine Finalize, that you delete the TTree but do not seem to do anything with the branch data elements.

TTree *t = new TTree(...);
TMyClass *data = new TMyClass();
t->Branch("name", "TMyClass", &data);

I do not see a

delete data;

-Eddy

Hi Eddy,
there is still a tiny bit of jitting left in every dataframe instantiation (removing it is on the to-do list), so every time you instantiate the computation graph you pay a bit of RAM to the interpreter, which is only released at application teardown.

If you think we have a memory leak could you open a jira ticket with the exact ROOT version and line number please? I can’t find the pattern you mention.

Cheers,
Enrico

Hi Enrico,

The RDF…hxx code is very abstract and I do not dare to claim that the branch data is not destroyed. I just mentioned the pattern and hope that you know whether it was implemented.

-Eddy

@vfranco the patch that speeds up snapshots of a large number of branches was just merged.
If you have the possibility to try out ROOT’s master branch (you can also pick up the nightly builds from cvmfs if recompiling is too annoying) it would be great if you could let me know if this is a reasonable solution for you.

(personally I’m also curious what kind of analysis requires an ntuple with so many variables :smile:)

Cheers,
Enrico

Hi @eguiraud,

First of all, thanks for the help with this.
I’m trying to find the build in the nightlies folder on CVMFS to test it out, but as long as the RAM usage doesn’t explode, it should be fine.

Regarding the number of branches in the tuple: I agree that the analyses won’t actually use all of these variables. But in order to avoid having to re-submit jobs to the grid to get variables that were not present in the first place, I think it is fairly common to just generate a big tuple with everything possible and then trim it down afterwards to a more manageable size and set of variables.

Also, the tuple generation uses tools that calculate values for variables in blocks such as “kinematics”, so you get all of them even though one might only use 2 variables out of 10.

I agree it is a bit clunky.

Hi,
memory does not explode on my side; if it does for you, let us know.
The patch was merged today so you will have to wait for tomorrow’s nightlies :smile:
You can find them on cvmfs at /cvmfs/sft.cern.ch/lcg/views/dev3/latest/.

Cheers,
Enrico

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.