Question on memory consumption with RDF

Dear experts,

I have not yet done any detailed study of the problem I’m facing, but it seems that if I try to run RDF on lxplus machines, my job is killed due to too large a memory consumption.
For example, running with ROOT.ROOT.EnableImplicitMT(10) over 11 input files, each containing 100k events (5 GB in total), the job is killed after a few minutes.

So before doing more detailed profiling: are there any recommendations for running over very large datasets? I’d like to process up to 100 million events. Should we, for example, use more input files with fewer events per file? Or is nothing like this supposed to be necessary, and I most likely have a memory leak somewhere?

Thanks
Clement

Hi @clementhelsens,
the size of your input files should not matter: RDataFrame (like all ROOT analysis interfaces) reads them in fairly small chunks and discards each chunk once it has been analyzed.

RDataFrame itself should typically require around 30 MB per thread or so. On top of that there are the per-thread copies of your results: many large TH3Ds could take up some space, especially since with 10 threads you have 10 copies of each.

Another possible source of memory usage: if you don’t specify axis ranges for output histograms, RDF buffers all values and only computes the axis ranges and fills the histograms at the end of the event loop. But consider that 100 million doubles “only” occupy less than 800 MB, and it looks like your problem size is more in the order of 1 million “things” than 100 million.
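For example (a minimal sketch, with hypothetical tree, file and column names), passing an explicit histogram model avoids that buffering because the axis limits are known up front:

#include <ROOT/RDataFrame.hxx>

void fill_histos() {
    ROOT::RDataFrame df("events", "input.root"); // hypothetical tree and file names

    // explicit model: axis limits are known up front, values are filled directly
    auto h_fixed = df.Histo1D({"h", "h", 100, 0., 10.}, "x");

    // no model: RDF must buffer all values to compute the axis range at the end
    auto h_auto = df.Histo1D("x");

    h_fixed->Draw();
}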

I suggest you check what’s going on with valgrind --tool=massif ./yourprogram (possibly using a debug build of ROOT – there are several available on lxplus as LCG releases – those with a dbg in the name). That tool should provide a log of who allocates how much and where.

Cheers,
Enrico

EDIT: also, what ROOT version are you using? In versions before 6.22, just-in-time compilation of large RDF computation graphs could also require large amounts of memory. In 6.22 the issue has been largely mitigated – but in any case you typically need fairly large graphs (hundreds of Filters, Defines and actions) for this effect to be visible.

Thanks @eguiraud, I’m using root-6.20.04 and I have on the order of 30 .Define calls, so I will check with a newer version.
Cheers,
Clement

30 Defines should not (must not :smile:) be a problem. Your best bet to figure out what’s causing the memory usage is valgrind --tool=massif.

@eguiraud, I produced the valgrind output with the ROOT version we have from the key4hep software stack:
/cvmfs/sw.hsf.org/spackages/linux-centos7-broadwell/gcc-8.3.0/root-6.20.04-oml2a44t4nifq7zor3jxyi3zzttkumil/bin/root
I have no experience interpreting such output, and looking at it I cannot spot anything that would point me to a problem. In case you have some time, I have copied the file here:
/afs/cern.ch/user/h/helsens/public/4Root
In the meantime I will also try with a newer ROOT version, including a dbg build.

Thanks,
Clement

Hi @clementhelsens,
one way to look at the contents of that output is ms_print /afs/cern.ch/user/h/helsens/public/4Root/massif.out.9273 | less -S.

The most interesting lines are:

->75.45% (3,123,305,768B) 0x1032C168: TBuffer::TBuffer(TBuffer::EMode, int) (TBuffer.cxx:85)
| ->75.45% (3,123,305,768B) 0xFA68DC8: TBufferIO::TBufferIO(TBuffer::EMode, int) (TBufferIO.cxx:51)
|   ->75.45% (3,123,305,768B) 0xFA645A8: TBufferFile::TBufferFile(TBuffer::EMode, int) (TBufferFile.cxx:89)

which show that 3 GB out of 4 are allocated by TBuffer (i.e. by ROOT I/O). That’s not expected, but we have seen problems before with weirdly written ATLAS ROOT files.

@pcanal could this be the case (massif output attached for your convenience)?

@clementhelsens could you share one of the input files with us (or with me privately)?

Cheers,
Enrico

massif.out.9273.txt (151.6 KB)

Thanks @eguiraud, I have copied one of the input files into the same directory /afs/cern.ch/user/h/helsens/public/4Root
It is produced using EDM4Hep.

You might need the libraries here:
/cvmfs/sw.hsf.org/spackages/linux-centos7-broadwell/gcc-8.3.0/edm4hep-master-kopc27l5fhxopkwfblet2xrwh6dbd322/

Let me know if any other details about EDM4Hep are needed. I’m tagging @tmadlener, who can give more details, also about the underlying I/O mechanism, PODIO.

Cheers,
Clement


Hi @clementhelsens,
as far as I can tell (with @pcanal’s help), given the very high compression ratio of that file (more than 6), such a high memory usage is expected.

I believe this is a reproducer:

#include <TFile.h>
#include <TTree.h>
#include <TROOT.h>

#include <cstdlib>
#include <thread>
#include <vector>

// each thread opens its own copy of the file and reads the first 1000 entries of the tree
void read_entries() {
    TFile *_file0 = TFile::Open("/afs/cern.ch/user/h/helsens/public/4Root/events_014861423.root");
    auto t = _file0->Get<TTree>("events");

    for (int i = 0; i < 1000; ++i)
        t->GetEntry(i);
}

int main(int argc, char **argv) {
    // number of reader threads, optionally taken from the command line
    int n_threads = 4;
    if (argc > 1)
        n_threads = std::atoi(argv[1]);

    ROOT::EnableThreadSafety();

    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i)
        threads.emplace_back(read_entries);
    for (int i = 0; i < n_threads; ++i)
        threads[i].join();

    return 0;
}

You can increase the memory usage by increasing the number of threads: each thread opens the TTree and occupies around 400 MB with the staging areas of the various decompressed branches.

Maybe @pcanal or @Axel can suggest mitigation measures.

Sorry I can’t be of more help!
Enrico

Hi @eguiraud,

In principle we do not touch the default settings when we create the output file. Is there an easy way to check which compression level was actually used after the file has been written? Additionally, if we do not explicitly touch the compression settings, what else could affect them?

Cheers,
Thomas

@eguiraud, thanks for this reproducer. If each thread already takes 400 MB, then since I was using ~10 threads via ROOT.ROOT.EnableImplicitMT(10), that explains the very large memory usage.
If you have suggestions on the best practice for processing a very large amount of data (for FCC studies at the Z-pole it is ultimately supposed to be 10^12 events, but starting with 10^9 would already be great), please let us know. I was perhaps under the naive impression that I could use RDF on a local lxplus machine for up to 10^9 events within a reasonable amount of time (~30 min), but maybe not?

Cheers,
Clement

Note that this memory usage does not depend on the amount of data: it is a function of the compression ratio of the TTree (which you can see with TTree::Print) and of the number of threads (because every thread opens its own copy of the tree).
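For example (a quick sketch, with hypothetical file and tree names), the compression factor can be read off the TTree::Print output or computed from the tree’s byte counts:

#include <TFile.h>
#include <TTree.h>
#include <iostream>
#include <memory>

void check_compression() {
    // hypothetical file and tree names
    std::unique_ptr<TFile> f{TFile::Open("events.root")};
    auto *t = f->Get<TTree>("events");

    t->Print(); // per-branch summary, including the compression factor

    // overall in-memory (uncompressed) size vs on-disk (compressed) size
    const double ratio = double(t->GetTotBytes()) / double(t->GetZipBytes());
    std::cout << "compression ratio: " << ratio << '\n';
}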

How many cores does that machine have? I would expect a machine with 8 or more cores to have at least 16 GB of RAM (I think it’s common on the grid to get 1 GB of RAM per core).

@tmadlener: @pcanal is the right person to reply to your questions in more detail.
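In the meantime, one quick check (a sketch; these are standard TFile accessors, and the file name below is hypothetical) is to ask the written file which compression settings it actually carries:

#include <TFile.h>
#include <iostream>
#include <memory>

void check_settings() {
    std::unique_ptr<TFile> f{TFile::Open("output.root")}; // hypothetical file name
    std::cout << "algorithm: " << f->GetCompressionAlgorithm() << '\n'
              << "level:     " << f->GetCompressionLevel() << '\n'
              << "settings:  " << f->GetCompressionSettings() << '\n';
}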

That’s a 600 kHz event processing rate – it’s doable in general, depending on the workload (producing 10 histograms vs producing 1000), the latency of the storage (local disk vs EOS) and the compression algorithm (lzma-compressed files can take an order of magnitude longer to decompress than zstd-compressed files).

For complex applications, especially on shared machines that read from not-so-fast network like the lxplus nodes, 600 kHz might prove challenging – but I’m glad to be proven wrong!

You can measure your event processing rate by running on just 10M events; it should stay fairly constant as you throw more data at RDF (the only caveat is that the more clusters of entries are available, the better RDF can parallelize, so you could get worse performance with little data when running with many threads).
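A rough way to do that measurement (a sketch, with hypothetical tree, file and column names) is to time a small event loop and divide the number of processed events by the elapsed wall-clock time:

#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <TStopwatch.h>
#include <iostream>

void measure_rate() {
    ROOT::EnableImplicitMT(10);
    // hypothetical subset of the dataset, e.g. ~10M events
    ROOT::RDataFrame df("events", "subset/*.root");

    auto count = df.Count();               // booked in the same event loop
    auto hist = df.Histo1D("some_column"); // hypothetical column name

    TStopwatch sw;     // starts timing on construction
    hist.GetValue();   // triggers the (single) event loop
    std::cout << "event rate: " << *count / sw.RealTime() << " events/s\n";
}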

Thanks @eguiraud, I believe we are making great progress here. On the lxplus machine I am using at the moment I have 10 cores, so I decided to run one thread per core over 50M events (500 files with 100k events each). The job ran for ~25 min before it got killed by the machine. Monitoring the memory usage, I saw it increase over ~15 min up to ~48% of the machine’s capacity, and then stay roughly constant until the job got killed. I’m not sure I understand this behaviour: why would it take 15 min for the 10 threads to fill up all the required memory? I reduced this to 8 threads and the job seems to be running fine, with a constant memory usage of ~40%, reached after ~10 min.

Concerning the I/O, I read from EOS and also write to EOS. I know that is not optimal, but given the amount of data we do not have other solutions at the moment. What is nowadays the best way to process data from EOS? I am not using xrootd; maybe that could help?

Also, if there is a specific compression we should be using to further improve the processing time, that would be good to know as well!

Cheers and thanks a lot for the help.
Clement

EDIT 12/11/2020
Adding more input data, it crashed with the following error (full stack trace attached). Is that an expected behaviour when the output file is larger than 100 GB? Do users have to chunk the jobs in such a way that the output files stay within a reasonable size?

Fill: Switching to new file: /eos/experiment/fcc/ee/tmp/flatntuples/Z_Zbb_Flavor_Uproot_test4/p8_ee_Zbb_ecm91__1.root
Fatal in <TFileMerger::RecursiveRemove>: Output file of the TFile Merger (targeting /eos/experiment/fcc/ee/tmp/flatntuples/Z_Zbb_Flavor_Uproot_test4/p8_ee_Zbb_ecm91.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting

rdf_stack.pdf (87.4 KB)

It shouldn’t, or at least it should take little time to get very close to the maximum, and then memory might slowly increase a little as vectors get filled, buffers get resized, etc., but nothing drastic. You can check what the memory usage looks like over time with psrecord.

I’m not an EOS expert, but I don’t think you can get higher throughput than what you get by simply opening the file, e.g. using xrootd instead (it’s easy to test, though).
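If you want to try it, the test is just a matter of pointing RDataFrame at an xrootd URL instead of the FUSE-mounted path (a sketch; the endpoint and file path below are assumptions, check which xrootd server serves your EOS area):

#include <ROOT/RDataFrame.hxx>

void open_via_xrootd() {
    // hypothetical endpoint and path: use the xrootd server of your EOS instance
    ROOT::RDataFrame df_xrootd("events", "root://eospublic.cern.ch//eos/experiment/fcc/ee/some_file.root");

    // versus reading through the FUSE mount
    ROOT::RDataFrame df_fuse("events", "/eos/experiment/fcc/ee/some_file.root");

    // then run the same analysis on both and compare the throughput
}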

Different compression algorithms have very different decompression times, which will probably impact your event processing rate. Very roughly, lzma is best compression/worst speed, lz4 is worst compression/best speed, and zstd is typically a good compromise (but you have to benchmark with your own data).
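If your output is written with RDataFrame’s Snapshot, the compression of the output file can be chosen through ROOT::RDF::RSnapshotOptions (a sketch; the exact spelling of the algorithm enum may differ between ROOT versions, so treat the names below as assumptions to check against your version’s headers):

#include <ROOT/RDataFrame.hxx>

void write_with_zstd() {
    ROOT::RDataFrame df("events", "input.root"); // hypothetical input tree/file

    ROOT::RDF::RSnapshotOptions opts;
    opts.fCompressionAlgorithm = ROOT::RCompressionSetting::EAlgorithm::kZSTD; // check enum name for your ROOT version
    opts.fCompressionLevel = 5;

    // hypothetical output tree, file and column names
    df.Snapshot("events", "output_zstd.root", {"some_column"}, opts);
}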

Ugh, that’s a bug that was recently fixed in ROOT master; the fix will be part of the upcoming v6.22/04 and v6.24 releases.
In the meantime you can probably work around it by adding the following line at the beginning of your script:

TTree::SetMaxTreeSize(std::numeric_limits<Long64_t>::max())

(or the equivalent PyROOT call, something like ROOT.TTree.SetMaxTreeSize(thebiggestLong64_tvalue)).

Cheers,
Enrico

Thanks @eguiraud, that solved the issue!


Let us know if you have further memory usage/performance issues!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.