Chunksizing RDF.AsNumpy for large ROOT file

As of ROOT 6.28, there is no option for chunk-sizing the conversion of RDataFrame data to numpy data. This is inconvenient when processing very large ROOT file(s) (~100 GB), since there may not be enough memory to hold the converted numpy data. Is there a workaround for this, or is there a plan to add functionality such as chunk-sizing the conversion to reduce the memory requirements?

Hi @AlkaidCheng ,

thank you for the feedback. That’s correct, chunking is not supported. The idea behind the feature is that most of the data processing happens in RDF, and only a few selected events for a few columns are then dumped as numpy arrays at the end of the event loop for further in-memory processing.
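
For context, the intended pattern looks roughly like this (a sketch with placeholder tree, file and column names, not taken from the original posts): reduce the dataset with Filter/Define inside the event loop and only materialize the small selected subset as numpy arrays.

import ROOT

df = ROOT.RDataFrame("events", "file.root")   # placeholder tree/file names
selected = df.Filter("pt > 30")               # skim during the event loop
arrays = selected.AsNumpy(["pt", "eta"])      # dump only the selected events/columns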

Cheers,
Enrico

Hi @eguiraud ,
Thanks for the reply. Would it be possible to load only part of the ROOT file, e.g. with something like rdf = RDataFrame(treename, filename, start_event=0, end_event=100000), so that only a subset of the events is evaluated?
I also noticed that there is an npartitions argument in ROOT.RDF.Experimental.Distributed.[BACKEND].RDataFrame. Could a similar concept be used here to split the computation into multiple parts? I would imagine something like:

rdf_iterator = RDataFrame(treename, filename, npartitions=10)
# or 
# rdf_iterator = RDataFrame(treename, filename).createIterator(npartitions=10)
for rdf in rdf_iterator:
    arrays = rdf.AsNumpy()

Hi @AlkaidCheng ,

Ok, if you need to export a lot of data (more than fits in memory) as numpy arrays and you can’t skim and slim the data during the RDF event loop first, let’s explore some potential workarounds (although they won’t be the fastest possible) :smiley:.

Up to and including v6.26, you can do the chunking in a not-so-efficient way using Range (see the sketch below).
In the current trunk, which will soon become v6.28 (and which you can build from source if you want), there is a more performant way to apply a range to a whole dataset: RDatasetSpec, which you can then pass to RDataFrame's constructor. The only caveat is that we might still change the details of the RDatasetSpec interface before it is officially released.
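
For the Range-based approach, a minimal sketch could look like the following (not from the original posts; the tree name, file name, column names and chunk size are placeholders, and Range only works in single-threaded event loops):

import ROOT

treename = "events"            # placeholder tree name
filename = "large_file.root"   # placeholder file name
chunk_size = 1_000_000         # events per chunk, tune to the available memory

# total number of events in the dataset
n_events = ROOT.RDataFrame(treename, filename).Count().GetValue()

for start in range(0, n_events, chunk_size):
    df = ROOT.RDataFrame(treename, filename)
    # Range(begin, end) restricts the event loop to entries [begin, end)
    arrays = df.Range(start, start + chunk_size).AsNumpy(["px", "py"])
    # ... process this chunk's numpy arrays, then let them go out of scope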

Very similarly to npartitions in a distributed setting, in multi-thread executions the dataset is split into chunks that are processed by different threads. The problem is that all of this happens internally, and the only thing that is exposed, by design, is the final result: a single dictionary of numpy arrays.
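
To make that last point concrete, here is a minimal sketch (again with placeholder tree, file and column names): even with implicit multi-threading enabled, the per-thread chunks are merged internally and AsNumpy returns one dictionary mapping column names to numpy arrays.

import ROOT

# implicit multi-threading: RDF splits the dataset into per-thread chunks internally
ROOT.EnableImplicitMT()

df = ROOT.RDataFrame("events", "large_file.root")   # placeholder names

# only the merged final result is exposed: one dict {column name: numpy array}
arrays = df.AsNumpy(["px", "py"])                   # placeholder column names
print(arrays["px"].shape)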

I hope this helps!
Enrico

Thanks a lot! I had overlooked the Range feature, which should solve my issue for now. I will definitely try out RDatasetSpec when it is released.
