I have tried both python and C++, and get the same behaviour. Is there a way of looping through the entries in a TTree that doesn’t load the whole thing into memory?
As Wile mentioned, ROOT already loads the data one chunk at a time rather than all at once. However, even if it were loading all the data at once (300 MB), you should not be running out of memory (assuming your machine has several GB of RAM).
So something else is going on. A straightforward solution would be (if you can) to run your failing example on Linux and use the tool valgrind to pinpoint the problem. Alternatively, you can try cutting portions of your code until it stops failing, which might give you an indication of the issue. Another alternative is to build your code in debug mode and use the debugger to find out where it fails.
Something along these lines with RDataFrame should help:
import datetime
import ROOT
df = ROOT.RDataFrame("TTreeNameHere", "FileNameHere")
# Five hourly boundaries starting from now, as Unix timestamps
hours = [(datetime.datetime.now() + datetime.timedelta(hours=x)).timestamp() for x in range(5)]
opts = ROOT.RDF.RSnapshotOptions()
opts.fLazy = True  # Prevents Snapshot calls from triggering the execution right away
# Book all the different Snapshot calls in advance
# (treeName, FileName and listOfCols are placeholders that you need to define)
snapshots = [
    df.Filter(f"timestamp >= {hour_begin} && timestamp < {hour_end}")
      .Snapshot(treeName, FileName, listOfCols, opts)
    for hour_begin, hour_end in zip(hours[:-1], hours[1:])
]
# Trigger execution of one of the Snapshots; all the others will be executed in the same event loop
snap_df = snapshots[0].GetValue()
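The hour boundaries and filter strings used above can be built and checked without ROOT installed. Here is a minimal sketch of that part only, assuming the branch is called timestamp and holds Unix seconds, as in the snippet. The helper names hourly_windows and filter_expression, the fixed start time, and the hour_{i}.root naming scheme are hypothetical, chosen just for illustration; the idea is that each hourly window gets its own output file name to pass to Snapshot.

```python
import datetime


def hourly_windows(start, n_hours):
    """Return (begin, end) Unix-timestamp pairs for n_hours consecutive one-hour windows."""
    edges = [(start + datetime.timedelta(hours=x)).timestamp() for x in range(n_hours + 1)]
    return list(zip(edges[:-1], edges[1:]))


def filter_expression(begin, end, branch="timestamp"):
    """Build the selection string for one window (branch name is an assumption)."""
    return f"{branch} >= {begin} && {branch} < {end}"


# Hypothetical fixed start time so the result is reproducible
start = datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc)
windows = hourly_windows(start, 4)

# One (output file name, filter string) pair per hourly window
jobs = [(f"hour_{i}.root", filter_expression(b, e)) for i, (b, e) in enumerate(windows)]
for name, expr in jobs:
    print(name, expr)
```

Each pair in jobs could then feed one df.Filter(expr).Snapshot(...) booking, so the windows do not end up overwriting the same file.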
Yes, I have 8GB of RAM, so I wouldn’t expect it to actually have run out of memory.
I don’t have easy access to a Linux version of ROOT, so I would like to leave that as a last resort. However, I based the code on a colleague’s Python script using ROOT that runs fine on Linux using
tree = f1.Get("Board 0")
for evt in tree:
I’m confident that it’s the
tree->GetEntry(i)
call that is the problem. If I remove that then it runs without error, and running in the debugger shows it failing at that line.
It seems strange to me that you explicitly set: #define _HAS_CXX17 1
I think it may lead to severe problems (on Windows; it doesn’t seem to be used on Linux).
You’re welcome! Note that my snippet above is not 100% working code. I also just updated it to make it more realistic, but you may still need to adjust it. I also added the link to the RDataFrame docs in case you need more information about the different parts of the API shown.
Hi, just adding another note here: this pattern (the for evt in tree loop) is literally the worst thing you can do in terms of performance when writing a Python script that processes a TTree, as described here. Please avoid this pattern at all costs.
Thanks for testing it, that’s the result I would expect. It looks like it’s something specific to either Windows or my machine then.
The #define _HAS_CXX17 1 allows me to use C++17 features when I compile it as a normal C++ application, which I often do for speed of testing in Visual Studio. Without it my full code does not run in ROOT. Removing that line from TestSnippet.cpp doesn’t change the behaviour.
FYI, I managed to reproduce the crash. I will investigate.
P.S. Even a simple tree->Draw("timeStamp") doesn’t work:
root [0] TFile* tf = TFile::Open("run277_lf.root", "READ");
root [1] TTree* tree = dynamic_cast<TTree*>(tf->Get("Board 0"));
root [2] tree->Draw("timeStamp");
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
Error in <TRint::HandleTermInput()>: std::bad_alloc caught: bad allocation
root [3]
But works with the energy branch:
C:\Users\bellenot\Downloads>root -l
root [0] TFile* tf = TFile::Open("run277_lf.root", "READ");
root [1] TTree* tree = dynamic_cast<TTree*>(tf->Get("Board 0"));
root [2] tree->Draw("energy");
Info in <TCanvas::MakeDefCanvas>: created default TCanvas with name c1
root [3]
I am glad that RDF could work for you! But pay attention to one very important detail: it is not by chance that I added opts = ROOT.RDF.RSnapshotOptions() in my code example above. In your latest snippet you are doing
for i in range(nHours):
    ...
    df.Filter().Snapshot()
This means that you are creating the new files one at a time, one per iteration. The reason is that the Snapshot method of RDataFrame is an “instant” method by default, that is, it gets executed as soon as you call it. One of the best features of RDF is its laziness: you can book all the operations you want to run on the dataset, then execute them all just once in the same event loop. You can force Snapshot to be lazy too, with the RSnapshotOptions I was showing in my example above.
If you do that, you will fill all the files in the same event loop all together, instead of doing it one event loop per iteration of your for loop. It can potentially save you a lot of time.
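To make the difference concrete, here is a toy pure-Python model of the "book first, run once" idea. It is only a sketch and not the RDataFrame API: ToyFrame, book and run are hypothetical names invented for this illustration, and a plain list of integers stands in for the tree entries.

```python
class ToyFrame:
    """Toy stand-in for a lazy data frame; NOT the RDataFrame API."""

    def __init__(self, rows):
        self._rows = rows
        self._booked = []   # list of (predicate, output list) pairs
        self.loops_run = 0  # how many times the data has been traversed

    def book(self, predicate):
        """Register a selection without running it; return the list it will fill."""
        out = []
        self._booked.append((predicate, out))
        return out

    def run(self):
        """Single pass over the data, filling every booked output at once."""
        self.loops_run += 1
        for row in self._rows:
            for predicate, out in self._booked:
                if predicate(row):
                    out.append(row)


frame = ToyFrame(list(range(10)))
low = frame.book(lambda r: r < 5)    # booked, nothing executed yet
high = frame.book(lambda r: r >= 5)  # booked, nothing executed yet
frame.run()                          # one loop fills both outputs
print(frame.loops_run, low, high)
```

With an "instant" approach you would instead traverse the data once per selection, which is what happens when Snapshot is called inside the for loop without the lazy option.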
Cheers,
Vincenzo