How does one fill a TTree with high performance in Python in 2024?
Use case: we read raw binary events in Python, and we want to fill a TTree with vector branches - where "vector" really means "variable-size branches", i.e. the typical use case before we woke up in the columnar-analysis era.
Issue: of course, plain PyROOT, e.g.
tree_nobject = np.zeros(1, dtype=np.int32)
tree_object_pt = ROOT.std.vector('float')()
tree.Branch('nobject', tree_nobject, 'nobject/I')
tree.Branch('object_pt', tree_object_pt)
for event_raw in binary_events:
    event = do_something(event_raw)
    tree_nobject[0] = event.n_objects
    tree_object_pt.clear()  # reset the vector before refilling it
    for pt in event.pt:
        tree_object_pt.push_back(pt)
    tree.Fill()
is incredibly slow. I am not sure RDataFrame is actually meant for this use case, and other non-ROOT Python tools promise good performance, but sometimes good isn't as good as better.
That would be really helpful - unfortunately I could not find examples of how to fill RDataFrames efficiently "in a loop" (only examples of how to attach to a TTree which already exists).
Thanks for reaching out to the forum! Yes, RDataFrame can also be used to generate data and fill a TTree-based dataset. I'm not sure I grasp the full picture of your case, but there are many ways to fill a jagged array into a column with RDF. For example, you can take a look at this other post. Let me know if that is enough for your use case or if you need anything more.
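For illustration, here is a minimal sketch (not from the original thread) of what that can look like: an empty RDataFrame is given a number of entries, a variable-size column is defined per entry, and Snapshot writes everything out. The entry count, column names, and the use of gRandom are all made up for this example.

import ROOT

# Hypothetical sketch: generate 1000 events, each with a jagged
# (variable-size) column, and write them to a TTree with Snapshot.
df = ROOT.RDataFrame(1000)
df = df.Define("n_objects", "gRandom->Integer(10)")
df = df.Define("object_pt",
               "ROOT::RVecF v(n_objects);"
               "for (auto &x : v) x = gRandom->Uniform(0., 100.);"
               "return v;")
df.Snapshot("tree", "generated.root")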
Here is a stress-test example, @vpadulan - the Snapshot part is really slow:
import awkward as ak
import time
import sys
import numpy as np
import os
# Helper function to print with timestamps and time elapsed
def log(message, start_time):
    current_time = time.time()
    elapsed = current_time - start_time
    print(f"[{time.strftime('%H:%M:%S')}] +{elapsed:.2f}s: {message}")
# Start timer
start_time = time.time()
log("Starting the code execution...", start_time)
# Initial parameters for the example
n_events = 100000
wf_length = 2000
n_channels = 4
log("Generating data for channels...", start_time)
# Create data for channels
channel = np.array([np.ravel([np.ones(wf_length, dtype="uint16") * ch for ch in range(n_channels)]) for _ in range(n_events)])
log("Creating standardized data format...", start_time)
# Standardized Data Format
data = {
    'timestamp': ak.from_numpy(np.cumsum(np.ones(n_events, dtype='int64'))),
    'wf_v': ak.from_numpy(np.random.randint(0, 2**16 - 1, size=(n_events, wf_length * n_channels), dtype=np.uint16)),
    'wf_ch': ak.from_numpy(channel)
}
log("Calculating total data size in MB...", start_time)
# Calculate total size of the data dictionary in MB
total_size_bytes = sys.getsizeof(data)
for key, value in data.items():
    total_size_bytes += value.nbytes  # size of each awkward array's buffers in bytes
total_size_mb = total_size_bytes / (1024 ** 2)
print(f"Total dictionary size: {total_size_mb:.4f} MB")
log("Converting data to ROOT RDataFrame...", start_time)
time_A = time.time()
rdf = ak.to_rdataframe(data)
time_B = time.time()
log("Saving RDataFrame to 'example_rdf.root'...", start_time)
rdf.Snapshot('tree', 'example_rdf.root')
time_C = time.time()
print(f"ak.to_rdataframe: {time_B - time_A:.2f}s, Snapshot: {time_C - time_B:.2f}s")
Plus, this writes the whole RDataFrame in one go - while for such sizes one might rather want to keep memory usage low and write events in chunks; a chunked-writing sketch follows below.
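For reference, one way to write in chunks without going through RDataFrame is uproot's tree-writing API, which accepts awkward arrays and can append to an existing tree with extend. This is only a sketch under the assumption that the data dict and n_events from the snippet above are in scope; the chunk size and file name are arbitrary.

import uproot

chunk_size = 10_000  # arbitrary choice for this sketch
with uproot.recreate("example_uproot.root") as f:
    for start in range(0, n_events, chunk_size):
        # Slice every column to the same event range
        chunk = {k: v[start:start + chunk_size] for k, v in data.items()}
        if start == 0:
            f["tree"] = chunk        # the first chunk creates the TTree
        else:
            f["tree"].extend(chunk)  # later chunks append to it

Writing chunk by chunk keeps only one slice of the columns in memory at a time, at the cost of one basket per branch per extend call, so very small chunks make the output file less efficient to read.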