Filling vector branches in python in 2024

How does one fill a TTree with high performance in Python in 2024?

Use case: we read raw binary events in Python, and we want to fill a TTree with vector branches - where "vector" really means "variable-size branches", i.e. the typical use case from before we woke up in the columnar analysis era.

Issue: of course, plain pyROOT e.g.

tree_object_pt = ROOT.std.vector('float')()
tree.Branch('object_pt', tree_object_pt)
for event_raw in binary_events:
    event = do_something(event_raw)
    tree_object_pt.clear()  # reset the vector for each event
    for pt in event.pt:
        tree_object_pt.push_back(pt)
    tree.Fill()

is incredibly slow. I am not sure RDataFrame is actually meant for this use case, and other non-ROOT Python tools promise good performance, but sometimes good isn't as good as better.
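For reference, the usual way to avoid the slow per-object Python loop is to represent the jagged data columnar-style: one flat array of values plus an array of per-event counts, so the inner loop disappears entirely. A minimal numpy-only sketch (the `events` data is made up for illustration):

```python
import numpy as np

# Hypothetical per-event jagged data: variable-length pt arrays per event.
events = [np.array([10.0, 20.0]), np.array([5.0]), np.array([1.0, 2.0, 3.0])]

# Columnar representation: one flat values array plus per-event counts.
counts = np.array([len(e) for e in events], dtype=np.int64)
values = np.concatenate(events)

# Offsets mark where each event starts/ends inside the flat array;
# this is the layout that columnar writers consume directly.
offsets = np.zeros(len(counts) + 1, dtype=np.int64)
np.cumsum(counts, out=offsets[1:])

print(counts)   # [2 1 3]
print(offsets)  # [0 2 3 6]
```

With data in this shape, the per-event Python work is gone and the writer (whatever it is) only sees a few large arrays per chunk.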





I’m pretty sure RDataFrame can help here. Maybe @vpadulan or @mczurylo can give some hints.

That would be really helpful - unfortunately I could not find examples of how to fill an RDataFrame efficiently "in a loop" (only very nice examples of how to attach to a TTree that already exists).

OK, let’s try to ping again @vpadulan

Dear @rene,

Thanks for reaching out to the forum! Yes, RDataFrame can also be used to generate data and fill a TTree-based dataset. I'm not sure I grasp the full picture of your case, but there are many ways to fill a jagged array into a column with RDF. For example, you can take a look at this other post. Let me know if that is enough for your use case or you need anything more.

Cheers,
Vincenzo

Here is a stress-test example, @vpadulan - the Snapshot part is really slow:

import awkward as ak
import time
import sys
import numpy as np
import os

# Helper function to print with timestamps and time elapsed
def log(message, start_time):
    current_time = time.time()
    elapsed = current_time - start_time
    print(f"[{time.strftime('%H:%M:%S')}] +{elapsed:.2f}s: {message}")

# Start timer
start_time = time.time()
log("Starting the code execution...", start_time)

# Initial parameters for the example
n_events = 100000
wf_length = 2000
n_channels = 4

log("Generating data for channels...", start_time)
# Create data for channels
channel = np.array([np.ravel([np.ones(wf_length, dtype="uint16") * ch for ch in range(n_channels)]) for _ in range(n_events)])

log("Creating standardized data format...", start_time)
# Standardized Data Format
data = {
    'timestamp': ak.from_numpy(np.cumsum(np.ones(n_events, dtype='int64'))),
    'wf_v': ak.from_numpy(np.random.randint(0, 2**16 - 1, size=(n_events, wf_length * n_channels), dtype=np.uint16)),
    'wf_ch': ak.from_numpy(channel)
}

log("Calculating total data size in MB...", start_time)
# Calculate total size of the data dictionary in MB
total_size_bytes = sys.getsizeof(data)
for key, value in data.items():
    total_size_bytes += value.nbytes  # payload size of each awkward array in bytes
total_size_mb = total_size_bytes / (1024 ** 2)
print(f"Total dictionary size: {total_size_mb:.4f} MB")

log("Converting data to ROOT RDataFrame...", start_time)
time_A = time.time()
rdf = ak.to_rdataframe(data)
time_B = time.time()
print(f"ak.to_rdataframe took {time_B - time_A:.2f}s")

log("Saving RDataFrame to 'example_rdf.root'...", start_time)
rdf.Snapshot('tree', 'example_rdf.root')
time_C = time.time()
print(f"Snapshot took {time_C - time_B:.2f}s")

Plus, this writes the whole RDataFrame in one go - while for such sizes one might rather keep memory usage low and write chunks of events.
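On the chunking point: one pattern is to batch the event stream yourself and hand each batch to a writer that supports appending (e.g. uproot's writable `tree.extend`, assumed here and left commented out). The batching itself is plain numpy; a sketch with a made-up event generator:

```python
import numpy as np

def chunked(events, chunk_size):
    """Yield (values, counts) batches so only one chunk lives in memory."""
    batch = []
    for ev in events:
        batch.append(ev)
        if len(batch) == chunk_size:
            yield np.concatenate(batch), np.array([len(e) for e in batch])
            batch = []
    if batch:  # flush the final, possibly shorter, chunk
        yield np.concatenate(batch), np.array([len(e) for e in batch])

# Hypothetical stream of 10 variable-length events.
events = (np.ones(i % 3 + 1) for i in range(10))
sizes = []
for values, counts in chunked(events, 4):
    sizes.append(len(counts))
    # Each batch could then be appended to a file, e.g. with uproot:
    # f["tree"].extend({"object_pt": ak.unflatten(values, counts)})
print(sizes)  # [4, 4, 2]
```

This keeps peak memory proportional to the chunk size rather than the full dataset.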
