pyROOT filling of RNTuple and reloading/plotting example


Dear experts,

Thanks to some suggestions and discussions with @siliataider, I managed to get RNTuple from ROOT 6.36 working in the pyROOT code I am using.

In Python I often had to fill a dictionary of lists, build a pandas DataFrame from it and import that into RDataFrame, or alternatively dump the output for each processed event and then reload and concatenate everything. This works fine for 10-20 events, but for 1000 or more the run time becomes a bottleneck. I have not benchmarked whether flushing to an RNTuple while processing is faster, but I expect so. I therefore rewrote the same workflow with RNTuple, and it seems to work fine.

Here is the example code I ended up with. It roughly emulates what I do in a more complex filling setup, which of course has more branches to fill. I thought it would be useful to have something like this among the pyROOT tutorials; if so, let me know where it should be added or submitted as a pull request. My next step is to get the writer working with Python multiprocessing, and I was wondering whether you have any suggestions for making that work.

Thanks in advance,

Renato

import ROOT

RNTupleModel  = ROOT.RNTupleModel
RNTupleReader = ROOT.RNTupleReader
RNTupleWriter = ROOT.RNTupleWriter

class DummyStudy:
    def __init__(self, ofile='test.root'):
        self.model = RNTupleModel.Create()
        columns_add= { 
            'int'                    : [ 'int_dummy'  , 'int_dummy2'], 
            'double'                 : [ 'double_dummy', 'double_dummy2'],
            'bool'                   : [ 'bool_dummy' , 'bool_dummy2'],
            'std::vector<double>'    : [ 'vector_double_dummy', 'vector_double_dummy2']
        }
        for vartype, list_names in columns_add.items() : 
            for name in list_names: 
                print(f"adding type = {vartype} with name = {name}")
                self.model.MakeField[vartype](f"{name}")
        self.writer = RNTupleWriter.Recreate( self.model, "test_tree", ofile)        
        self.entry =  self.writer.CreateEntry()
    def fill(self, event_index):
        # event_index is accepted but not used in this dummy example.
        # Fields that are not set here (the *_dummy2 ones) are still written,
        # with default values (zero, or an empty vector), and without any warning.
        self.entry["int_dummy"] = 42
        self.entry["double_dummy"] = 42.
        self.entry["bool_dummy"] = False
        self.entry["vector_double_dummy"] = [9., 10., 20.]
        self.writer.Fill(self.entry)

dummyStudy = DummyStudy( ofile='test_fill_uniquecreate.root')
for i in range(100000): 
    dummyStudy.fill(i)
# Dropping the last reference destroys the writer, which flushes the data to the file.
del dummyStudy

# reload rdataframe from filled RNTuple
df = ROOT.RDataFrame("test_tree", "test_fill_uniquecreate.root")
hist = df.Histo1D("vector_double_dummy")
c = ROOT.TCanvas()
hist.Draw()
c.Draw()
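
On the multiprocessing question above, one possible pattern (a sketch, not a confirmed RNTuple recipe) is to let worker processes compute plain Python records while only the parent process fills the writer. This sidesteps the question of whether the writer can be shared between processes. The names `process_event` and `fill_all` are hypothetical, and a list stands in for the writer here; in the real code the `fill` callback would copy each record into the entry and call `writer.Fill`, as `DummyStudy.fill` does.

```python
import multiprocessing as mp


def process_event(event_id):
    """Hypothetical per-event computation; this part runs in a worker process."""
    return {
        "int_dummy": event_id,
        "double_dummy": 0.5 * event_id,
        "bool_dummy": event_id % 2 == 0,
        # The vector length varies from event to event.
        "vector_double_dummy": [float(event_id)] * (event_id % 4),
    }


def fill_all(records, fill):
    """Pass every record to fill(); meant to run only in the parent process."""
    n_filled = 0
    for record in records:
        fill(record)
        n_filled += 1
    return n_filled


if __name__ == "__main__":
    collected = []  # stand-in for the single writer owned by the parent process
    with mp.Pool(processes=2) as pool:
        # Results arrive in arbitrary order; use pool.imap to keep event order.
        results = pool.imap_unordered(process_event, range(1000), chunksize=64)
        n_filled = fill_all(results, collected.append)
    print(f"filled {n_filled} entries")
```

The `if __name__ == "__main__":` guard matters on macOS, where Python starts worker processes with `spawn` by default and re-imports the main module in each worker.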

Thanks @rquaglia ! Good to see it worked.
I will let the relevant people know so that this can be recycled into a proper tutorial; I believe it would be useful.

Also, just to understand:

Are there limits on the size of vector columns, or restrictions to keep in mind when filling an RNTuple, such as requiring all vector columns to have the same length?

In my more complex example (which I am debugging), I do not understand whether there are limitations for large vectors, and whether on a Mac there are options I can set to make sure nothing incompatible happens behind the scenes.