Problem: getting all zeros from a branch in a ROOT file (PyROOT)

Hello experts,

First of all, here is the piece of code:

for filename in sorted(os.listdir(path)):
	if filename.endswith('.root'):
		#chain.Add(filename)
		f = ROOT.TFile(filename,"read")
		counts = np.zeros((1,1,48,48), dtype=np.float32) 
		t = f.Get("tree")
		t.SetBranchAddress("branch", counts)
		box.append(counts)
		n += t.GetEntries()

As you can see, branch is a [1][1][48][48] array, and I would like to store its content in a
list, one element per entry. The problem is that both the variable counts and the list box are
filled with all zeros. I really cannot understand why. Can anyone help?

Actually, I am able to read the data by making a chain. I then process it with a
pandas dataframe, but this is really slow and requires a lot of memory,
so much so that the process is sometimes killed. (Consider that it takes a minute
and a half just to read 233088 entries.) Anyway, this is the working code:

for filename in sorted(os.listdir(path)):
	if filename.endswith('.root'):
		chain.Add(filename)
for event in chain:
	counts = event.branch
	counts.SetSize(2304)
	box_counts.append(np.array(counts,copy=True))

So my question actually goes further: is there a faster way to retrieve a large amount of data
from a branch?

Thanks in advance


ROOT Version: 5.34.36
Platform: Not Provided
Compiler: Not Provided


Hello,

I see that you are using ROOT 5.34, would it be an option for you to move to a newer ROOT version?

In order to get ROOT data into NumPy arrays, we now provide in ROOT 6.22 the AsNumpy function, which is used together with RDataFrame:

https://root.cern/doc/master/df026__AsNumpyArrays_8py.html

With AsNumpy, you would get a numpy array for your array branch where every position contains a flat std::vector which has the data for a particular entry of the tree. You could wrap those std::vectors with numpy arrays (arr = np.asarray(vector)) also. This should be more efficient than looping over the events in Python.

Well, this is amazing! Thanks a lot!
I will try it soon.

Hi!

Here is a reproducer showing what the readout looks like:

import ROOT
import numpy as np

# Write a file with a multidimensional c-style array
f = ROOT.TFile("file.root", "recreate")
counts = np.random.randn(4*4).astype(np.float32).reshape((1,1,4,4))
print('counts', counts)
t = ROOT.TTree("tree", "tree")
t.Branch("branch", counts, "branch[1][1][4][4]/F")
t.Fill()
f.Write()
f.Close()

# Read the array back as numpy array
df = ROOT.RDataFrame("tree", "file.root")
npy = df.AsNumpy()
arr = np.asarray(npy["branch"][0]).reshape((1,1,4,4))
print('branch', arr)

Output:

counts [[[[ 1.1287223  -1.6671165   1.1139607  -1.4637073 ]
   [ 1.0930232  -0.6981371  -0.32015494  0.123219  ]
   [-0.70519173  0.42165682  2.1224205  -1.0007468 ]
   [ 1.1853412   1.5306437   0.7852444  -0.8142945 ]]]]
branch [[[[ 1.1287223  -1.6671165   1.1139607  -1.4637073 ]
   [ 1.0930232  -0.6981371  -0.32015494  0.123219  ]
   [-0.70519173  0.42165682  2.1224205  -1.0007468 ]
   [ 1.1853412   1.5306437   0.7852444  -0.8142945 ]]]]

Be aware that the array is read back flat and you have to reshape it again to the original shape.
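
For many entries, the same reshape can be applied to the whole column at once. Here is a minimal sketch, assuming each entry arrives as a flat sequence of 48*48 = 2304 floats (the shape from the original question). The flat_entries list is made-up stand-in data playing the role of npy["branch"], and stack_entries is a hypothetical helper name, not part of ROOT:

```python
import numpy as np


def stack_entries(flat_entries, shape=(1, 1, 48, 48)):
    """Turn a sequence of flat per-entry arrays into one (N, *shape) array."""
    stacked = np.stack([np.asarray(v, dtype=np.float32) for v in flat_entries])
    return stacked.reshape((len(flat_entries),) + tuple(shape))


# Stand-in data: 5 entries of 2304 floats each (not read from a ROOT file).
rng = np.random.default_rng(0)
flat_entries = [rng.standard_normal(48 * 48).astype(np.float32) for _ in range(5)]

events = stack_entries(flat_entries)
print(events.shape)
```

Having one contiguous array avoids keeping a Python list of many small arrays, which is likely part of the memory pressure described in the first post.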

However, consider doing the analysis of the data with RDataFrame and push only what you need as numpy array to Python. That’s way more efficient and also runs natively multithreaded!

You can find the entry point to the RDataFrame docs here: https://root.cern/doc/master/classROOT_1_1RDataFrame.html

Best
Stefan

Thanks again Stefan!
RDataFrame seems really powerful, and it is exactly what I need!
I don’t want to bother you any further, but just to understand: why did
the SetBranchAddress() method not work for you?

You mean for writing the tree? Not using the SetBranchAddress approach was not on purpose; I just put together a suitable ROOT file to make the point!

FYI: You can check the layout of the branch in your files by opening a file with root -l filename.root and then typing treename->Print().
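
Coming back to the all-zero buffer in the first snippet of this thread: SetBranchAddress only registers the buffer, and nothing is read into it until GetEntry(i) is called, which that loop never does. In addition, box.append(counts) stores a reference to the same array every time, so each entry needs its own copy. Below is a minimal sketch of the per-entry readout, assuming the tree object offers GetEntries() and GetEntry(i) as a TTree does. The names read_all_entries and FakeTree are hypothetical; FakeTree is only a stand-in so that the snippet runs without a ROOT file:

```python
import numpy as np


def read_all_entries(tree, buffer):
    """Copy the bound buffer once per entry.

    `tree` is expected to behave like a TTree on which
    SetBranchAddress("branch", buffer) has already been called.
    """
    box = []
    for i in range(tree.GetEntries()):
        tree.GetEntry(i)           # this call is what fills `buffer`
        box.append(buffer.copy())  # copy, otherwise every element aliases `buffer`
    return box


class FakeTree:
    """Hypothetical stand-in for a TTree, used only for demonstration."""

    def __init__(self, entries, buffer):
        self._entries = entries
        self._buffer = buffer

    def GetEntries(self):
        return len(self._entries)

    def GetEntry(self, i):
        # Mimic ROOT writing entry i into the registered buffer in place.
        self._buffer[...] = self._entries[i]
        return self._buffer.nbytes


counts = np.zeros((1, 1, 48, 48), dtype=np.float32)
stored = [np.full((1, 1, 48, 48), k, dtype=np.float32) for k in range(3)]
box = read_all_entries(FakeTree(stored, counts), counts)
print(len(box), box[0].mean(), box[2].mean())
```

With a real file, the same loop would run after t.SetBranchAddress("branch", counts). It is still a Python-level loop over entries, so it is not expected to be faster than the chain-based version.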

Sorry for bothering you, swunsch. Is there a way to collect only certain
branches of a tree with RDataFrame? I was reading the guide but didn’t see
anything useful, so if there is no solution I won’t lose any more time on it!
Thanks again, and sorry.

Hi!

I appreciate that you ask! You can just use df.AsNumpy(['name_branch1', 'name_branch2']) to push only a subset of the dataset to numpy. Actually, that’s highly recommended because it’s simply much more efficient just to load what you need.

There are even more options to select what you need (e.g. you can also just exclude a subset of branches). A tutorial can be found here: https://root.cern/doc/master/df026__AsNumpyArrays_8py.html
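
As a sketch of the two selection styles mentioned above, assuming a ROOT version that provides RDataFrame.AsNumpy with its columns and exclude arguments: read_branches is a hypothetical helper name, and the tree, file and branch names in the usage lines are placeholders.

```python
def read_branches(tree_name, file_name, columns=None, exclude=None):
    """Return a dict mapping branch names to numpy arrays.

    ROOT is imported inside the function so that defining this helper
    does not itself require a ROOT installation.
    """
    import ROOT

    df = ROOT.RDataFrame(tree_name, file_name)
    # columns: branches to read; exclude: branches to leave out.
    return df.AsNumpy(columns=columns, exclude=exclude)


# Usage (placeholder names, needs an existing ROOT file):
# only_branch = read_branches("tree", "file.root", columns=["branch"])
# all_but_branch = read_branches("tree", "file.root", exclude=["branch"])
```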

Best
Stefan

Yes, I had already seen that, thanks!
It’s just that I was thinking of working with RDF instead of numpy, but it’s the same!
Thank you again Stefan the Saver!
