No copy conversion from TTreeReaderArray<float> to numpy.array

Dear experts,

Is there a way to convert the content of a TTreeReaderArray<float> into a numpy.array without a copy? I can do it with the normal TTree interface:

import numpy as np

for event in tree:
  np.frombuffer(event.Jet_eta, count=event.nJet, dtype='float32')

But the same does not work with a TTreeReaderArray; it seems to allocate less memory than is actually used:

np.frombuffer(event.Jet_pt.GetAddress(), count=event.Jet_pt.GetSize(), dtype='float32')
*** ValueError: buffer is smaller than requested size

(in this case the size is 7)
If I run with count=3 I get the first three entries correctly, though.
On the Mattermost channel it was suggested to first call GetSize() to force the reader to fill the buffer, but that does not help. Nor does it help to loop over all the entries once.

Thanks!

ROOT Version: 6.12/07
Platform: SLC6
Compiler: GCC 7.0.0


Hi @mverzett1

I am afraid you cannot avoid the copy since TTreeReaderArray will also reuse the same space for reading subsequent entries. AFAIK it initially allocates a buffer with the maximum size it finds in the current basket for the size branch. If necessary, for a given entry it will reallocate with a bigger size.

Hi @etejedor,

I’m fine with the array becoming unavailable at the next event; this is mostly to speed up some computations. But apparently it does not even read the full array for the event, or at least does not make it visible. Is there a way to force it to do so?

Thanks.

GetSize should tell you the right size of the array. You can also try creating a TTreeReaderValue for the size branch and using it in frombuffer. In that case, are all the elements of the array there?

That’s the problem.
I can easily get the size in any of the ways you mentioned, and I get consistent outputs, but once I try to convert the GetAddress() output with frombuffer it complains that the buffer is smaller than the requested size. If I reduce the count I can correctly get the first three elements out of 7 (not sure if that is important).

Hi,
there are a few things to note:
you do need to trigger the loading of the TTreeReaderArray contents before using them, because loading is lazy.

It’s very weird, verging on impossible, that Python is able to tell that TTreeReaderArray’s buffer is too small: C pointers do not carry size information, so Python should have no way to tell where the buffer “ends”. How does that work?
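As a numpy-only illustration of where such a size check could come from (a hypothetical stand-in using a ctypes buffer, not the actual TTreeReaderArray mechanics): frombuffer validates count against the size reported by the Python buffer object wrapping the pointer, not against the raw C pointer.

```python
import ctypes
import numpy as np

# A 3-element float buffer standing in for the array contents.
buf = (ctypes.c_float * 3)(1.0, 2.0, 3.0)

# Within the reported buffer size, frombuffer works fine.
ok = np.frombuffer(buf, count=3, dtype='float32')
print(ok)  # [1. 2. 3.]

# Asking for more elements than the buffer object exposes fails,
# reproducing the error from the original post.
try:
    np.frombuffer(buf, count=7, dtype='float32')
except ValueError as e:
    print(e)
```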

One other quirk that I just remembered is that TTreeReaderArray does not guarantee that all elements of the array will be stored in contiguous memory: if the elements of the array you are reading are datamembers of objects written to the TTree in arrays, TTreeReader might have to skip the other datamembers of the parent objects when going from one array element to the other, causing “gaps” in their addresses.
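A hedged analogy with a numpy structured array (not ROOT itself) shows the kind of gaps meant here: a view of one data member of an array of records steps over the other members, so it is not contiguous in memory.

```python
import numpy as np

# An array of 4 "particle" records; each record is 12 bytes.
particles = np.zeros(4, dtype=[('pt', 'f4'), ('eta', 'f4'), ('phi', 'f4')])

# A view of the 'pt' member alone: consecutive elements are
# 12 bytes apart (the record size), not 4 (the float size).
pt = particles['pt']
print(pt.strides)                # (12,)
print(pt.flags['C_CONTIGUOUS'])  # False
```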

You might also want to double-check that GetAddress always returns the same address as &arr[0] (I think so, but I’m not 100% sure and the docs are hazy on the topic).

With ROOT on master (or probably 6.16/02), you can do the following. This could be a solution for your task!

# Source latest ROOT, e.g., on lxplus
source /cvmfs/sft.cern.ch/lcg/views/dev3/latest/x86_64-slc6-gcc8-opt/setup.sh
import ROOT
import numpy as np

# Read a file from EOS with float arrays in it, e.g., the Muon_pt branch.
# We read only 5 events of the dataset, see the Range(5) method.
filename = "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root"
data = ROOT.ROOT.RDataFrame("Events", filename)\
                .Range(5)\
                .AsNumpy(["Muon_pt"])

# Iterate over the data
for i, x in enumerate(data["Muon_pt"]):
    print("Event: {}".format(i))
    # The data dictionary contains numpy arrays of C++ types,
    # in this case, ROOT::RVec<float> objects
    print("ROOT::RVec<float>: {}".format(x))
    # However, you can adopt their memory (zero-copy!) with numpy
    y = np.asarray(x)
    print("numpy.array: {}".format(y))
Event: 0
ROOT::RVec<float>: { 52.0083f, 42.8570f }
numpy.array: [52.008335 42.85704 ]
Event: 1
ROOT::RVec<float>: { 5.01995f }
numpy.array: [5.0199485]
Event: 2
ROOT::RVec<float>: { 15.9674f, 12.4813f }
numpy.array: [15.967432 12.48129 ]
Event: 3
ROOT::RVec<float>: { 53.4283f, 38.4376f }
numpy.array: [53.428257 38.437614]
Event: 4
ROOT::RVec<float>: { 7.17855f, 5.59734f }
numpy.array: [7.17855 5.59734]

@eguiraud

I tried reading the full array with [reader[i] for i in range(reader.GetSize())] and retried the numpy conversion, with the same result.

Indeed I found it weird too, but somehow it knows the size; it’s beyond my knowledge. Each array is in a separate branch and does not belong to objects AFAIK, so I hope they are contiguous in memory.

How can I do that?

@swunsch I guess that your solution would become unbearably slow if run entry-by-entry on a large range of events. Or not?
For the record: I really like your RDataFrame implementation, but I’m trying to get a simple framework for my student, and I fear that RDataFrame would become overly complicated when it comes down to computing all the 4-jet permutations (and related features) of an N-jet collection, with scale factors taken from histograms.

Thanks!

Mauro

I tried reading the full array with [reader[i] for i in range(reader.GetSize())] and retried the numpy conversion, with the same result.

Just one preemptive GetSize() should be enough to trigger loading of the whole array.

You can use ROOT.AddressOf.

A note: according to the numpy.frombuffer doc the first parameter has to expose python’s buffer interface. Are you sure that PyROOT transforms the value returned by TTreeReaderArray::GetAddress to something with the proper behavior?

I’m sorry that I only have more questions rather than an explanation; hopefully, by highlighting the grey areas of that approach, you will be able to figure out what breaks.

If the dataset fits in memory, or if it fits in memory after some preliminary cuts, @swunsch’s approach is the easiest and it has good performance. Otherwise you can use RDataFrame::AsNumpy to e.g. load the contents of one file at a time into memory.

Cheers,
Enrico

I also had a look at the numpy.frombuffer doc, and I agree that I doubt numpy is actually doing the right thing. But what you can try is adding an array interface dictionary to the object and passing it to numpy.asarray (the numpy.array constructor should work as well). The long version is described here; in short:

  1. Take the Python object x
  2. Add the attribute x.__array_interface__ = {"data": (<pointer as long>, False), "typestr": "<see doc>", "shape": (<the size>,), "version": 3} (no guarantee I’m not missing something here)
  3. Make the zero-copy conversion with y = np.asarray(x)

Here is a minimal reproducer: python -c "import numpy; print(numpy.array([1, 2, 3]).__array_interface__)"
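A minimal sketch of steps 1–3, with a plain ctypes buffer standing in for the TTreeReaderArray contents (the names here are illustrative, not PyROOT API):

```python
import ctypes
import numpy as np

# A 3-element float32 buffer standing in for the array contents.
buf = (ctypes.c_float * 3)(1.0, 2.0, 3.0)

class Adopted:
    """Plain holder object to which we attach the array interface."""
    pass

x = Adopted()
x.__array_interface__ = {
    "data": (ctypes.addressof(buf), False),  # (pointer, read-only flag)
    "typestr": "<f4",                        # little-endian float32
    "shape": (3,),
    "version": 3,
}

# Zero-copy adoption: y is a view on buf's memory.
y = np.asarray(x)
print(y)     # [1. 2. 3.]
buf[0] = 42.0
print(y[0])  # 42.0 -- the view sees the change
```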

But as said, if your data fits in memory, RDataFrame.AsNumpy basically does this for you, with multi-threaded read-out and all RDataFrame features available (before you dump the data to memory, of course).

So, I checked: PyROOT converts pointers to python buffer objects of 1 element, but there is a SetSize method to change the buffer size.

@mverzett1 maybe, if all array elements are contiguous in memory, you can do

size = arr.GetSize() # force loading of array contents for this entry
buf = arr.GetAddress() # still not sure this is always equivalent to `&arr[0]`
buf.SetSize(size)
np.frombuffer(buf, count=size, dtype='float32')

@eguiraud indeed that works!
Thanks!

Mauro
