RDataFrame - Read 3D array from TTree branch

Is it possible to use RDataFrame in PyROOT to read 3D (or higher-dimensional) arrays from a TTree branch?

I know I can write 3D arrays -

#include "TFile.h"
#include "TTree.h"
#include "TBranch.h"

void multimulti() {
    TFile *f = new TFile("multimulti.root", "RECREATE");
    f->SetCompressionLevel(0);
    TTree *t = new TTree("sample", "");
    Int_t i1[4][5];
    t->Branch("i1", i1, "i1[4][5]/I");
    for (int i=0; i<2; i++) {
        for (int j=0; j<4; j++) {
            for (int k=0; k<5; k++) {
                i1[j][k] = k;
            }
        }
        t->Fill();
    }
    t->Write();
    f->Close();
    exit(0);
}

and read them back in PyROOT (without RDataFrame) like this -

>>> import ROOT
>>> f = ROOT.TFile.Open("multimulti.root")
>>> tree = f.Get("sample")
>>> import numpy
>>> for x in tree:
...     print(numpy.frombuffer(x.i1, dtype="i4"))
... 
[0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]
[0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]

In general, RDataFrame should be able to read any C++ type. But I wonder how the arrays were saved into the tree. Do you know the type of the objects in the branch? Did you try something like

rdf = rdf.Define(..., "myArray[1][2][3]", ...)

?

If you want to do complicated things with these arrays, you can either write full C++ functions (ROOT.gInterpreter.Declare( ..... ); these are usually quite fast) or you need some more pythonic helpers. For something like this, we should wait for @etejedor or @swunsch to be back from vacation. Maybe they have an idea.

I tried using rdf = ROOT.RDataFrame(tree).AsNumpy()["i1"], but the result was similar to the numpy.frombuffer example in the top post: each entry comes back as a flat sequence of 20 values.

>>> for x in range(2):
...     for y in range(20):
...             print(rdf[x][y], end=" ")
...     print()
... 
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

And when I display the RDataFrame,

>>> rdf = ROOT.RDataFrame(tree)
>>> x = rdf.Display("")
>>> x.Print()
i1  | 
0   | 
... | 
4   | 
0   | 
... | 
4   |

A better way to frame my question: is it possible to get three levels of nesting for my 3D array using RDataFrame?

>>> rdf[0][0][0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not subscriptable

Something like the following does not work:

>>> y = rdf.Define("test", "i1[0][0][0]")
input_line_189:2:13: error: subscripted value is not an array, pointer, or vector
return i1[0][0][0]
       ~~~~~^~
input_line_190:2:13: error: subscripted value is not an array, pointer, or vector
return i1[0][0][0]
       ~~~~~^~
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: Template method resolution failed:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(experimental::basic_string_view<char,char_traits<char> > name, experimental::basic_string_view<char,char_traits<char> > expression) =>
    Exception: Cannot interpret the following expression:
i1[0][0][0]

Make sure it is valid C++. (C++ exception)
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(experimental::basic_string_view<char,char_traits<char> > name, experimental::basic_string_view<char,char_traits<char> > expression) =>
    Exception: Cannot interpret the following expression:
i1[0][0][0]

Make sure it is valid C++. (C++ exception)

Maybe I am doing it wrong?
Using it as a 1D array like this does not throw an error -

>>> y = rdf.Define("test", "i1[0]")
>>> y.Display("").Print()
test | i1  | 
0    | 0   | 
     | ... | 
     | 4   | 
0    | 0   | 
     | ... | 
     | 4   |

Ok, obviously, the branches in this attempt don’t represent a type that can be indexed in multiple dimensions. Can you use TTree::Print() on the input data so we can find out how the arrays were saved? Is this already a 3D array or is it still the 2D from the first example?

Note that it only makes sense to process the tree with RDataFrame if you have to do work with these arrays that happens “inside each event”. If you just want to obtain a large-dimensional array spanning over all events, RDF is probably not the right tool.

Hi,

Just to double check with @pcanal: Philippe, it should be possible to read 3D array branches from Python, shouldn’t it?

Reik, my guess is that if you try to read a 3D array from Python (in the Python loop form, just like you did with the 2D array) you will get all the contents in a flat array that you can then reshape with numpy. But I don’t think you can define a 3D branch with RDataFrame just like you tried to do (@eguiraud can correct me if I am wrong).
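
As a minimal sketch of the reshape mentioned above: the array `flat` below is only a stand-in built with numpy so the snippet runs without ROOT. In a real session it would presumably come from numpy.frombuffer(x.i1, dtype="i4"), as in the top post. The shape (4, 5) is taken from the branch title i1[4][5]/I.

```python
import numpy as np

# Stand-in for one entry of the i1[4][5]/I branch: 20 int32 values in
# row-major order, matching the output printed in the top post.
# (Assumption: in PyROOT this would be numpy.frombuffer(x.i1, dtype="i4").)
flat = np.tile(np.arange(5, dtype="i4"), 4)

# Restore the [4][5] structure of the original C-style array.
i1 = flat.reshape(4, 5)

# Element [j=2][k=3]; the macro fills i1[j][k] = k.
print(i1[2][3])
```

After the reshape, i1[j][k] can be indexed the same way as in the C++ macro that wrote the tree.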


I think my terminology was a bit off. When I kept saying 3D array, I meant a two-dimensional array per event. My apologies.

******************************************************************************
*Tree    :sample    :                                                        *
*Entries :        2 : Total =            1049 bytes  File  Size =       1057 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :i1        : i1[4][5]/I                                             *
*Entries :        2 : Total  Size=        716 bytes  File Size  =        231 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

It is the 2D array from the first example (written using the C++ macro in the top post).

Is it because the array in this case is 2D and I am trying to define a 3D branch? Because defining a 2D array doesn’t work either -

>>> y = rdf.Define("test", "i1[4][5]")
input_line_179:2:13: error: subscripted value is not an array, pointer, or vector
return i1[4][5]
       ~~~~~^~
input_line_180:2:13: error: subscripted value is not an array, pointer, or vector
return i1[4][5]
       ~~~~~^~
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: Template method resolution failed:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(experimental::basic_string_view<char,char_traits<char> > name, experimental::basic_string_view<char,char_traits<char> > expression) =>
    Exception: Cannot interpret the following expression:
i1[4][5]

Make sure it is valid C++. (C++ exception)
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(experimental::basic_string_view<char,char_traits<char> > name, experimental::basic_string_view<char,char_traits<char> > expression) =>
    Exception: Cannot interpret the following expression:
i1[4][5]

Make sure it is valid C++. (C++ exception)

Or is it because redefining multidimensional arrays using RDataFrame is not possible?

Ok, I believe we now have a clearer idea of what’s going on:
When you create a 2D C-style array, it actually gets written into the tree as a 1D array of size n_x*n_y, which is 4*5 = 20 in your case. That’s why you see 20 numbers when you retrieve it from Python with numpy. You have to reshape it to recover the [4][5] structure.

The same thing happens inside RDataFrame nodes (C++): the 2D array comes back as a 1D array. To read element [x][y], you have to index the flat array at
i1[x*5+y] (equivalent of i1[x][y])
where the constant is the size of the rightmost dimension. That’s because multi-dimensional C-style arrays are just unrolled into a long 1D array.
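
The index arithmetic can be sketched in plain Python. The helper name flat_index is hypothetical, and the list `flat` only mimics one event as the macro in the top post fills it. Passing the resulting string to rdf.Define is an assumption based on the working "i1[0]" example earlier in the thread; it is not tested here.

```python
N_ROWS, N_COLS = 4, 5  # dimensions from the branch title i1[4][5]/I


def flat_index(x, y, n_cols=N_COLS):
    """Offset of element [x][y] in a row-major flattened 2D array."""
    return x * n_cols + y


# One event, flattened the way the macro fills it: i1[j][k] = k.
flat = [k for j in range(N_ROWS) for k in range(N_COLS)]

# Element [2][3] sits at offset 2*5 + 3 = 13 and holds the value 3.
value = flat[flat_index(2, 3)]

# Expression string that could be handed to rdf.Define("test", expr)
# (assumption: a single-index expression works as "i1[0]" did above).
expr = f"i1[{flat_index(2, 3)}]"
print(value, expr)
```

For an array with more dimensions the same pattern extends: each index is multiplied by the product of the sizes of all dimensions to its right.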
