RDataFrame -> AsNumpy as jagged arrays

mivanov · February 17, 2022, 11:22am

Hello

I would like to use something like scikit-hep jagged array to retrieve information from the ROOT trees → dictionary-of-flat numpy-arrays.
E.g.:

event entry with dynamic array tracks with parameter attributes
track entry with array of clusters (position charge)
In many use cases we need some non-trivial selection criteria, so I can not use scikit-hep directly.
E.g:

In [59]: file["seed1"][b"seed.fParamArrayOut.fData"].keys()                                                                                                                                             
Out[59]: 
[b'seed.fParamArrayOut.fData.fUniqueID',
 b'seed.fParamArrayOut.fData.fBits',
 b'seed.fParamArrayOut.fData.fX',
 b'seed.fParamArrayOut.fData.fAlpha',
 b'seed.fParamArrayOut.fData.fP[5]',
 b'seed.fParamArrayOut.fData.fC[15]']

Recently I saw a news item about a change to RVec in the next version of ROOT. Is this the development in that direction?

https://github.com/scikit-hep/awkward-1.0/issues/588

Warning in <TStreamerInfo::Build>: Due to some major, backward-incompatible improvements planned for ROOT::RVec, direct I/O of ROOT::RVec objects will break between v6.24 and v6.26.
Please use std::vectors instead. See the release notes of v6.24 for more information.

For the moment investigating 3 options:

uproot
- restricted non-derived function
old ROOT with GetVal - generic - not custom code needed
- works with some problems
RDataFrame
- not clear how to achieve jagged arrays

Old root

Sometimes I need to access some derived variables. I have tried the old root interface and also RDataFrame.
With the old interface ROOT I used simple queries and later I used GetVal to retrieve the data in a contiguous array.
I created a wrapper that converts such queries into the dictionary of Numpy arrays.
For example in RootInteractive wrapper:

   def tree2Panda(tree, include, selection, **kwargs)
    entries = tree.Draw(str(variables), selection, "goffpara", options["nEntries"], options["firstEntry"])  # query data
    for i, a in enumerate(columns):
        val = tree.GetVal(i)
        ex_dict[a] = np.frombuffer(val, dtype=float, count=entries)

Works quite well, but unfortunately with some limitations:

TTree::GetVal does not work when I use TEventList or TEntryList.

RDataFrame

Using the RDataFrame AsNumpy interface to create a dictionary of Numpy array does not give us what we hope for.
It returns an array of objects (RVec in the above case).
To create a jagged version like in the old root, I guess I’ll have to write my own macro.

Could we use RDataFrameCache for this purpose?
Or is it possible to get a flat version already with the current root version 6.24.
From what I have read and tried in the tutorials, it seems impossible. Is there an implementation planned in the future?

Regards
Marian

Please read tips for efficient and successful posting and posting code

ROOT Version: v6.24.06
Platform: all
Compiler: all

eguiraud · February 18, 2022, 1:40pm

Hi @mivanov ,
RDataFrame currently cannot export data in jagged array format. As you saw, with AsNumpy you will get a NumPy array of objects instead. You can then convert those NumPy arrays of objects to the format you need – in the worst case it will cost a copy. You can also do this from C++, extracting the data you need from RDataFrame with Take calls (you will get std::vectors of RVecs or RVecs of RVecs) and then using awkward array’s C++ API to construct awkward-arrays from those. It should then be possible to expose those C++ awkward arrays to Python code, but you would have to ask the awkward-array devs on the best way to do that.

It is certainly desirable to have import/export to/from awkward-arrays, and the relevant issue is the one you found, RDataFrame integration · Issue #588 · scikit-hep/awkward · GitHub – however at the moment we do not have resources to implement this in ROOT.

@pcanal can you confirm/deny/help?

I am not sure I get this, I thought you wanted jagged arrays out, not flat? If you want flat arrays you can use a Reduce action to concatenate all arrays from all the events into a single large array, or implement your own helper for this purpose with Book.

I hope this helps!
Cheers,
Enrico

mivanov · February 18, 2022, 2:11pm

@pcanal, TTree::GetVal does not work when I use TEventList or TEntryList.

By running the Python debugger in my wrapper, I figured out the way.
Below is a simple code that reproduces the problem with GetVal and GetEstimate.
In my wrapper I used the switch “goffpara”, normally this works.

Observations:
0.) When using goffpara the query works as desired
1.)When using goffpara with EventList, the query sets the buffer to size 1
2.)When using goffpara with EventList

In [20]:     nEstimate=1000 
    ...:     ROOT.treeSeedsRef.SetEstimate(nEstimate) 
    ...:     ROOT.treeSeedsRef.Draw("gyCluster:gxCluster:gzCluster","1","goff") 
    ...:     print(f"0.)\t{ROOT.treeSeedsRef.GetEstimate()}") # this return nEstimate(1000) as it should 
    ...:     # 
    ...:     ROOT.treeSeedsRef.SetEstimate(1000000) 
    ...:     ROOT.treeSeedsRef.Draw("gyCluster:gxCluster:gzCluster","1","goff para") 
    ...:     print(f"1.)\t{ROOT.treeSeedsRef.GetEstimate()}")  # this returns 1 instead of original estimate, in case EventList set Drawing with goff para option reset buffer 
    ...:     # 
    ...:     ROOT.treeSeedsRef.SetEstimate(nEstimate) 
    ...:     ROOT.treeSeedsRef.SetEventList(0); 
    ...:     ROOT.treeSeedsRef.Draw("gyCluster:gxCluster:gzCluster","1","goff para") 
    ...:     print(f"2.)\t{ROOT.treeSeedsRef.GetEstimate()}") # this returns  nEstimate as it should                                                                                                    
0.)     1000
1.)     1
2.)     1000

pcanal · February 18, 2022, 5:34pm

It is supposed to.

Observations:
0.) When using goffpara the query works as desired
1.)When using goffpara with EventList, the query sets the buffer to size 1
2.)When using goffpara with EventList

I can not reproduce it (with hsimple.root and some arbitrary cutoff). @couet Any idea why the “para” option would change the estimate?

mivanov · February 18, 2022, 6:14pm

@pcanal and @couet
TTree::GetVal does not work when I use TEventList or TEntryList.

In my use case I am using a tree with tracks with RVec of clusters. I assume your hsimple.root uses a simple tree without dynamic arrays. Can I post my tree somewhere, or do you have another test tree with the dynamic arrays RVec’s?

pcanal · February 18, 2022, 7:06pm

Giving us access to your file is the most sure-fire way to get us to reproduce the problem

sbinet · February 19, 2022, 10:14am

it seems to me trying to shoe horn jagged arrays into numpy ones is a rather counter-productive endeavour: numpy arrays weren’t really designed to handle that kind of use-case (at least, not in a very efficient fashion).

probably a better library for this would be Apache Arrow (which has, of course, python bindings).
and there is an attempt at providing a “universal” ROOT <-> Arrow converter:

binaries are there:

https://go-hep.org/dist

hth,
-s

mivanov · February 19, 2022, 1:03pm

@sbinet probably a better library for this would be Apache Arrow 1 (which has, of course, python bindings). and there is an attempt at providing a “universal” ROOT ↔ Arrow converter:

Thank you for the suggestion. Unfortunately, such a tool only covers part of our use case. I will quote: RDataFrame integration · Issue #588 · scikit-hep/awkward · GitHub. For me mostly point 3 is important:

Such a thing would be useful because physicists could then mix analyses using Awkward Array, Numba, and ROOT C++ without leaving their environment. The benefits compound:

Data that are too complex to read from Uproot (efficiently or at all) can be loaded using MakeRootDataFrame and dumped into an Awkward Array.

Arbitrarily complex Awkward Arrays can be written to ROOT files by dumping the Awkward Arrays into an RDataFrame and taking a Snapshot.

Users can use ROOT C++ functions in an otherwise Awkward analysis at full speed.

@sbonet But I am considering using the tool you suggest for simpler cases. However, when I try the tool with the data described above, it looks like I am either not using the tool correctly or the tool is not able to convert “arbitrary” trees.

miranov@lxir128:/lustre/alice/users/miranov/NOTES/alice-tpc-notes2/JIRA/ATO-577/testArrow$ ~/bin/go-hep/root2arrow-linux_amd64.exe -o foo.data -t seed1 trackingDebug.root       
panic: rarrow: unsupported Go type *struct { ROOT_TMatrixTBase_double_ struct { ROOT_TObject rbase.Object "groot:\"TObject\""; ROOT_fNrows int32 "groot:\"fNrows\""; ROOT_fNcols int32 "groot:\"fNcols\"
"; ROOT_fRowLwb int32 "groot:\"fRowLwb\""; ROOT_fColLwb int32 "groot:\"fColLwb\""; ROOT_fNelems int32 "groot:\"fNelems\""; ROOT_fNrowIndex int32 "groot:\"fNrowIndex\""; ROOT_fTol float64 "groot:\"fTol
\"" } "groot:\"TMatrixTBase<double>\""; ROOT_fElements []float64 "groot:\"fElements[fNelems] elements themselves\"" }

mivanov · February 19, 2022, 1:12pm

@pcanal, @sbinet: Giving us access to your file is the most sure-fire way to get us to reproduce the problem

I posted the file on eos. I hope you can access it:

-rw-r--r--. 1 mivanov z2 987015733 19. Feb 14:09 /eos/user/m/mivanov/www/root-forum/rdataframe-asnumpy-as-jagged-arrays/trackingDebug.root

binet · February 19, 2022, 5:17pm

I am afraid I can’t access it.

couet · February 21, 2022, 8:14am

Sorry for the late reply.
“goffpara” seems me is not appropriate. “goff” means “no graphics output” and “para” means “produce a parallel coordinate plot”. The “para” option is purely graphics. So you are asking two incompatible options. I guess you should use “goff” only.

mivanov · February 21, 2022, 3:20pm

@couet “goffpara” seems me is not appropriate. “goff” means “no graphics output” and “para” means

In the past there were limitations for the number of variables. “goff para” was needed, now I see in the tutorial it is not needed anymore.
https://root.cern.ch/doc/v608/treegetval_8C.html

couet · February 21, 2022, 3:32pm

Yes it is better to not use “PARA” in that case. You do not produce graphics.

mivanov · February 21, 2022, 3:35pm

@sbinet [quote=“mivanov, post:9, topic:48835”]
/eos/user/m/mivanov/www/root-forum/rdataframe-asnumpy-as-jagged-arrays/trackingDebug.root
[/quote]

Indeed. Now I copied the file to my afs public. At minimum from my independent service account, I can access it.

[tpcdrop@lxplus8s06 ~]$ ls -alrt /afs/cern.ch/user/m/mivanov/public/root-forum/rdataframe-asnumpy-as-jagged-arrays/trackingDebug.root
-rw-r–r–. 1 mivanov z2 987015733 21. Feb 16:31 /afs/cern.ch/user/m/mivanov/public/root-forum/rdataframe-asnumpy-as-jagged-arrays/trackingDebug.root

mivanov · February 21, 2022, 3:45pm

https://cernbox.cern.ch/index.php/apps/files/?dir=/__myshares/www%20(id%3A437941)/root-forum/rdataframe-asnumpy-as-jagged-arrays&

pcanal · March 3, 2022, 9:42pm

I got the file but I can’t find the tree that has the branch gyCluster etc.
I.e. can you give us a script that reproduce the problem (with the EntryList and SetEstimate)

mivanov · March 6, 2022, 7:45pm

@pcanal

The code is highly experiment-dependent. So I can not just extract the smallest piece to reproduce the problem. Without loading the compiled macro, the code fails even before that.
https://gitlab.cern.ch/alice-tpc-offline/alice-tpc-notes/-/blob/master/JIRA/ATO-432/code/monoAnalysis.cxx#L1221
If I load the macro, the simplest code to reproduce the problem is here:

/*
  this is the simplest macro to reproduce the bug reported in the https://root-forum.cern.ch/t/rdataframe-asnumpy-as-jagged-arrays/48835/2
*/

void reproduceBug(){
  //  .L $NOTES/JIRA/ATO-432/code/monoAnalysis.cxx+g
  TFile *f= TFile::Open("dirClusters000/trackingDebug.root");
  TTree * treeSeedsRef = (TTree*)f->Get("seed1");
  //
  treeSeedsRef->SetAlias("gyCluster","sin(seed.fClustersTrack.fData.fY)*seed.fClustersTrack.fData.fX");
  treeSeedsRef->SetAlias("gxCluster","cos(seed.fClustersTrack.fData.fY)*seed.fClustersTrack.fData.fX");
  treeSeedsRef->SetAlias("gzCluster","seed.fClustersTrack.fData.fZ");
  // set event list
  treeSeedsRef->Draw(">>elistSelected","Max$(seed.fToTRange)>2","EventList");
  TEventList*elistSelected= (TEventList*)gDirectory->Get("elistSelected");
  treeSeedsRef->SetEventList(elistSelected);
  //
  treeSeedsRef->SetEstimate(1000000);
  treeSeedsRef->Draw("gyCluster:gxCluster:gzCluster","1","goff para");
  printf("%lld\n",treeSeedsRef->GetEstimate());
}

If you are interested, I can try to provide you with a singularity container that you can check on lxplus8.

mivanov · March 6, 2022, 7:53pm

@pcanal - at the moment I have started using RDataFrame to achieve similar functionality … but it needs more coding, respectively I am waiting for implementation of “awkward” <-> RDataFrame https://github.com/scikit-hep/awkward-1.0/pull/1295

system · March 20, 2022, 7:53pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.