Using RDataFrame for vector<vector<double>> Branches

fstocker · September 10, 2019, 9:03am

Hi
I am using RDataFrame with ROOT v06.16.
I analyse a nested TTree that contains some MC data. Branches contain a single entry for primary particles, (coordinates, length of track, etc.).

*............................................................................*
*Br   64 :true_beam_Start_X : true_beam_Start_X/D                            *
*Entries :    25646 : Total  Size=     531994 bytes  File Size  =     467493 *
*Baskets :     3051 : Basket Size=      32000 bytes  Compression=   1.01     *
*............................................................................*

For each primary particle the associated daughters are stored in vectors (have different sizes of course) with their properties like the startX coordinate:

*............................................................................*
*Br   52 :reco_daughter_startX : vector<double>                       *
*Entries :    25646 : Total  Size=     823016 bytes  File Size  =     563266 *
*Baskets :     3051 : Basket Size=      32000 bytes  Compression=   1.35     *
*............................................................................*

The dEdx for each particle is stored for all points on the track, so for a primary particle it is a vector, for a daughter particle it is a matrix vector vector double

*............................................................................*
*Br   39 :reco_daughter_dEdX : vector<vector<double> >                       *
*Entries :    25646 : Total  Size=    5882753 bytes  File Size  =    3537259 *
*Baskets :     3052 : Basket Size=      32000 bytes  Compression=   1.65     *
*............................................................................*

Now:

I’d like to create a new branch containing the MEAN dEdx for each daughter, so it should be a vector. The value needs to be computed from their corresponding dEdx vectors in the matrix but I can’t figure out how to navigate through the vector<vector> variable and I always get errors.
I’d also like to filter for certain dEdx values that are contained in the matrix.

Has someone ever encountered this and has succesfully solved it?

Also a more general question:
RDataFrame looks very appealing to analyse TTree’s in a very dynamic way. Is it the now recommended framework to perform a data Analysis? If not, which one is better suited? TTreeReader?

Thanks a lot!

eguiraud · September 10, 2019, 9:35am

Hi Francesca,
what are the errors you are encountering and how does the relevant part of the code look like? (it would be great if you could share a minimal reproducer of the problem).

I’d definitely recommend RDataFrame for most usecases (full disclosure, I am one of the authors ). Also fyi, RDataFrame leverages TTreeReader internally, it just adds a few features and tries to offer a nicer syntax.

Cheers,
Enrico

fstocker · September 10, 2019, 2:01pm

Hi Enrico,
thanks for the quick reply.

Here is the error I get:

Error in <TTreeReaderArrayBase::CreateContentProxy()>: The branch reco_daughter_dEdX contains data of type vector<double>. It cannot be accessed by a TTreeReaderArray<double>

please find attached the snap.C macro that tries to read the snap_pionana.root with 4 relevant branches. it should reproduce the error when trying to compile and run

thank you

snap_pionana.root (63.7 KB)

snap.C (1.3 KB)

eguiraud · September 10, 2019, 2:20pm

I see,
the problem is that allDaughterMean_dEdX is a vector<vector<double>> as you said, but RVec can only “hide” one level of for loops, i.e. RVec<double> reads vector<double> fine, but you need RVec<vector<double>> for a vector<vector<double>>.

Unfortunately that means that you’ll have to write the for loop that evaluates the mean of the vector explicitly.

Something like this should work or at least give you an idea:

#include "TCanvas.h"
#include "TROOT.h"
#include "TGraphErrors.h"
#include "TH1.h"
#include "TLegend.h"
#include "TArrow.h"
#include "TLatex.h"
#include "TMath.h"
#include <ROOT/RDataFrame.hxx>


#include <iostream>
#include <math.h>
#include <string.h>
#include <stdio.h>

//using RDataFrame to cut and analyse PionTtrr

using namespace std;
using namespace ROOT::VecOps;

double meanVectorVector(const ROOT::RVec<vector<double>> &reco_daughter_dEdX){
   double mean_sum = 0;
   unsigned int mean_count = 0;
   for (auto v : reco_daughter_dEdX) {
      for (auto e : v) {
         mean_sum += e;
         ++mean_count;
      }
   }
   return mean_sum / mean_count;
}

//***********************
//Main Function
int snap(){
   ROOT::RDataFrame frame("myTree", "snap_pionana.root");
   TFile *output = new TFile("output.root", "RECREATE");

   //Lambda for Mean of the primary Particle dEdX (of type vector<double>), that works well
   auto meanPrimarydEdX = [](const ROOT::RVec<double> &dEdX){return Mean(dEdX);};

   //Histos
   //This one Works:
   auto primarydEdX = frame.Define("primMean_dEdX", meanPrimarydEdX, {"dEdX"}).Histo1D("primMean_dEdX");

   
   //This doesn't
   auto allDaughterdEdX = frame.Define("allDaughterMean_dEdX",meanVectorVector,{"reco_daughter_dEdX"}).Histo1D("allDaughterMean_dEdX");

   primarydEdX->Draw();
   primarydEdX->Write();

   allDaughterdEdX->Draw();
   allDaughterdEdX->Write();

   return 0;
}

Hope this helps!
Enrico

fstocker · September 10, 2019, 3:07pm

thank you Enrico for your quick help, this works.
I didn’t realise that with a vector<vector<double>> I will have to do the for loop still.

So also if I’d like to filter for a specific value inside one of the vector<vector<double>> I will have to go through a custom for loop?

Thanks!

fstocker · September 10, 2019, 3:35pm

Actually, I just realised: your macro returns the mean of all the entries in the vector<vector<double>>. I was trying to access the mean of each column (as each column in the vector<vector<double>> is for the corresponding daughter particle. So the function should return a vector with the mean of each column, not just one value.

I was able to change the function correctly and this function does what I was looking for (just posting it for completeness)

std::vector<double> meanVectorVector(const ROOT::RVec<vector<double>> &reco_daughter_dEdX){
   std::vector<double> mean;
   double mean_sum = 0;
   unsigned int mean_count = 0;
   for (auto v : reco_daughter_dEdX) {
      for (auto e : v) {
         mean_sum += e;
         ++mean_count;
      }
      mean.push_back(mean_sum/mean_count);
      mean_sum = 0;
      mean_count = 0;
   }
   return mean;
}

eguiraud · September 10, 2019, 4:54pm

Cool, and then it’s fine if Histo1D("allDaughterMean_dEdX") is filled with each element of the vector of means right?

I’m afraid so…RVec<T> is a nicer vector<T>: if T is itself a vector<U>, suddenly most of RVec’s helper functions don’t know what to do: what’s the mean of a vector of vectors? Turns out that you wanted the mean per column and I didn’t know and even that only makes sense because you know that each sub-vector has the same length, but in general you could have each vector in the vector with a different number of “columns”.

So the typical workflow would be to create a few helper functions that act on vector<vector<T>> and just invoke those as df.Filter(some_vectorvector_op, {"vectorvector_branch"}).Define("means", means_by_column, {"vectorvector_branch"}); etc.

Cheers,
Enrico

P.S.
Posts in the newbie section get deleted after a while. I think this thread might be useful to other users dealing with RDF+vector<vector>. Can I move it outside of the newbie section?

fstocker · September 10, 2019, 7:14pm

Cool, and then it’s fine if Histo1D("allDaughterMean_dEdX") is filled with each element of the vector of means right?

yes exactly I wanted a distribution of those mean values per column

Turns out that you wanted the mean per column and I didn’t know and even that only makes sense because you know that each sub-vector has the same length, but in general you could have each vector in the vector with a different number of “columns”.

actually my column-length and amount of columns can vary (I realised now) so to be sure I’ll every time call the vector.size() to compute the mean. thanks for the heads up there

I assume that calling specifically these for-loops will increase computing time, right? I hope not significantly, is there no Object in root like RMatrix, that could have these things inside? Maybe it could make sense if looking at TTrees that are double nested like the one I use. just wondering.

Feel free to move this topic where it fits, thank you for your helpful recommendations on the workflow, this will help a lot.

Cheers
francesca

eguiraud · September 10, 2019, 7:23pm

Not at all! At some point there was going to be a loop anyway, also within RVec functions (they just hide it). Hot for loops are a typical performance bottleneck in python, but in C++ they are the one way to run over a vector. In fact, avoiding to write for loops explicitly in python is faster precisely because that way the looping will be pushed down to some low-level C code, which loops fast(er than native python).

You could absolutely store the values in a matrix the moment you create your input file, and then use RDF to process the matrices. ROOT has TMatrix, for example. But if these vector<vector<T>> are not rectangular (each row might have a different length) then you might want to create your own class to represent the data the way you want. Whatever you do RDF should be able to read the data back.

Cheers,
Enrico

fstocker · September 10, 2019, 7:53pm

thanks for your explanations! I see now more how I could tweak or adjust things to my custom needs.

best
Francesca

system · September 24, 2019, 8:05pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.