Hi
I am using RDataFrame with ROOT v06.16.
I analyse a nested TTree that contains some MC data. Branches contain a single entry for primary particles, (coordinates, length of track, etc.).

For each primary particle the associated daughters are stored in vectors (have different sizes of course) with their properties like the startX coordinate:

The dEdx for each particle is stored for all points on the track, so for a primary particle it is a vector, for a daughter particle it is a matrix vector vector double

Iâ€™d like to create a new branch containing the MEAN dEdx for each daughter, so it should be a vector. The value needs to be computed from their corresponding dEdx vectors in the matrix but I canâ€™t figure out how to navigate through the vector<vector> variable and I always get errors.

Iâ€™d also like to filter for certain dEdx values that are contained in the matrix.

Has someone ever encountered this and has succesfully solved it?

Also a more general question:
RDataFrame looks very appealing to analyse TTreeâ€™s in a very dynamic way. Is it the now recommended framework to perform a data Analysis? If not, which one is better suited? TTreeReader?

Hi Francesca,
what are the errors you are encountering and how does the relevant part of the code look like? (it would be great if you could share a minimal reproducer of the problem).

Iâ€™d definitely recommend RDataFrame for most usecases (full disclosure, I am one of the authors ). Also fyi, RDataFrame leverages TTreeReader internally, it just adds a few features and tries to offer a nicer syntax.

Error in <TTreeReaderArrayBase::CreateContentProxy()>: The branch reco_daughter_dEdX contains data of type vector<double>. It cannot be accessed by a TTreeReaderArray<double>

please find attached the snap.C macro that tries to read the snap_pionana.root with 4 relevant branches. it should reproduce the error when trying to compile and run

I see,
the problem is that allDaughterMean_dEdX is a vector<vector<double>> as you said, but RVec can only â€śhideâ€ť one level of for loops, i.e. RVec<double> reads vector<double> fine, but you need RVec<vector<double>> for a vector<vector<double>>.

Unfortunately that means that youâ€™ll have to write the for loop that evaluates the mean of the vector explicitly.

Something like this should work or at least give you an idea:

#include "TCanvas.h"
#include "TROOT.h"
#include "TGraphErrors.h"
#include "TH1.h"
#include "TLegend.h"
#include "TArrow.h"
#include "TLatex.h"
#include "TMath.h"
#include <ROOT/RDataFrame.hxx>
#include <iostream>
#include <math.h>
#include <string.h>
#include <stdio.h>
//using RDataFrame to cut and analyse PionTtrr
using namespace std;
using namespace ROOT::VecOps;
double meanVectorVector(const ROOT::RVec<vector<double>> &reco_daughter_dEdX){
double mean_sum = 0;
unsigned int mean_count = 0;
for (auto v : reco_daughter_dEdX) {
for (auto e : v) {
mean_sum += e;
++mean_count;
}
}
return mean_sum / mean_count;
}
//***********************
//Main Function
int snap(){
ROOT::RDataFrame frame("myTree", "snap_pionana.root");
TFile *output = new TFile("output.root", "RECREATE");
//Lambda for Mean of the primary Particle dEdX (of type vector<double>), that works well
auto meanPrimarydEdX = [](const ROOT::RVec<double> &dEdX){return Mean(dEdX);};
//Histos
//This one Works:
auto primarydEdX = frame.Define("primMean_dEdX", meanPrimarydEdX, {"dEdX"}).Histo1D("primMean_dEdX");
//This doesn't
auto allDaughterdEdX = frame.Define("allDaughterMean_dEdX",meanVectorVector,{"reco_daughter_dEdX"}).Histo1D("allDaughterMean_dEdX");
primarydEdX->Draw();
primarydEdX->Write();
allDaughterdEdX->Draw();
allDaughterdEdX->Write();
return 0;
}

Actually, I just realised: your macro returns the mean of all the entries in the vector<vector<double>>. I was trying to access the mean of each column (as each column in the vector<vector<double>> is for the corresponding daughter particle. So the function should return a vector with the mean of each column, not just one value.

I was able to change the function correctly and this function does what I was looking for (just posting it for completeness)

std::vector<double> meanVectorVector(const ROOT::RVec<vector<double>> &reco_daughter_dEdX){
std::vector<double> mean;
double mean_sum = 0;
unsigned int mean_count = 0;
for (auto v : reco_daughter_dEdX) {
for (auto e : v) {
mean_sum += e;
++mean_count;
}
mean.push_back(mean_sum/mean_count);
mean_sum = 0;
mean_count = 0;
}
return mean;
}

Cool, and then itâ€™s fine if Histo1D("allDaughterMean_dEdX") is filled with each element of the vector of means right?

Iâ€™m afraid soâ€¦RVec<T> is a nicer vector<T>: if T is itself a vector<U>, suddenly most of RVec's helper functions donâ€™t know what to do: whatâ€™s the mean of a vector of vectors? Turns out that you wanted the mean per column and I didnâ€™t know and even that only makes sense because you know that each sub-vector has the same length, but in general you could have each vector in the vector with a different number of â€ścolumnsâ€ť.

So the typical workflow would be to create a few helper functions that act on vector<vector<T>> and just invoke those as df.Filter(some_vectorvector_op, {"vectorvector_branch"}).Define("means", means_by_column, {"vectorvector_branch"}); etc.

Cheers,
Enrico

P.S.
Posts in the newbie section get deleted after a while. I think this thread might be useful to other users dealing with RDF+vector<vector>. Can I move it outside of the newbie section?

Cool, and then itâ€™s fine if Histo1D("allDaughterMean_dEdX") is filled with each element of the vector of means right?

yes exactly I wanted a distribution of those mean values per column

Turns out that you wanted the mean per column and I didnâ€™t know and even that only makes sense because you know that each sub-vector has the same length, but in general you could have each vector in the vector with a different number of â€ścolumnsâ€ť.

actually my column-length and amount of columns can vary (I realised now) so to be sure Iâ€™ll every time call the vector.size() to compute the mean. thanks for the heads up there

I assume that calling specifically these for-loops will increase computing time, right? I hope not significantly, is there no Object in root like RMatrix, that could have these things inside? Maybe it could make sense if looking at TTrees that are double nested like the one I use. just wondering.

Feel free to move this topic where it fits, thank you for your helpful recommendations on the workflow, this will help a lot.

Not at all! At some point there was going to be a loop anyway, also within RVec functions (they just hide it). Hot for loops are a typical performance bottleneck in python, but in C++ they are the one way to run over a vector. In fact, avoiding to write for loops explicitly in python is faster precisely because that way the looping will be pushed down to some low-level C code, which loops fast(er than native python).

You could absolutely store the values in a matrix the moment you create your input file, and then use RDF to process the matrices. ROOT has TMatrix, for example. But if these vector<vector<T>> are not rectangular (each row might have a different length) then you might want to create your own class to represent the data the way you want. Whatever you do RDF should be able to read the data back.