vector<vector<Struct>> and RDataFrame

Hi there,

We have a ROOT TTree with branches that are vector<vector<Struct>> and we would like to use RDataFrames to analyze them.

I have attached a minimum example in SimpleNtuple.C that uses a two-field struct defined in SimpleInfo.hh. The macro creates a tree with branches of various dimensions:

  • a single SimpleInfo object.
  • a vector of SimpleInfo objects
  • a vector of vector of ints
  • a vector of vector of SimpleInfo objects

and I run into problems when I try to display contents from the last of these.

I can print the branch itself:

+-----+----------------------------------------+
| Row | vector_vector_obj                      |
+-----+----------------------------------------+
| 0   | { @0xa9931f0, @0xa9931f8, @0xa993200 } |
|     | { @0xc5ec7a0, @0xc5ec7a8, @0xc5ec7b0 } |
+-----+----------------------------------------+

but when I try to e.g. look at the first vector, I run into issues.

I’ve tried:

  • subscripting: df.Display({"vector_vector_obj[0]"});,
  • asking for an exact field: df.Display({"vector_vector_obj[0].pdg"});, and
  • creating a new column: auto df2 = df.Define("vector_vector_obj_0", getElement, {"vector_vector_obj"}) and then trying to display the new column with df2.Display({"vector_vector_obj_0.pdg"});
    • getElement is defined as:
ROOT::VecOps::RVec<SimpleInfo> getElement(const std::vector<std::vector<SimpleInfo>>& in) {
  return in[0];
}

All of these give some variation of the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Column "vector_vector_obj[0]" is not in a dataset and is not a custom column been defined.

Any help would be much appreciated!

Thanks,
Andy
SimpleInfo.hh (111 Bytes)
SimpleNtuple.C (3.3 KB)


ROOT Version: 6.32.06 and 6.30.04
Platform: linuxx8664gcc
Compiler: g++ (Spack GCC) 13.3.0


Hi Andy,
Thank you for your post.
@vpadulan or @mczurylo, could you please have a look?

Dear @aedmonds ,

Thanks for reaching out to the forum! I am sorry that the interface was not immediately clear. You had the right intuition when you tried to create a new column with Define in the third bullet point of your post. I believe the source of confusion is that you are trying to see the content of a specific data member or collection element, whereas Display is an operation that works on whole columns, as per the docs. That is what the error about Column "vector_vector_obj[0]" is telling you: there is no such column as vector_vector_obj[0], the column is just vector_vector_obj. When you Defined vector_vector_obj_0 you removed one layer of nesting, so the first element of the collection at each event is now available as a column. You then hit the same problem by trying to Display vector_vector_obj_0.pdg: there is no such column, the column is vector_vector_obj_0.

This behaviour of Display (and of other parts of the API) is deliberate. If we allowed arbitrary expressions in such actions, there would be no way to stop much more expensive things from happening, e.g. imagine a call such as Display({"myexpensivecomputation(vector_vector_obj[0])"}).

I believe in your case for fast prototyping the simplest and easiest form to see the pdg of each first vector element is

auto pdgs = df.Define("pdgs", "vector_vector_obj[0].pdg").Take<int>("pdgs");
// pdgs is a ROOT::RDF::RResultPtr<std::vector<int>>; dereference it (*pdgs) to get the std::vector<int>

Cheers,
Vincenzo

Thanks for the explanation, Vincenzo. I understand a bit better now! When I print the column names I get:

pdg
single_obj
single_obj.pdg
single_obj.time
time
vector_obj
vector_obj.pdg
vector_obj.time
vector_vector_int
vector_vector_obj

so I can see why I can’t Display vector_vector_obj.pdg but I can Display vector_obj.pdg

However, when I define vector_vector_obj_0 (i.e. I am taking out one layer of nesting), I see:

vector_obj: type = ROOT::VecOps::RVec<SimpleInfo>
vector_obj.pdg: type = ROOT::VecOps::RVec<Int_t>
vector_obj.time: type = ROOT::VecOps::RVec<Float_t>
...
vector_vector_obj_0: type = ROOT::VecOps::RVec<SimpleInfo>

so vector_obj and vector_vector_obj_0 have the same type but I don’t see columns for vector_vector_obj_0.pdg and .time. Is there a way to get those defined?

By the way, I can’t get your suggestion to work, here are the variations I’ve tried:

  • auto pdgs = df.Define("pdgs", "vector_vector_obj[0].pdg").Take<int>("pdgs");
  • auto pdgs = df.Define("pdgs", "vector_vector_obj_0[0].pdg").Take<int>("pdgs");
  • auto pdgs = df.Define("pdgs", "vector_vector_obj_0[0].pdg").Take<ROOT::RVec<int>>("pdgs");

and they all produce some variation of the error message: error: no member named 'var1' in 'std::vector<SimpleInfo, std::allocator<SimpleInfo> >'

Thanks,
Andy

Dear @aedmonds ,

I probably misunderstood your dataset schema at first. I have now checked in your SimpleNtuple.C macro that vector_vector_obj is a column of type std::vector<std::vector<SimpleInfo>>. Thus, vector_vector_obj[0] is a std::vector<SimpleInfo>, which cannot have a pdg data member. You need the extra level of indexing, vector_vector_obj[0][0], to get a single SimpleInfo object on which to access pdg. In the end, you are the one who knows your dataset schema best. A simple adaptation of my code from above should give you what you need; let me know how that goes.
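
To make the two levels of indexing concrete, here is a minimal sketch in plain C++ (no ROOT needed to compile it). The SimpleInfo layout is assumed from the two fields mentioned in this thread, and pdgAt and exampleUsage are hypothetical names introduced only for illustration. The helper also guards against events where either vector is too short, which an unguarded [0][0] would not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: the two fields (pdg, time) that appear in this thread.
struct SimpleInfo {
  int pdg = 0;
  float time = 0;
};

using Nested = std::vector<std::vector<SimpleInfo>>;

// Hypothetical helper (not part of ROOT): pdg of in[outer][inner],
// or `fallback` when either index is out of range.
int pdgAt(const Nested& in, std::size_t outer, std::size_t inner, int fallback = 0) {
  if (outer >= in.size() || inner >= in[outer].size()) {
    return fallback;
  }
  return in[outer][inner].pdg;
}

// Small usage example with made-up values.
void exampleUsage() {
  SimpleInfo electron;
  electron.pdg = 11;
  Nested event(1);               // one inner vector...
  event[0].push_back(electron);  // ...holding one SimpleInfo
  assert(pdgAt(event, 0, 0) == 11);      // two subscripts reach the object
  assert(pdgAt(event, 5, 0, -1) == -1);  // out of range: fallback is returned
}
```

A function of this shape could presumably be wrapped in a lambda and passed to Define with {"vector_vector_obj"} as the input column, in the same way as getElement earlier in the thread; that part is not tested here.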

Cheers,
Vincenzo

Thanks, Vincenzo. I did eventually get something working but it requires me to write a ROOT::RVec<int> getPdgs(const ROOT::RVec<SimpleInfo>& in) function to create a column of type ROOT::VecOps::RVec<Int_t>. I would have to write these for all member variables in SimpleInfo, which seems inefficient.

I guess my question boils down to: why can RDataFrame read in a branch of type ROOT::VecOps::RVec<SimpleInfo> and create the corresponding columns for each field (e.g. vector_obj.pdg: type = ROOT::VecOps::RVec<Int_t>), but when I Define a new branch of the type ROOT::VecOps::RVec<SimpleInfo> it doesn’t do this?

Thanks,
Andy

Hello @aedmonds,
just to let you know, Vincenzo is absent this week, so it’s best to ping us next week about this.

Dear @aedmonds ,

The difference is that the RVec<SimpleInfo> created from the branch comes from a TTree branch whose data members can be read independently, and RDF can detect that. Thus each data member is exposed automatically. When you Define a new column in the RDataFrame, it is not a branch of the input TTree, it is just another virtual column sitting in memory. As such, there is no infrastructure for RDF to detect that your SimpleInfo, which could really be any custom class, has data members that can be split and read independently; it is just one single object per collection value.
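
Since a Defined column is not split into per-member columns, one way to avoid writing a separate getter for every field is a single generic projection helper that takes a pointer to the data member. The following is a minimal sketch in plain C++ using std::vector. The names project and exampleUsage are hypothetical, and the SimpleInfo layout is assumed from the fields mentioned in this thread.

```cpp
#include <cassert>
#include <vector>

// Assumed layout: the two fields (pdg, time) that appear in this thread.
struct SimpleInfo {
  int pdg = 0;
  float time = 0;
};

// Hypothetical generic helper (not part of ROOT): copy one data member
// out of every element of the collection.
template <typename T, typename M>
std::vector<M> project(const std::vector<T>& in, M T::* member) {
  std::vector<M> out;
  out.reserve(in.size());
  for (const T& element : in) {
    out.push_back(element.*member);
  }
  return out;
}

// Small usage example with made-up values.
void exampleUsage() {
  std::vector<SimpleInfo> infos(2);
  infos[0].pdg = 11;
  infos[1].pdg = 13;
  const std::vector<int> pdgs = project(infos, &SimpleInfo::pdg);
  assert(pdgs.size() == 2);
  assert(pdgs[0] == 11 && pdgs[1] == 13);
}
```

The same helper serves every member (project(infos, &SimpleInfo::time) gives a std::vector<float>), so only one function is needed per collection type rather than one per field. For RVec inputs, ROOT::VecOps::Map with a small lambda might play a similar role; that variant is not shown or tested here.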

Cheers,
Vincenzo

Thanks for the explanation, Vincenzo. So in the first case, RDF is seeing an actual branch, but in the second it is basically just seeing a memory address. We might need to revisit our ntuple structure to make it easier for our analyzers…

We could replace our vector<vector<Struct>> branches with vector<Struct> branches in our ntuple but we would still need a way to refer to elements in the new branches from other branches in the ntuple (e.g. the hits that go with a single track).

I understand that the nanoAOD format uses “Idx” branches for this so I’m trying to test something similar. However, I get a bad_alloc exception (that I can’t get a backtrace of) when I try to Display the vector_obj.idxs column. I’ve attached a simple script. Would you mind taking a look and seeing if there’s something obvious I’m doing wrong?
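
For reference, the index-based idea can be sketched in plain C++ as follows. This is only an illustration under assumptions: the names Flattened, hits, firstIdx, count, flatten and groupOf are made up here and are not taken from nanoAOD or from the attached script. Unlike the attached script, the indices are kept in separate flat vectors next to the collection rather than inside SimpleInfo.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: the two fields (pdg, time) that appear in this thread.
struct SimpleInfo {
  int pdg = 0;
  float time = 0;
};

// Hypothetical flat layout replacing vector<vector<SimpleInfo>>.
struct Flattened {
  std::vector<SimpleInfo> hits;  // all inner elements, stored back to back
  std::vector<int> firstIdx;     // firstIdx[i]: position in hits of group i's first element
  std::vector<int> count;        // count[i]: number of elements in group i
};

// Flatten the nested structure, recording where each inner vector starts.
Flattened flatten(const std::vector<std::vector<SimpleInfo>>& nested) {
  Flattened out;
  for (const auto& inner : nested) {
    out.firstIdx.push_back(static_cast<int>(out.hits.size()));
    out.count.push_back(static_cast<int>(inner.size()));
    out.hits.insert(out.hits.end(), inner.begin(), inner.end());
  }
  return out;
}

// Recover group i (e.g. the hits belonging to track i) from the flat layout.
std::vector<SimpleInfo> groupOf(const Flattened& f, std::size_t i) {
  assert(i < f.firstIdx.size());
  const auto begin = f.hits.begin() + f.firstIdx[i];
  return std::vector<SimpleInfo>(begin, begin + f.count[i]);
}
```

With a layout like this, each of hits, firstIdx and count would be an ordinary one-dimensional branch, so the per-member columns (hits.pdg, hits.time) should be exposed in the same way as for vector_obj above.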

Thanks,
Andy

SimpleInfo.hh (157 Bytes)
SimpleNtupleIdxs.C (1.7 KB)

Tested with:
ROOT Version: 6.32.06
Built for linuxx8664gcc on Feb 25 2025, 21:43:44
From tags/6-32-06@6-32-06

Actually, I just managed to fix it. The vector<int> in the struct should be the first member. This worked fine:

struct SimpleInfo {
  std::vector<int> idxs;
  int pdg = 0;
  float time = 0;
};

but this throws the bad_alloc exception:

struct SimpleInfo {
  int pdg = 0;
  float time = 0;
  std::vector<int> idxs;
};

I’m not sure if this is a bug but let me know if you want me to open a bug report / GitHub issue / whatever

Thanks,
Andy
