Some questions about snapshot behaviour

mmaneyro · October 5, 2023, 1:48pm

Hi,
I used the attached macro to Redefine() a series of columns on my RDataFrame. However, using Snapshot to save to file writes the old columns as well as the redefined ones, leading to duplicated data. Can this be avoided?

When saving the redefined columns Snapshot changes the “.” characters in the column names to underscores. I suppose this has to do with the validity of strings containing dots as C++ variable names. Is there a way to get around this to keep my original column names? The dots come from the original TTree and I would like to maintain the format if possible.

Thanks in advance for your help.

snapshot_friend_test.C (6.8 KB)

ROOT Version: 6.28/00
Platform: Not Provided
Compiler: Not Provided

ETA: I think both problems are actually closely related, as the columns that were not renamed by Snapshot (because they didn’t contain “.”) do not show up twice.
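
For illustration, the relation between the renaming and the duplicates can be sketched in plain C++ without ROOT. This is only a sketch: it assumes the renaming is a simple "." to "_" substitution, as observed above, and the helper names sanitize_column_name and drop_dotted_duplicates are hypothetical, not part of ROOT.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: mimic the renaming observed in this thread by
// replacing every '.' in a column name with '_'. This is an assumption
// about the observed behaviour, not ROOT's actual implementation.
std::string sanitize_column_name(std::string name) {
    std::replace(name.begin(), name.end(), '.', '_');
    return name;
}

// Hypothetical helper: drop every dotted column whose sanitized form is
// also present in the list; all other columns keep their original order.
std::vector<std::string> drop_dotted_duplicates(const std::vector<std::string>& columns) {
    const std::unordered_set<std::string> present(columns.begin(), columns.end());
    std::vector<std::string> kept;
    for (const auto& name : columns) {
        const bool has_dot = name.find('.') != std::string::npos;
        if (has_dot && present.count(sanitize_column_name(name)) > 0) {
            continue; // the underscore version is already in the list
        }
        kept.push_back(name);
    }
    return kept;
}
```

A list filtered this way could in principle be handed to the column-list argument of Snapshot, but whether that avoids the duplication in 6.28 is not confirmed in this thread.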

bellenot · October 5, 2023, 2:00pm

Maybe @vpadulan knows

eguiraud · October 11, 2023, 8:35pm

Hi @mmaneyro ,

sorry for the delay.

I don’t think there is a way to Redefine a branch with a . in its name and save it to a new file with Snapshot keeping the name with the . – it’s not supported.

About Snapshot saving both the original branch with the . in the name and the new one changing the . with _, that’s surprising and sounds like a bug. Feel free to open an issue, or maybe @vpadulan or @mczurylo could take a look.

Sorry I don’t have better news!
Enrico

mmaneyro · October 11, 2023, 9:23pm

Hi Enrico,
Thank you for the reply. Unfortunately I figured this would be the case.
The renaming and column duplicates are easy enough to work around. What has been giving me trouble is that the Redefined columns save as RVec, and not as the original type of the branches (Int_t, Double_t), etc. The leaves also don’t get saved within the branches they previously belonged to. I know something “breaks” during the renaming. Saving an unmodified tree doesn’t give any issues with the tree structure.

My main issue is that I need to use the merged tree I generate as the input to another macro, and with this altered structure reading a tree entry doesn’t seem to give me the data I need (it seems to return the entire RVec). This is what TTree->Show() prints for the tree produced by Snapshot.

======> EVENT:0
 Event_Nparticles = (ROOT::VecOps::RVec<int>*)0x55d85efdb070
 Event_ScalePDF  = (ROOT::VecOps::RVec<double>*)0x55d85fc7c000
 Event_CouplingQED = (ROOT::VecOps::RVec<double>*)0x55d85f12a190
 Event_CouplingQCD = (ROOT::VecOps::RVec<double>*)0x55d85fce75a0
 Rwgt_fUniqueID  = (ROOT::VecOps::RVec<unsigned int>*)0x55d860153230
 Rwgt_fBits      = (ROOT::VecOps::RVec<unsigned int>*)0x55d85fa7e4a0
 Rwgt_Weight     = (ROOT::VecOps::RVec<double>*)0x55d85f620170
 Rwgt_size       = (ROOT::VecOps::RVec<int>*)0x55d85f6cdb10
 Particle_fUniqueID = (ROOT::VecOps::RVec<unsigned int>*)0x55d85f30f710
 Particle_PID    = (ROOT::VecOps::RVec<int>*)0x55d85db07b90
 Particle_Status = (ROOT::VecOps::RVec<int>*)0x55d859ada960
 Particle_Mother1 = (ROOT::VecOps::RVec<int>*)0x55d8600826e0
 Particle_Mother2 = (ROOT::VecOps::RVec<int>*)0x55d85f9f57e0
 Particle_ColorLine1 = (ROOT::VecOps::RVec<int>*)0x55d85f58ffa0
 Particle_ColorLine2 = (ROOT::VecOps::RVec<int>*)0x55d856d24ef0
 Particle_Px     = (ROOT::VecOps::RVec<double>*)0x55d85f8e05a0
 Particle_Py     = (ROOT::VecOps::RVec<double>*)0x55d85f4f5440
 Particle_Pz     = (ROOT::VecOps::RVec<double>*)0x55d8601cdb20
 Particle_E      = (ROOT::VecOps::RVec<double>*)0x55d85f5f3ed0
 Particle_M      = (ROOT::VecOps::RVec<double>*)0x55d85f7ccfe0
 Particle_PT     = (ROOT::VecOps::RVec<double>*)0x55d85f736e70
 Particle_Eta    = (ROOT::VecOps::RVec<double>*)0x55d85f009e30
 Particle_Phi    = (ROOT::VecOps::RVec<double>*)0x55d85f9f9750
 Particle_Rapidity = (ROOT::VecOps::RVec<double>*)0x55d8600ece30
 Particle_LifeTime = (ROOT::VecOps::RVec<double>*)0x55d85f03c7d0
 Particle_Spin   = (ROOT::VecOps::RVec<double>*)0x55d85ec3d130
 Particle_size   = (ROOT::VecOps::RVec<int>*)0x55d85fe65eb0
 Event           = 1
 Event.fUniqueID = 0
 Event.fBits     = 50331648
 Event.Number    = 1
 Event.ProcessID = 1
 Event.Weight    = 4.37479e+06
 Event_size      = 1
 Particle        = 5
 Particle.fBits  = 50331648, 50331648, 50331648, 50331648, 50331648

Assuming there is no way around this from Snapshot, I’ve been using RVec[entry][item]. It seems to work; is this the correct way to read the value I need?

Thanks

eguiraud · October 11, 2023, 11:54pm

Ah, yes, we should add a note in the documentation about what Snapshot does with C-style arrays: when you have branches in the input tree containing C-style arrays (int*, Double_t*, as I understand is your case), then Snapshot is able to write them out again as C-style arrays if the values come from the original branches.

Otherwise, RDataFrame has to go through the intermediate RVec representation (as is the case with your Redefine-d columns): RDataFrame operations (other than Snapshot in the particular case described above) do not handle C-style arrays; they are converted to RVecs. Snapshot will then write these columns out as RVecs, because that’s their type now.
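
As a rough picture of that conversion, here is a minimal sketch in plain C++. It uses std::vector as a stand-in for ROOT::VecOps::RVec so that it compiles without ROOT; the helper names to_vector and item_at are hypothetical, and nothing here claims to reproduce what RDataFrame does internally (for example, whether the data is copied or only viewed).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the conversion described above: a C-style array plus its
// length is turned into an owning container. std::vector is used in place
// of ROOT::VecOps::RVec (an assumption made so the sketch builds without ROOT).
template <typename T>
std::vector<T> to_vector(const T* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    return std::vector<T>(data, data + size);
}

// Hypothetical accessor: once an entry's values live in a vector-like
// container, a single item is read by index. at() throws
// std::out_of_range if the index is not valid.
template <typename T>
T item_at(const std::vector<T>& entry_values, std::size_t item) {
    return entry_values.at(item);
}
```

Under this picture, a column that used to hold a single Int_t per entry would become a one-element collection, read with index 0.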

I don’t know, I’m missing a lot of context. Feel free to provide a sample input file and a self-contained, stripped down reproducer that I can take a look at.

Cheers,
Enrico

mmaneyro · October 12, 2023, 12:22am

Thank you, I have already figured out how to adapt the other macro to read the RVecs. I’m working through some details that I still need to fix, but I think at this point they’re unrelated to the TTree type/structure.
Just to confirm, I can’t do an operation of this form:

auto append_func_call_int = [](ROOT::VecOps::RVec<int> inputArray1, ROOT::VecOps::RVec<int> inputArray2) {
    const auto size = inputArray2.size();
    for (size_t i = 0; i < size; i++)
        inputArray1.emplace_back(inputArray2[i]);
    return inputArray1;
};

on tree branches without relying on RDataFrame, right? If so, working around the details that come up through Snapshot is basically the only way to rewrite my tree the way I need to, so I can just stick with that even if it’s not “pretty”.
Best regards

eguiraud · October 13, 2023, 5:56pm

As I understand the snippet, you want to take a TTree with branches A and B which contain collections of integers, and produce a new tree in which A, for every event, is the concatenation of the original A and B.

You can do that by reading and writing the TTrees directly, with tree->SetBranchAddress and tree->Branch calls, but it’s more convoluted (it’s what RDF does under the hood).

P.S.
Note that you can just write the body of the snippet as return Concatenate(inputArray1, inputArray2) (from ROOT::VecOps). RVecs have a lot of useful helper functions.
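
For readers without ROOT at hand, the per-event operation can be sketched in plain C++. This is only a stand-in: it uses std::vector instead of RVec, and the function name concatenate is hypothetical; with ROOT available, ROOT::VecOps::Concatenate on RVecs plays this role.

```cpp
#include <cassert>
#include <vector>

// Stand-in for the per-event operation discussed above: return a new
// collection holding the elements of `first` followed by those of `second`.
// std::vector replaces RVec here so the sketch compiles without ROOT.
template <typename T>
std::vector<T> concatenate(const std::vector<T>& first, const std::vector<T>& second) {
    std::vector<T> result;
    result.reserve(first.size() + second.size()); // one allocation up front
    result.insert(result.end(), first.begin(), first.end());
    result.insert(result.end(), second.begin(), second.end());
    return result;
}
```

Taking the inputs by const reference avoids copying both collections on every call, which the by-value lambda in the question does.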

system · October 27, 2023, 5:57pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.