I am trying to figure out a way to loop over “unique” events in a Draw statement. I define “unique” events to have a unique pair of values of “run” and “event” branches, for example. I know I can do it with a manual loop using std::set and constructing a TEventList, but I’m trying to avoid such explicit loops.
I naively expected TTreeIndex to handle this behind the scenes, with run as the major index and event as the minor index, but it does not seem to work. This is probably by design (?). I have more than one root file in my current directory and I know for sure there are duplicate (run,event) pairs amongst them.
Example code with what I expect to be printed out:
import ROOT as r
c1 = r.TChain("t")
c1.Add("*.root")
print(c1.Draw("1","1")) # should be sum of events in root files
index = r.TTreeIndex(c1, "run","event")
print(index.GetN()) # should be "unique" number of events
c1.SetTreeIndex(index)
print(c1.Draw("1","1")) # should be "unique" number of events
Yet, all 3 numbers that get printed out are the same.
Can you recommend a way to quickly draw quantities for only the unique events? Hope I’m not being vague.
Did you have a look at TDataFrame? It is not clear to me what you are actually trying to achieve. Do you want to select from your chain all the events characterised by a certain run and event number?
If I can naively compare TDataFrame to the dataframe used by pandas, for example, then I’m sure there’s a clean way to do what I want.
Basically, I have events coming from two datasets (and thus, let’s say, two root files) which can have overlapping events. The unique identifiers are (run#,event#). At the end of the day, I’d like to have some kind of TChain that transparently considers only unique events so that I don’t double count when filling histograms in Draw statements.
TDataFrame is currently unavailable to me because the environment I am restricted to requires v6.02 of ROOT. Of course, I could try to make it work, but I’m just wondering if there’s a way to do it without TDataFrame.
My recommendation would be to upgrade your ROOT version: I understand that there are constraints, but 6.02 is quite an old release.
That said, there are many ways to achieve what we are discussing. I think the key point is to implement a way to discard an event if it has already been “used” to fill a histogram (or perform whatever action). If we think about the pair of values run-event, the simplest data structure that comes to the rescue is the std::set.
More concretely:
std::set<std::pair<unsigned int, unsigned int>> analysedEvents;
// here we start the event loop, e.g. with TTreeReader
while (myReader.Next()) {
   auto run = *run_readerValue;
   auto evt = *evt_readerValue;
   // Skip if event already studied. insert() returns a pair: the second element
   // is true if the element inserted was not in the set, false otherwise.
   if (!analysedEvents.insert({run, evt}).second) continue;
   // Here do the work, e.g. fill histos...
   ...
}
Depending on the size of the set, you may want to try out std::unordered_set too, just to check whether there is a sizable performance benefit.
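Since the original question uses PyROOT, the same first-occurrence filter can be sketched in plain Python. This is only an illustration under assumptions: the helper name first_occurrence_entries is made up for this example, and the toy list stands in for the (run, event) values that would be read from the chain. A Python set is hash based, so it behaves more like std::unordered_set than std::set.

```python
def first_occurrence_entries(run_event_pairs):
    """Return the entry numbers at which each (run, event) pair is first seen.

    Hypothetical helper for illustration; it is not part of ROOT.
    """
    seen = set()
    unique_entries = []
    for entry, key in enumerate(run_event_pairs):
        key = tuple(key)
        if key in seen:
            continue  # duplicate (run, event): skip this entry
        seen.add(key)
        unique_entries.append(entry)
    return unique_entries


# Toy input standing in for the (run, event) values read from the chain.
pairs = [(1, 10), (1, 11), (1, 10), (2, 10), (1, 11)]
unique = first_occurrence_entries(pairs)
print(unique)  # [0, 1, 3]
```

With PyROOT, the pairs would come from a loop over the chain, and the resulting entry numbers could then be entered into a TEventList, as mentioned in the question, so that later Draw calls only see the first copy of each event. That last step is not shown here and has not been checked against ROOT 6.02.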
I hope that helps.
I am indeed afraid that TTree::Draw cannot help you here the way something like TDataFrame could.
Just remember to deactivate the branches you do not use in your Python loop. That will drastically speed up the program by cutting out unnecessary decompression/deserialisation.
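A minimal sketch of the branch deactivation, assuming the tree or chain exposes SetBranchStatus(name, status) as TTree does. The helper name keep_only_branches and the RecordingTree stand-in are hypothetical; the stand-in only records the calls so the example runs without ROOT.

```python
def keep_only_branches(tree, branch_names):
    """Disable every branch, then re-enable only the ones that are needed."""
    tree.SetBranchStatus("*", 0)
    for name in branch_names:
        tree.SetBranchStatus(name, 1)


class RecordingTree(object):
    """Hypothetical stand-in for a TChain; it just records the calls made."""

    def __init__(self):
        self.calls = []

    def SetBranchStatus(self, name, status):
        self.calls.append((name, status))


tree = RecordingTree()
keep_only_branches(tree, ["run", "event"])
print(tree.calls)  # [('*', 0), ('run', 1), ('event', 1)]
```

In the real loop the chain from the question (c1) would be passed instead of the stand-in, with the list extended by whichever branches the histograms need.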
Let us know if there are issues.