A simple, robust and fast interface to read values from ROOT columnar datasets such as TTree, TChain or TNtuple.
Most of the examples I have just looked at on the forum, by both users and ROOT developers, also use TTreeReader.
However, when I open documentation for TTree, there is only one brief mention of TTreeReader in the description of MakeSelector. All examples use branches.
Is either of these methods deprecated or unsupported (or will it be in the future)?
Are branches the only way to fill a TTree? (I'm asking because reading with branches would be more symmetric in this case.)
Are there any known performance differences? I read in TTreeReader's code that it uses some Proxy Branches; does it mean that it is just a higher-level interface to using branches? Is TTreeReader just for convenience (as I understand, it and RDataFrame are very scalable for other data sets; but what if we speak only about a tree)?
Are any of these ways buggier / less tested? Why is TTreeReader called robust?
Is there any difference between them from Python's point of view?
I don't suggest using TTree directly to read ROOT data: it's a low-level API, gives very few (type) safety guarantees, performs very few sanity checks, and it's also trickier to get good performance out of it.
For me a low-level API is usually performant; but maybe Enrico could explain.
It seems I can't use TTreeReader from Python, because I can't get the value of a TTreeReaderValue (it uses the operator "*" in C++, and I don't know how to translate that to Python; I couldn't find the answer).
ROOT.TTreeReaderValue['double']
age = ROOT.TTreeReaderValue('double')(reader, "Age")
staff_list = []
for entry in itertools.islice(tree, 10):
    # this adds just objects, need numbers
    staff_list.append(age)
…it also reads the whole branch. If you wanted to benefit from the optimization, you would need to do (from Python) SetBranchAddress + GetEntry. Alternatively, you could also use RDataFrame to efficiently read tree data from Python.
Hi @ynikitenko ,
I'll try to reply to all questions in order, but in case I miss some, please point them out again:
No
RDataFrame::Snapshot can also be used to write TTrees out.
Under the hood, all interfaces use TTree and TBranch. Raw TTree used naively can be slow, but if you know what you are doing, using TTree directly gives the best (single-thread) performance. The most performant way to use TTrees is to call TBranch::GetEntry on each branch you want to read for each entry rather than calling TTree::GetEntry, and possibly to call TBranch::GetEntry only lazily, if strictly needed for a given event, to avoid deserializing a branch value when it's not needed. TTree::SetBranchAddress is not type-safe, so you can end up reading garbage if you are not careful. TTreeReader helps with that. TTreeReader also helps with only deserializing what you really need, but in many use cases it turns out it's not as fast as using TTree "smartly" directly. RDataFrame has the same advantages and disadvantages as TTreeReader, but it also makes it very simple to parallelize the event loop, which is not that simple when using raw TTree.
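The lazy per-branch reading described above could look roughly like the sketch below from Python. It is only an illustration under assumptions: the helper name `read_branch_lazily` and its parameters are hypothetical, both branches are assumed to be top-level branches holding a single Int_t each (so this does not apply as-is to a tree whose leaves live inside one struct branch), and the tree is assumed to be a plain TTree rather than a TChain.

```python
from array import array


def read_branch_lazily(tree, cut_name, value_name, passes_cut, max_entries=None):
    """Return value_name for entries whose cut_name value satisfies passes_cut.

    Hypothetical helper: assumes both branches are top-level Int_t branches
    of a plain TTree (not a TChain).
    """
    # One-element buffers that ROOT fills in place. The type code 'i' must
    # match the branch type; SetBranchAddress does not check this for us.
    cut_buf = array("i", [0])
    value_buf = array("i", [0])
    tree.SetBranchAddress(cut_name, cut_buf)
    tree.SetBranchAddress(value_name, value_buf)

    cut_branch = tree.GetBranch(cut_name)
    value_branch = tree.GetBranch(value_name)

    n_entries = tree.GetEntries()
    if max_entries is not None:
        n_entries = min(n_entries, max_entries)

    selected = []
    for i in range(n_entries):
        # Always read the cheap selection branch...
        cut_branch.GetEntry(i)
        if not passes_cut(cut_buf[0]):
            continue
        # ...and deserialize the second branch only when it is needed.
        value_branch.GetEntry(i)
        selected.append(value_buf[0])
    return selected
```

With a real tree this might be called as `read_branch_lazily(tree, "Flag", "Age", lambda flag: flag == 1)`, where the branch names are placeholders. Note that the loop itself still runs in Python, so the saving comes only from the skipped deserialization.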
I don't think there is a lot of difference in test coverage. RDataFrame probably has the best direct coverage, but indirectly RDataFrame tests also test TTreeReader and TTree. I think TTreeReader is called more robust in the docs because of the extra type-safety checks, better error handling and an API that is harder to use incorrectly w.r.t. TTree.
Nothing major comes to mind, but of course your mileage may vary. RDataFrame usage from Python often still requires writing C++ helper functions (for now, a more Pythonic RDF API is on my wish list).
See above: you have to know what you are doing to get good performance out of raw TTree, but it's what you want if you need to squeeze out the last (single-thread) percent of ROOT I/O performance.
I don't know why Wim said that TTreeReader was not recommended in Python (in 2014), but I think you can use it just fine. You should be able to use TTreeReaderValue::Get instead of operator*.
Thanks a lot, Enrico!
This perfectly answers the question.
However, it seems I canât use the method Get (and canât find the answer on the internet).
Here is my more complete code
reader = ROOT.TTreeReader(staff_tree)
ROOT.TTreeReaderValue['int']
age = ROOT.TTreeReaderValue('int')(reader, "staff.Age")
for entry in itertools.islice(reader, 10):
    print(age.Get())
> …
<cppyy.LowLevelView object at 0x7f5d51f6e500>
When I use
age.Get()[0]
it prints correct ages as numbers! A strange thing is that they are shifted by one (the iteration starts from the 1st entry, not the 0th).
age.__deref__()
works too, but the results are shifted as well.
I iterate as I wrote earlier,
reader = ROOT.TTreeReader(staff_tree)
ROOT.TTreeReaderValue['int']
age = ROOT.TTreeReaderValue('int')(reader, "staff.Age")
staff_list = []
for entry in itertools.islice(reader, 10):
(islice could be omitted). It works differently when I iterate for entry in staff_tree.
# via entry's attributes
f = TFile("staff.root")
staff_tree = f.Get("T")
# correct
for staff_m in itertools.islice(staff_tree, 10):
    print(staff_m.Age, ", ", sep="", end="")
print()
# also correct
reader = ROOT.TTreeReader(staff_tree)
ROOT.TTreeReaderValue['int']
age = ROOT.TTreeReaderValue('int')(reader, "staff.Age")
reader.Next()
print(age.Get()[0])
reader.Next()
print(age.Get()[0])
You can try it yourself, it should be much faster. The file staff.root is generated by the example macro $ROOTSYS/tutorials/pyroot/staff.py. I say that the numbers are "correct" because they are the same as those printed by TTree::Scan. It looks like a bug in TTreeReader in Python. Could you please fix that (create an issue or just make a PR)? Thank you. Sorry that I couldn't reply quickly; I have been a bit busy.
Hi @ynikitenko ,
sorry, I am not sure I understand the latest issue.
Does this mean the above code is slower than expected (in which case, what are you comparing it with?) or that the new version of the code should be faster than the old?
The numbers are correct: what looks like a bug? That the iteration with itertools.islice skips one element? If so, can you please check whether simple iteration (for _ in reader: ...) is also broken?
Hi @eguiraud ,
"it should be much faster" - I mean that you can copy my code and check anything you want, because the cycle "suggestion at CERN -> check in Moscow -> new suggestion at CERN" is definitely longer than if we skip the Moscow part.
Yes, exactly, one element is skipped. This is the problem and a bug (unless it is stated in documentation, which is not the case).
No, islice just takes the first 10 elements (I don't want to output them all). It cannot affect the first element taken.
Hope that suits you fine. Thank you very much!
Ah, I see. Well, it's not necessarily faster if you take into account my own latency (due to other posts, other bugs to fix, other features to implement, etc.), so thanks for your help.
This is definitely a bug, now reported as "Python iteration over a TTreeReader is broken" (Issue #8183 in root-project/root on GitHub). Our PyROOT experts will take a look as soon as possible. In the meantime, you can use while reader.Next() instead of the for loop to get the correct behavior (or use RDataFrame instead, which has the performance advantage of pushing the loop to C++).
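A minimal sketch of the `while reader.Next()` workaround follows. The helper name `collect_values` and its `limit` parameter are hypothetical; the `value.Get()[0]` access follows what was reported earlier in this thread (Get returns a low-level view that has to be indexed), so it may need adjusting for other ROOT versions or value types.

```python
def collect_values(reader, value, limit=None):
    """Collect values by driving the reader with Next() instead of a for loop.

    Hypothetical helper: `reader` is expected to behave like a TTreeReader
    and `value` like a TTreeReaderValue bound to it.
    """
    collected = []
    # Check the limit first so the reader is never advanced past the
    # last entry that is actually returned.
    while (limit is None or len(collected) < limit) and reader.Next():
        # Get() returns a view on the current value; [0] extracts the number.
        collected.append(value.Get()[0])
    return collected


# Intended use with the objects from this thread (requires ROOT):
#   reader = ROOT.TTreeReader(staff_tree)
#   age = ROOT.TTreeReaderValue('int')(reader, "staff.Age")
#   print(collect_values(reader, age, limit=10))
```

Because the first call to Next() moves the reader onto entry 0, this loop should start from the first entry rather than skipping it.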
many thanks for the bug report! I would not be able to make such a succinct example of that bug in any reasonable time.
As for me - I use tree iteration and will switch to branch iteration as you suggested, so I'm not affected by that bug. Anyway, I'm glad to know how to use other methods, and it's great that there will be one bug less (hopefully) soon.
As for this thread, it looks like all is answered, so it can be closed.