Reading DAOD_PHYS into RDataFrame

Naive question about reading DAOD_PHYS into RDataFrame.

I’d like to read information about jets from a DAOD_PHYS file (one of these, for a concrete example; CERN authentication required). DAOD_PHYSLITE files are easy: I can access the column “AnalysisJetsAuxDyn.pt” to get jet pt. For example:

import ROOT; ROOT.xAOD.Init()
r_frame = ROOT.RDataFrame("CollectionTree", "/path/to/PHYSLITE.root")
filtered = r_frame.Filter("AnalysisJetsAuxDyn.pt.size()>0")
print(filtered.Count().GetValue())

This prints the number of events with at least one jet. Equally, I could index into “AnalysisJetsAuxDyn.pt” and use the individual jet pt values to build filters, new columns, etc.
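As a sketch of that second point, the snippet below wraps the filter-then-define pattern in a small helper. The helper name `add_first_jet_pt` and the output column name `first_jet_pt` are made up for illustration, and it is assumed that indexing the column with `[0]` inside an expression works the same way as calling `.size()` on it does in the example above.

```python
def add_first_jet_pt(frame, pt_column="AnalysisJetsAuxDyn.pt", new_column="first_jet_pt"):
    """Keep events with at least one jet, then store the first jet's pt.

    `frame` can be any object with RDataFrame-style Filter and Define
    methods. The function name and `new_column` default are hypothetical.
    """
    # Filter first, so the [0] below is never evaluated on an empty vector.
    with_jets = frame.Filter(f"{pt_column}.size() > 0")
    # Assumption: the column can be indexed with [0] inside the expression.
    return with_jets.Define(new_column, f"{pt_column}[0]")


# Possible usage with the frame from the example above (needs ROOT and a file):
# with_pt = add_first_jet_pt(r_frame)
# print(with_pt.Mean("first_jet_pt").GetValue())
```

Passing the frame in as an argument keeps the helper independent of how the frame was created.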

But in the PHYS files, I can see a branch “AntiKt4EMPFlowJetsAux” which has a “pt” leaf that looks great. It plots fine in a TBrowser too. I cannot figure out how to read it in RDataFrame, though.

I tried “AntiKt4EMPFlowJetsAux.pt”, but apparently that isn’t what’s needed:

input_line_220:2:35: error: use of undeclared identifier 'AntiKt4EMPFlowJetsAux'
auto lambda0 = [](){return argmax(AntiKt4EMPFlowJetsAux.pt)

Is there a way to access these variables?



ROOT Version: 6.24/06
Platform: NAME=“CentOS Linux” VERSION=“7 (Core)” ID_LIKE=“rhel fedora” CLUSTER=“sunrise”
Compiler: using pyroot


Hi @Day ,

it looks like RDF does not recognize AntiKt4EMPFlowJetsAux.pt as a valid column name. What does df.GetColumnNames() contain?

Cheers,
Enrico

Hey, many thanks for the prompt reply. To answer your question, df.GetColumnNames() contains a lot!

[ins] In [12]: all_columns = [str(name) for name in r_frame.GetColumnNames()]
          ...: print(len(all_columns))
          ...: with_akt4 = [name for name in all_columns if "AntiKt4EMPFlowJetsAux" in name]
          ...: print(len(with_akt4))
1601
53

[ins] In [13]: pt_withakt4 = [name for name in with_akt4 if "pt" in name.lower()]

[ins] In [14]: pt_withakt4
Out[14]: 
['AntiKt4EMPFlowJetsAuxDyn.JetConstitScaleMomentum_pt',
 'AntiKt4EMPFlowJetsAuxDyn.JvtRpt',
 'AntiKt4EMPFlowJetsAuxDyn.ActiveArea4vec_pt',
 'AntiKt4EMPFlowJetsAuxDyn.NumTrkPt1000',
 'AntiKt4EMPFlowJetsAuxDyn.NumTrkPt500',
 'AntiKt4EMPFlowJetsAuxDyn.SumPtChargedPFOPt500',
 'AntiKt4EMPFlowJetsAuxDyn.SumPtTrkPt500',
 'AntiKt4EMPFlowJetsAuxDyn.TrackWidthPt1000',
 'AntiKt4EMPFlowJetsAuxDyn.GhostCHadronsFinalPt',
 'AntiKt4EMPFlowJetsAuxDyn.GhostBHadronsFinalPt',
 'AntiKt4EMPFlowJetsAuxDyn.DFCommonJets_QGTagger_truthjet_pt',
 'AntiKt4EMPFlowJetsAuxDyn.NumChargedPFOPt500',
 'AntiKt4EMPFlowJetsAuxDyn.NumChargedPFOPt1000',
 'AntiKt4EMPFlowJetsAuxDyn.ChargedPFOWidthPt1000']

But even after narrowing it down to columns with likely names, nothing seems to contain exactly what I want.
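A quick way to confirm this is to test for exact leaf names rather than substrings. The sketch below uses a hand-copied subset of the names printed above instead of a real file, and the helper name `columns_ending_with` is made up for illustration.

```python
def columns_ending_with(columns, leaf):
    """Return the column names whose last dot-separated part equals `leaf`."""
    return [name for name in columns if name.split(".")[-1] == leaf]


# A few names copied from the listing above (not the full set of 53).
sample = [
    "AntiKt4EMPFlowJetsAuxDyn.JetConstitScaleMomentum_pt",
    "AntiKt4EMPFlowJetsAuxDyn.JvtRpt",
    "AntiKt4EMPFlowJetsAuxDyn.ActiveArea4vec_pt",
    "AntiKt4EMPFlowJetsAuxDyn.NumTrkPt1000",
]

# None of these has a bare "pt" leaf, so the result is an empty list.
print(columns_ending_with(sample, "pt"))
# The name tried earlier is not among them either.
print("AntiKt4EMPFlowJetsAux.pt" in sample)
```

On a real frame, the same helper could be given `[str(name) for name in r_frame.GetColumnNames()]`.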

Interestingly, I did spot that AntiKt4EMPFlowJetsAux. (the dot at the end is not a typo) is a column. I tried using AntiKt4EMPFlowJetsAux..pt in a filter expression, but that chokes the JIT compiler:

    runtime_error: Failed to tokenize expression:
AntiKt4EMPFlowJetsAux..pt.size() > 0

Additional, potentially relevant info;

[ins] In [18]: r_frame.Display("AntiKt4EMPFlowJetsAux.")
Out[18]: 
<cppyy.gbl.ROOT.RDF.RResultPtr<ROOT::RDF::RDisplay> object at 0x1006e8c0>
[ins] In [19]: r_frame.GetColumnType("AntiKt4EMPFlowJetsAux.")
Out[19]: 'xAOD::JetAuxContainer_v1'

Finally, I tried treating pt as a function;

[ins] In [20]: filtered = r_frame.Filter("AntiKt4EMPFlowJetsAux.pt().size() > 0")
input_line_231:2:28: error: use of undeclared identifier 'AntiKt4EMPFlowJetsAux'
auto lambda0 = [](){return AntiKt4EMPFlowJetsAux.pt().size() > 0
                           ^

Hi @Day ,

indeed it looks like RDF has trouble working with columns that have this final dot in the name (a special TTree notation for columns pertaining to split objects, though I must admit I am not quite sure of the details). Technically RDF should be able to read these types, but I can see a couple of ways in which it could get confused by these branch hierarchies.

Probably the quickest way to make progress would be for me to play with such a file directly. Would you be able to share a file with even just 10 events or so? Sometimes it requires permission from ATLAS, but they are usually OK with sharing a small file (with simulated data) privately with a member of the ROOT team.

You might also be interested in reaching out to Attila Krasznahorkay about getting access to and trying out the latest version of xAOD-DataSource, an implementation of ROOT's RDataSource interface for reading xAOD files through the RDataFrame interface (available on Zenodo).

Cheers,
Enrico

Hi @eguiraud ,

I can totally see why that would be awkward to handle.

Thanks for the link you sent, I will have a look through that repo and see if there is something there that can fix the column-with-dot-in-name issue.

Otherwise I will check back with my supervisor and see if I can come up with some data or a toy dataset that is valid DAOD_PHYS data and can be shared.

Many thanks,
Henry


The link you gave got me started on a solution, so I wanted to follow up on this for future readers.
Say you are on lxplus and have run setupATLAS and asetup Athena 22.0.62; then:

## Example for DAOD_PHYS
import ROOT; ROOT.xAOD.Init(); ROOT.xAOD.JetContainer_v1()
from xAODDataSource import Helpers
df = Helpers.MakexAODDataFrame("example_data/from_lxplus/DAOD_PHYS.28673837._000002.pool.root")
filtered = df.Filter("AntiKt4EMPFlowJets.size() > 0")
pt = filtered.Define("pt", "AntiKt4EMPFlowJets[0]->pt()")
pt.Mean("pt").GetValue()

vs.

## Example for DAOD_PHYSLITE
import ROOT;ROOT.xAOD.Init()
r_frame = ROOT.RDataFrame("CollectionTree", "example_data/from_lxplus/DAOD_PHYSLITE.28673837._000002.pool.root")
r_frame = r_frame.Filter("AnalysisJetsAuxDyn.pt.size()>0")
r_frame.Count().GetValue()

They have different internal structures of course, so more work is needed if you want cross compatibility.
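One possible way to keep that difference in one place is a small lookup of the per-format expressions, so the rest of the analysis only asks for "has jets" or "first jet pt". This is only a sketch: the dictionary, its keys, and the helpers `jet_expressions` and `select_first_jet_pt` are hypothetical. The expression strings are the ones from the two examples above, except the PHYSLITE `first_jet_pt` entry, which assumes the pt column can be indexed with `[0]`.

```python
# Hypothetical lookup table of per-format expression strings.
JET_EXPRESSIONS = {
    "PHYS": {
        "has_jets": "AntiKt4EMPFlowJets.size() > 0",
        "first_jet_pt": "AntiKt4EMPFlowJets[0]->pt()",
    },
    "PHYSLITE": {
        "has_jets": "AnalysisJetsAuxDyn.pt.size()>0",
        # Assumption: not taken from the thread, see the note above.
        "first_jet_pt": "AnalysisJetsAuxDyn.pt[0]",
    },
}


def jet_expressions(data_format):
    """Return the expression strings for "PHYS" or "PHYSLITE"."""
    if data_format not in JET_EXPRESSIONS:
        raise ValueError(f"unknown format: {data_format!r}")
    return JET_EXPRESSIONS[data_format]


def select_first_jet_pt(frame, data_format):
    """Apply the same filter and define steps to a frame of either format."""
    exprs = jet_expressions(data_format)
    return frame.Filter(exprs["has_jets"]).Define("pt", exprs["first_jet_pt"])
```

The frame itself still has to be built as in the examples above: through `Helpers.MakexAODDataFrame` for PHYS, and through `ROOT.RDataFrame("CollectionTree", ...)` for PHYSLITE.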

Thanks Henry!

Out of curiosity, why is cross-compatibility needed in practice? Is this for tooling / libraries (say CP tools or whatever ATLAS calls them)? I’d assume that analyses use either one or the other, not both - am I wrong?

Cheers, Axel

@Axel My supervisor has asked me to write something that works with PHYS format, with a possible interest in moving over to PHYSLITE if it becomes clear that the information we need will be available in PHYSLITE. So my priority is to work with PHYS, but in a way that is sufficiently well structured that a move to PHYSLITE wouldn’t be too painful.

Another advantage of supporting both is that I can explore what is actually in PHYSLITE. To be honest, I’m not sure what a lot of the new labels mean, but by comparison to PHYS it should be possible to work it out.

Right, good point: for transitioning to or probing the new format you want to support both. Thanks for explaining!
