
Converting a ROOT file including TClonesArray into CSV for XGBoost

Hi experts,

I am trying to convert my ROOT TTree file into a CSV for use with XGBoost. I have tried a few methods that I found on this site, including the one in the thread "ROOT tree to CSV file format", but I haven’t had any luck so far. When I try to use this Go-based method I get

root2csv: scanning leaves...
root2csv: >>> "Event_" TLeafElement not supported (struct)
panic: reflect: call of reflect.Value.Type on zero Value

goroutine 1 [running]:
reflect.Value.Type(0x0, 0x0, 0x0, 0x0, 0x30)
	/usr/local/Cellar/go/1.13.8/libexec/src/reflect/value.go:1877 +0x166
go-hep.org/x/hep/groot/rtree.(*tleafElement).Type(0xc00015dc70, 0x1c, 0xc00015bc58)
	/Users/laramason/go/src/go-hep.org/x/hep/groot/rtree/leaf.go:276 +0x4a
main.process(0x7ffeefbff8ba, 0x7, 0x7ffeefbff8d0, 0x28, 0x7ffeefbff8c5, 0x7, 0x0, 0x0)
	/Users/laramason/XGBoost/root2csv/main.go:92 +0x2b2
main.main()
	/Users/laramason/XGBoost/root2csv/main.go:65 +0x20a

I also tried using methods like the script tmva101_Training.py from the ROOT reference guide, but my ROOT file (from Delphes) seems to be more complicated than most examples on the web, as when I try to access the branches using for instance

data_sig = ROOT.RDataFrame("Delphes", signal_filename).AsNumpy()

I get

Error in <TTreeReaderValueBase::CreateProxy()>: The branch Event contains data of type TClonesArray. It cannot be accessed by a TTreeReaderValue<int>

I can access the tree (‘Delphes’) but I can’t figure out how to access the variables such as ‘Muon.pt’ which are inside a second branch ‘Muon’ (for example).

I have included a sample root file here https://www.dropbox.com/s/k3xyf8laetaalxt/ee_Toall_aTotata_161GeV_M6_10GeV.root?dl=0

Any help on how to convert this to CSV or to use it with XGBoost would be greatly appreciated!

Thanks so much,
Lara

ROOT Version: 6.18/04
Platform: OSX

Sorry, here’s a link to the folder that contains the root file! https://www.dropbox.com/sh/vjne9yz89kkn45f/AAB0y9nGXPJzXgK7b-aKO3Dsa?dl=0

As the output of Delphes contains “ragged arrays”, its output doesn’t lend itself readily to flat CSV files.

It can be done, of course, but - I am afraid - not in a completely automated way.
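To illustrate why some manual choice is needed, here is a minimal sketch of one way to flatten ragged per-event lists into fixed-width CSV columns. The helper `flatten_events`, the column names and the sample values are all hypothetical; the number of objects kept per event (`max_objects`) and the fill value for missing entries are decisions you would have to make for your own analysis.

```python
import csv
import io

def flatten_events(events, max_objects, fill=""):
    """Pad or truncate each event's list to exactly max_objects entries."""
    rows = []
    for values in events:
        padded = list(values[:max_objects])
        padded += [fill] * (max_objects - len(padded))
        rows.append(padded)
    return rows

# Hypothetical per-event muon PT lists: each event has a different length.
muon_pt = [[], [], [57.76, 38.24], [12.5]]

# Keep at most two muons per event; missing muons become empty cells.
rows = flatten_events(muon_pt, max_objects=2)

# Write to an in-memory buffer; pass a real file object to write to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["muon0_pt", "muon1_pt"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Events with more objects than `max_objects` are truncated, so information is lost unless the width is chosen large enough.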

What branches are you interested in?
And how do you plan to feed that data to xgboost?
Perhaps converting directly to the xgboost file format would be more useful?
(Alternatively, if xgboost or the machinery you’re using can read it, one could convert to the Arrow file format?)

hth

Hi Sebastien, thanks for your answer. I’m interested in the muon, electron and tau branches predominantly (I’m trying to improve an analysis with a final state of hadronic taus and leptons which is currently drowned by background). I’m new to xgboost, but had envisioned something like what was done in this Kaggle Higgs example https://github.com/phunterlau/kaggle_higgs. In all the examples I’ve seen they use a csv file input.

Can I ask what you mean when you suggest converting directly to xgboost file format? This does sound useful!

I know there is a ROOT data analysis framework TMVA - would it be easier to use a Delphes array in that framework? And can xgboost be accessed from there?

Thanks very much for your help!

Cheers,
Lara

ok. I had a look at how Delphes is writing out this TTree.
for the moment groot cannot fully interpret that data (it is missing handling of TRef, TRefArray, TVector3 and TLorentzVector.)

hopefully, I may have something tomorrow.

Thank you so much for your help!

hi,

so… right now groot's support for TClonesArray isn’t complete enough to correctly read all of the Delphes data (that’s something I definitely need to add, but it’s not completely trivial :P)

but you can use uproot (a pure Python equivalent of groot, striking a different balance of features than groot):

>>> import uproot
>>> f = uproot.open("./testdata/ee_Toall_aTotata_161GeV_M6_10GeV.root")
>>> t=f.get("Delphes")
>>> t["Event"]
<TBranchElement b'Event' at 0x7fc9205e6610>
>>> t["Muon.PT"].array()[:3]
<JaggedArray [[] [] [57.761646 38.23582]] at 0x7fc9206bf4f0>
>>> t["Event.Number"].array()
<JaggedArray [[0] [1] [2] ... [2497] [2498] [2499]] at 0x7fc91e699400>
>>> len(t["Event.Number"].array())
10000

the python interface of xgboost accepts numpy arrays, so that should be a workable solution.
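A jagged array still has to be turned into a rectangular one before it can serve as a feature matrix. The following is only a sketch of one possible approach: `pad_jagged` is a hypothetical helper, and the sample values are copied from the `Muon.PT` output above. It pads each event to a fixed width with NaN, on the assumption that the downstream tool treats NaN as a missing value (which I believe is the default for xgboost's `DMatrix`, but check the documentation for your version).

```python
import numpy as np

def pad_jagged(events, width, fill=np.nan):
    """Return a (n_events, width) float array, padded with `fill`."""
    out = np.full((len(events), width), fill, dtype=np.float32)
    for i, values in enumerate(events):
        n = min(len(values), width)  # truncate events longer than `width`
        out[i, :n] = values[:n]
    return out

# First three events of Muon.PT as printed in the session above.
muon_pt = [[], [], [57.761646, 38.23582]]

features = pad_jagged(muon_pt, width=2)

# Assumed next step, not run here:
# dtrain = xgboost.DMatrix(features, label=labels)
```

The per-event lists could come from converting the uproot jagged array to plain Python lists; how to do that depends on the uproot version.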

hth,
-s

Hi Sebastien,

This is great, thank you very much - I thiiiink I will be able to construct a root to csv script from this.

Thanks again for your time taken on this - I really appreciate it.

Cheers,
Lara
