Dear experts,
I have been trying to write a snapshot from friend trees with RDataFrame, both in multithreaded mode and with the jobs distributed over Spark clusters.
The input file has a nominal tree with vector branches for tracks; the other (systematic) trees omit these vector branches to reduce the file size.
The nominal tree has a looser selection applied, so it contains more events than the other trees.
I will use two trees as an example. Their contents are as follows:
① Nominal tree
  - EventNumber
  - TrackVariables (RVec branch)
  - EventVariables (jets, leptons, etc.)
② Sys tree
  - EventNumber
  - EventVariables (jets, leptons, etc., with values slightly varied compared to ①)
I would like to attach the track variables in ① to tree ②, matched by EventNumber, and snapshot the result into a ROOT file.
The code below is an example that uses multithreaded mode; I ran it on the SWAN service provided by CERN.
import ROOT
filepath = "root://eosuser//eos/atlas/unpledged/group-tokyo/users/ymino/SUSY_Dataset/NTuple/user.silu.Signal_mc16a.500963.e8253_a875_r9364_p5243_v2.6_DT_allSys_reducedJES_rel21_t2_tree.root/user.silu.32222173._000001.tree.root"
fin = ROOT.TFile( filepath )
## Read nominal tree with track vector
tchain = ROOT.TChain()
tchain.Add(filepath + "/" + "tree_NoSys")
## Read systematic tree with no track vector
tchain_sys = ROOT.TChain()
tchain_sys.Add(filepath + "/" + "tree_JET_Flavor_Response__1up")
## Build index using EventNumber branch common between tchain & tchain_sys
tchain_sys.BuildIndex("EventNumber")
tchain.AddFriend( tchain_sys, "sys")
ROOT.gInterpreter.GenerateDictionary("ROOT::VecOps::RVec<TString>","ROOT/RVec.hxx")
ROOT.ROOT.EnableImplicitMT()
rdf = ROOT.RDataFrame( tchain )
rdf = rdf.Define("sys_EventNumber","sys.EventNumber")
## Pass the column names as a list (a Python set has no guaranteed order)
rdf.Snapshot("test","test.root",["EventNumber","sys_EventNumber","trkD0"])
When I scan the output, the EventNumbers do not appear to be matched through the index: mismatched events are saved in the ROOT file, as shown below.
root [5] test->Scan("EventNumber:sys_EventNumber:trkD0[0]","","colsize=15 col=::20.3")
***********************************************************************
* Row * EventNumber * sys_EventNumber * trkD0[0] *
***********************************************************************
* 0 * 1929 * 1929 * -0.0592 *
* 1 * 1621 * 1356 * -0.0398 * <------- Events shifted from this row
* 2 * 1356 * 1797 * -0.564 *
* 3 * 1797 * 1217 * -0.119 *
* 4 * 1217 * 399 * 0.14 *
* 5 * 399 * 1554 * 0.109 *
* 6 * 1554 * 544 * -0.148 *
* 7 * 544 * 402 * 0.0267 *
* 8 * 402 * 941 * 0.0345 *
* 9 * 941 * 1178 * 0.0425 *
* 10 * 1178 * 226 * 0.0377 *
In single-threaded mode, I get the results I want.
root [1] test->Scan("EventNumber:sys_EventNumber:trkD0[0]","","colsize=15 col=::20.3")
***********************************************************************
* Row * EventNumber * sys_EventNumber * trkD0[0] *
***********************************************************************
* 0 * 1929 * 1929 * -0.0592 *
* 1 * 1621 * 1929 * -0.0398 * <------- Previous event is filled from the sys tree (But this is OK for me.)
* 2 * 1356 * 1356 * -0.564 *
* 3 * 1797 * 1797 * -0.119 *
* 4 * 1217 * 1217 * 0.14 *
* 5 * 399 * 399 * 0.109 *
* 6 * 1554 * 1554 * -0.148 *
* 7 * 544 * 544 * 0.0267 *
* 8 * 402 * 402 * 0.0345 *
* 9 * 941 * 941 * 0.0425 *
* 10 * 1178 * 1178 * 0.0377 *
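To make the difference between the two outputs concrete, here is a minimal pure-Python sketch (no ROOT involved). The two lists hold the first few EventNumber values read off the Scan outputs above; the function names `pair_by_position` and `pair_by_index` are made up for this illustration. It only reflects my reading of the tables, not a statement about what RDataFrame does internally: the multithreaded output looks like a row-by-row pairing, while the single-threaded output looks like a lookup by EventNumber.

```python
# First few EventNumber values, copied from the Scan outputs above.
nominal = [1929, 1621, 1356, 1797, 1217, 399]   # nominal tree (looser selection)
sys_tree = [1929, 1356, 1797, 1217, 399, 1554]  # sys tree (event 1621 is absent)


def pair_by_position(main, friend):
    """Pair entry i of the main list with entry i of the friend list."""
    return [
        (event, friend[i] if i < len(friend) else None)
        for i, event in enumerate(main)
    ]


def pair_by_index(main, friend):
    """Pair each main event with the friend entry that has the same EventNumber."""
    index = {event: i for i, event in enumerate(friend)}  # EventNumber -> entry
    return [
        (event, friend[index[event]] if event in index else None)
        for event in main
    ]


positional = pair_by_position(nominal, sys_tree)
indexed = pair_by_index(nominal, sys_tree)

print("EventNumber  by position  by index")
for (event, by_pos), (_, by_idx) in zip(positional, indexed):
    print(f"{event:>11}  {by_pos!s:>11}  {by_idx!s:>8}")
```

The "by position" column reproduces the shift seen in the multithreaded output from the row with EventNumber 1621 onwards. The "by index" column matches the single-threaded output, except that this sketch reports `None` for the unmatched event 1621, whereas the single-threaded snapshot kept the value of the previous sys entry (1929).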
Is there a way to read friend trees that have an index with RDataFrame in multithreaded mode, or when using Spark clusters? (Or is this only achievable in single-threaded mode?)
ROOT Version: JupyROOT 6.26/08
Platform: CentOS 7
Compiler: gcc11