Reading friend trees using RDataFrame in Multithread mode or Spark cluster

ymino · February 7, 2023, 10:19pm

Dear experts,

I have been trying to snapshot a file using friend trees, both in multithread mode and by distributing the jobs on a Spark cluster.
The input file has a nominal tree with vector branches for tracks; the other trees do not include these vector branches, to reduce the file size.
The nominal tree has a looser selection applied, so it contains more events than the other trees.
I will use two trees as an example, with contents like this:

① Nominal tree

EventNumber
TrackVariables (RVec branch)
EventVariables (jets, leptons etc.)

② Sys tree

EventNumber
EventVariables (jets, leptons etc. but the values are slightly varied compared to ①)

I would like to attach the track variables in ① to tree ②, matched by EventNumber, and snapshot the result into a ROOT file.
The code below is an example that uses multithread mode; it was run on the SWAN service provided by CERN.

import ROOT

filepath = "root://eosuser//eos/atlas/unpledged/group-tokyo/users/ymino/SUSY_Dataset/NTuple/user.silu.Signal_mc16a.500963.e8253_a875_r9364_p5243_v2.6_DT_allSys_reducedJES_rel21_t2_tree.root/user.silu.32222173._000001.tree.root"
fin = ROOT.TFile( filepath )
## Read nominal tree with track vector
tchain = ROOT.TChain()
tchain.Add(filepath + "/" + "tree_NoSys")
## Read systematic tree with no track vector
tchain_sys = ROOT.TChain()
tchain_sys.Add(filepath + "/" + "tree_JET_Flavor_Response__1up")
## Build index using EventNumber branch common between tchain & tchain_sys
tchain_sys.BuildIndex("EventNumber")
tchain.AddFriend( tchain_sys, "sys")

ROOT.gInterpreter.GenerateDictionary("ROOT::VecOps::RVec<TString>","ROOT/RVec.hxx")

ROOT.ROOT.EnableImplicitMT()
rdf = ROOT.RDataFrame( tchain )
rdf = rdf.Define("sys_EventNumber","sys.EventNumber")
## Pass the columns as a list (a Python set has no defined order)
rdf.Snapshot("test","test.root",["EventNumber","sys_EventNumber","trkD0"])

When I scan the output, the EventNumbers do not appear to be matched correctly: different events are paired in the ROOT file, as shown below.

root [5] test->Scan("EventNumber:sys_EventNumber:trkD0[0]","","colsize=15 col=::20.3")
***********************************************************************
*    Row   *     EventNumber * sys_EventNumber *             trkD0[0] *
***********************************************************************
*        0 *            1929 *            1929 *              -0.0592 *
*        1 *            1621 *            1356 *              -0.0398 * <------- Events shifted from this row
*        2 *            1356 *            1797 *               -0.564 *
*        3 *            1797 *            1217 *               -0.119 *
*        4 *            1217 *             399 *                 0.14 *
*        5 *             399 *            1554 *                0.109 *
*        6 *            1554 *             544 *               -0.148 *
*        7 *             544 *             402 *               0.0267 *
*        8 *             402 *             941 *               0.0345 *
*        9 *             941 *            1178 *               0.0425 *
*       10 *            1178 *             226 *               0.0377 *

In single-thread mode, I get the results I want.

root [1] test->Scan("EventNumber:sys_EventNumber:trkD0[0]","","colsize=15 col=::20.3")
***********************************************************************
*    Row   *     EventNumber * sys_EventNumber *             trkD0[0] *
***********************************************************************
*        0 *            1929 *            1929 *              -0.0592 *
*        1 *            1621 *            1929 *              -0.0398 * <------- Previous event is filled from the sys tree (But this is OK for me.)
*        2 *            1356 *            1356 *               -0.564 *
*        3 *            1797 *            1797 *               -0.119 *
*        4 *            1217 *            1217 *                 0.14 *
*        5 *             399 *             399 *                0.109 *
*        6 *            1554 *            1554 *               -0.148 *
*        7 *             544 *             544 *               0.0267 *
*        8 *             402 *             402 *               0.0345 *
*        9 *             941 *             941 *               0.0425 *
*       10 *            1178 *            1178 *               0.0377 *
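
For reference, both Scan outputs above can be reproduced with a small pure-Python model. This is only a sketch: the contents of the systematic tree are inferred from the printed rows (an assumption), and the two pairing functions are hypothetical helpers, not a description of what ROOT does internally. Pairing by entry position gives the shifted multithread rows, while a lookup by EventNumber that keeps the previously loaded entry on a miss gives the single-thread rows.

```python
# Event numbers copied from the Scan outputs above. The systematic tree's
# contents are inferred from those outputs (an assumption): it lacks 1621.
nominal = [1929, 1621, 1356, 1797, 1217, 399, 1554, 544, 402, 941, 1178]
sys_tree = [1929, 1356, 1797, 1217, 399, 1554, 544, 402, 941, 1178, 226]


def pair_by_position(main, friend):
    """Pair entry i of the main tree with entry i of the friend tree."""
    return list(zip(main, friend))


def pair_by_index(main, friend):
    """Pair each main entry with the friend entry of the same EventNumber.

    On a miss the previously loaded friend entry is kept, which mirrors
    row 1 of the single-thread output (a modelling assumption).
    """
    index = {event: entry for entry, event in enumerate(friend)}
    pairs, loaded = [], None
    for event in main:
        entry = index.get(event)
        if entry is not None:
            loaded = friend[entry]
        pairs.append((event, loaded))
    return pairs


positional = pair_by_position(nominal, sys_tree)
indexed = pair_by_index(nominal, sys_tree)

# Print row number, positional pair, and indexed pair side by side.
for row, (pos, idx) in enumerate(zip(positional, indexed)):
    print(row, pos, idx)
```

In this model only the index-based pairing keeps EventNumber and sys_EventNumber equal for every event that exists in both trees.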

Is there a way to read friend trees that have an index using RDataFrame in multithread mode or on Spark clusters? (Or is this only achievable in single-thread mode?)

ROOT Version: JupyROOT 6.26/08
Platform: CentOS 7
Compiler: gcc11

eguiraud · February 8, 2023, 9:51am

Hi @ymino ,

and welcome to the ROOT forum!

Indexed friend trees (i.e. friend trees that make use of BuildIndex) are not supported in distributed mode, but they are supported in multi-thread mode – so what you are seeing looks like a bug.

Can you please share the inputs with me (even privately) so I can debug what’s happening?

Many thanks,
Enrico

ymino · February 8, 2023, 12:52pm

Hi @eguiraud,

Thank you for checking this issue; I am glad to hear that it should work correctly in multi-thread mode. The inputs are too large to upload here, so I have added a link to the input files.
https://cernbox.cern.ch/s/SDdHcczHNxicvQn

eguiraud · February 8, 2023, 6:47pm

Hi @ymino ,

it turns out that using indexed friends with multi-threading activated is completely broken: see issue #12260 in root-project/root on GitHub, "[DF] Bogus data read from indexed friend trees in multi-thread runs". I’ll provide a patch as soon as possible (I think tomorrow).

Thank you for the report, this is a nasty bug!
Cheers,
Enrico

ymino · February 8, 2023, 10:54pm

Hi @eguiraud,

Thank you for the quick action. I’m looking forward to seeing the patched version.

Cheers,
Yuya

eguiraud · February 9, 2023, 2:05pm

Hi @ymino ,

this patch should fix the problem with RDF MT, while this other one introduces an explicit error if indexed friends are used with distributed RDF.
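
As a rough illustration of the "explicit error" idea, a user script could apply the same fail-fast check before submitting a distributed job. This is a minimal sketch in plain Python: `FriendSpec` and `check_friends` are hypothetical names made up for this example and are not part of ROOT or of the patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FriendSpec:
    """Hypothetical description of a friend tree, for illustration only."""
    tree_name: str
    index_branch: Optional[str] = None  # e.g. "EventNumber" after BuildIndex


def check_friends(friends, distributed):
    """Raise early instead of silently producing mismatched rows."""
    if not distributed:
        return
    indexed = [f.tree_name for f in friends if f.index_branch is not None]
    if indexed:
        raise ValueError(
            "indexed friend trees are not supported in distributed mode: "
            + ", ".join(indexed)
        )
```

Calling `check_friends([FriendSpec("tree_JET_Flavor_Response__1up", "EventNumber")], distributed=True)` raises a `ValueError`, while the same call with `distributed=False` returns without complaint.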

They will both be included in the next patch release, v6.28.02.

Cheers,
Enrico

ymino · February 9, 2023, 3:42pm

Hi @eguiraud,

Thank you for the bug fix and for making the patch.
By the way, do you know when ROOT v6.28.02 will be available with CentOS 7 and gcc11?
When I look at the ROOT versions with lsetup root, v6.26.08 is the latest one I see. For example, the Vary functions with distributed RDF are only available from v6.28, so I cannot use them currently.
Sorry if this is outside your area, but I would like to know when the patched version will be available with CentOS 7 and gcc11.

eguiraud · February 9, 2023, 7:20pm

6.28.00 has already been released, and 6.28.02 will probably be released in a few weeks (also because of this bad bug that we need to patch).

About lsetup support, you should ask the lsetup maintainers.

Cheers,
Enrico

eguiraud · February 20, 2023, 8:57am

This is now fixed in the master branch (future 6.30 release) and the v6-28-00-patches branch (future v6.28.02).
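
Until the patched release is installed everywhere, a script could check the version string before enabling multi-threading with indexed friends. The sketch below is plain Python under a few assumptions: the helper names are hypothetical, the threshold is the v6.28.02 mentioned in this thread, and it assumes the fix was not backported to older release series.

```python
import re

FIXED_IN = (6, 28, 2)  # first patch release with the fix, per this thread


def parse_root_version(text):
    """Turn '6.26/08' or '6.28.02' into a tuple such as (6, 26, 8)."""
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 3:
        raise ValueError(f"unrecognised ROOT version string: {text!r}")
    return tuple(int(n) for n in numbers[:3])


def indexed_friends_safe_with_mt(version_text):
    """True if this version is expected to contain the multi-thread fix."""
    return parse_root_version(version_text) >= FIXED_IN


# Versions that appear in this thread.
for version in ("6.26/08", "6.28.00", "6.28.02"):
    print(version, indexed_friends_safe_with_mt(version))
```

In a PyROOT session the string could come from `ROOT.gROOT.GetVersion()`; when the check returns False, the script can simply skip the call to `ROOT.ROOT.EnableImplicitMT()` and run single-threaded.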

Cheers,
Enrico

ymino · February 20, 2023, 9:30pm

Hi @eguiraud,

Great to hear that this patch will be available very soon.
I’ll try to see whether the new version will fulfill my requirements as soon as it is released.
Thank you for the quick response and updates.

Best regards,
Yuya MINO
