RDataFrame to read indexed friend trees

atasattari · November 2, 2022, 7:24pm

Hi,
There are two groups of TTrees that I would like to read together. For each TTree in the first group, there is a matching TTree in the second group; however, the matching tree is not sorted and might have missing rows. So I indexed the rows using BuildIndex(), paired the TTrees using AddFriend(), and merged the pairs with a TChain.
I tried to give the chain to RDataFrame but noticed unexpected behavior. When the parent TTree is the one with missing rows, the event loop looks fine. However, when the parent TTree is the bigger one, RDataFrame returns leftover values from memory for the rows that are missing in the smaller TTree. Below is a demonstration for a single TTree pair.

import numpy as np
import pandas as pd
import ROOT

# Define some numpy arrays:
run = np.arange(0,10, dtype = np.intc)
event = np.arange(0,10, dtype = np.intc)
rq = np.arange(100,110, dtype = np.intc)
rq_dict = {'run': run, 'event': event,'rq': rq}
rq_df = pd.DataFrame(rq_dict)
# A TTree from the first group
display(rq_df)

rrq_df = rq_df.sample(frac=0.5,random_state =1).drop('rq',axis = 1)
rrq_df['rrq'] = np.random.randint(1, 1000, rrq_df.shape[0])
# A TTree from the second group(missing rows/shuffled)
display(rrq_df)
rrq_dict = {key:np.array(value,dtype = np.intc) for key, value in rrq_df.to_dict('list').items()}

# Making ROOT files
rrq_df = ROOT.RDF.MakeNumpyDataFrame(rrq_dict)
rrq_df.Snapshot('zip1','rrq.root')
rq_df = ROOT.RDF.MakeNumpyDataFrame(rq_dict)
rq_df.Snapshot('zip1','rq.root')

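# Read the trees back from the files
rq_file = ROOT.TFile.Open('rq.root')
rq_tree = rq_file.Get('zip1')
rrq_file = ROOT.TFile.Open('rrq.root')
rrq_tree = rrq_file.Get('zip1')

rq_tree.BuildIndex('run','event')
rrq_tree.BuildIndex('run','event')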
# Case 1: the bigger TTree is the parent.
rq_tree.AddFriend(rrq_tree)
df = ROOT.RDataFrame(rq_tree)
# Case 2: the smaller TTree is the parent.
rrq_tree.AddFriend(rq_tree)
df = ROOT.RDataFrame(rrq_tree)

Questions:
Is the behavior above expected?
I was hoping to get NaN for the missing rows. Is there a way to achieve that?
I would be happy to hear if there are other ideas on how to deal with such TTrees.
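
For illustration, the result hoped for above corresponds to a left join on (run, event). The following is a minimal pure-Python sketch of that idea, not ROOT's behavior and not a ROOT API: the function name `left_join_on_index` and the sample rows are hypothetical, and the column names follow the demonstration above.

```python
import math


def left_join_on_index(main_rows, friend_rows, keys=("run", "event"), friend_cols=("rrq",)):
    """Left-join friend_rows onto main_rows by `keys`; missing matches become NaN."""
    # Build a lookup table from the (run, event) key to the friend row.
    index = {tuple(row[k] for k in keys): row for row in friend_rows}
    joined = []
    for row in main_rows:
        match = index.get(tuple(row[k] for k in keys))
        out = dict(row)
        for col in friend_cols:
            # Fill NaN instead of reusing stale values when no friend row matches.
            out[col] = match[col] if match is not None else math.nan
        joined.append(out)
    return joined


# Hypothetical sample data: the friend rows are shuffled and incomplete.
rq_rows = [{"run": i, "event": i, "rq": 100 + i} for i in range(4)]
rrq_rows = [{"run": 2, "event": 2, "rrq": 7}, {"run": 0, "event": 0, "rrq": 5}]
joined = left_join_on_index(rq_rows, rrq_rows)
```

On the pandas side of the demonstration, a `merge` of the two data frames on `['run', 'event']` with `how='left'` should give the same kind of result, with NaN in `rrq` for the unmatched rows.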

ROOT Version: 6.26 - PyROOT

eguiraud · November 3, 2022, 10:21am

Hi @atasattari,

if I understand correctly, this seems to be an instance of root-project/root issue #7713, "[Tree] Bogus data silently read when trying to access an indexed friend TTree with an invalid index". Can you confirm?

If so, it is not an RDataFrame issue but something that TTree/TChain have always done, and I agree it is not desirable. The workaround I know of is, as you say, to use the sparser tree as the main tree.
@pcanal please confirm or correct me if I’m wrong.
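
Before relying on the friend's columns, one possible check is whether the indexed friend actually has an entry for a given (run, event). The sketch below assumes the friend tree was indexed with BuildIndex('run','event') and uses `TTree::GetEntryNumberWithIndex`, which returns -1 when the index has no match. The helper `friend_entry_exists` and the stand-in class `FakeIndexedTree` are hypothetical names, added only so the sketch runs without ROOT.

```python
def friend_entry_exists(friend_tree, run, event):
    """Return True if the indexed friend tree has an entry for (run, event)."""
    # GetEntryNumberWithIndex returns -1 when no entry matches the index values.
    return friend_tree.GetEntryNumberWithIndex(run, event) >= 0


class FakeIndexedTree:
    """Hypothetical stand-in for an indexed TTree, so the sketch runs without ROOT."""

    def __init__(self, keys):
        # Map each (major, minor) key to its entry number in the fake tree.
        self._entry_of = {key: i for i, key in enumerate(keys)}

    def GetEntryNumberWithIndex(self, major, minor=0):
        return self._entry_of.get((major, minor), -1)


# The sparse friend holds only a few shuffled (run, event) pairs.
rrq_like = FakeIndexedTree([(2, 2), (0, 0), (7, 7)])
main_keys = [(i, i) for i in range(10)]
valid = [key for key in main_keys if friend_entry_exists(rrq_like, *key)]
```

With a real PyROOT tree, `rrq_tree` would take the place of `rrq_like`; rows of the main tree for which the check fails are the ones where the friend's values should not be trusted.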

Cheers,
Enrico

atasattari · November 4, 2022, 12:51am

Hi Enrico,
It is the same issue. I noticed the ticket is a year old. By any chance, do you know if there is a plan to fix that?
Thanks,
Ata

eguiraud · November 4, 2022, 2:34pm

Yes, the plan is to fix it, hopefully for 6.28. Unfortunately resources are limited, and since TTree has always behaved this way, the issue never gets the highest priority. @pcanal might be able to provide more context or a timeline.
