
RDataFrame shuffles events for TMVA scoring

Dear ROOT experts,

I’m using RDataFrame with multithreading for TMVA scoring with ROOT.computeModel in my analysis, but I found that RDataFrame shuffles the events in the output file it writes.

I’m using two categories of files: the original data files and the BDT scoring output files. My final goal is to combine them with AddFriend. Unfortunately, I discovered that:

  1. AddFriend fails with an error for files processed with RDF. This affects both multithreaded and single-thread modes:
     Error in <AddFriend>: Tree 'microtree' has the kEntriesReshuffled bit set, 
     and cannot be used as friend nor can be added as a friend unless the main 
     tree has a TTreeIndex on the friend tree 'microtree'. You can also unset the bit manually
     if you know what you are doing.
  2. Setting the reshuffled bit of the RDF output file to false doesn’t fix the problem. The events in the output file have already been shuffled by RDF, and I can see that the order of events in the BDT file doesn’t match the order of events in the primary files.

Could you please let me know how we can solve this problem?
Should I use TTreeIndex? If yes, could you please point me to an example with RDF?

Below you can find a short script that reproduces the problem using RDF.

Best regards, Grigorii.


ROOT Version: 6.22/03
Platform: macOS
Compiler: clang

import ROOT
import array

def ftree(a,b):
    filename = 'tree_with_event_range' +'_%s_%s.root'%(a,b)
    f = ROOT.TFile(filename, "recreate")
    tree = ROOT.TTree("tree", "test")
    eventNumber = array.array('i', [0])
    tree.Branch("eventNumber", eventNumber, "eventNumber/I")
    for i in range(a,b):
        eventNumber[0] = i
        tree.Fill()
    tree.Write()
    f.Close()
    return filename

name1000 = ftree(0,1000) 
name1000_1100 = ftree(1000,1100)
name1100_2100 = ftree(1100,2100)

treeChain = ROOT.TChain('tree')
treeChain.Add(name1000)
treeChain.Add(name1000_1100)
treeChain.Add(name1100_2100)

ROOT.ROOT.EnableImplicitMT(3)
rdtest = ROOT.RDataFrame(treeChain)
rdtest = rdtest.Snapshot('tree','rdfile.root','eventNumber')
ROOT.ROOT.DisableImplicitMT()

treeRDF = ROOT.TChain('tree')
treeRDF.Add('rdfile.root')

expEventNumber = 0
for event in treeRDF:
    if expEventNumber != event.eventNumber:
        print('Error: Expected event number is %s, but we get %s'%(expEventNumber, event.eventNumber))
    expEventNumber = expEventNumber+1

Hi @Grigorii_Tolkachev,
and welcome to the ROOT forum!

What happens is that when you read data with a multi-thread RDataFrame program, events are processed in a different order than the order they had on file, because of the concurrent nature of multi-thread event loops. As a consequence, if you use Snapshot in a multi-thread RDF event loop, Snapshot will write events out in a different order than the input TTree.
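
To illustrate the effect, here is a toy sketch in plain Python. It is not how ROOT schedules or writes entries: the function name, the chunking, and the artificial sleeps are all assumptions made for the illustration. It only shows that when contiguous chunks of entries are processed concurrently and written as they finish, the output order depends on completion order rather than on input order.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def write_chunks_concurrently(n_events, chunk_size, n_workers=3):
    """Toy model (hypothetical helper, not ROOT code): each task handles one
    contiguous chunk of entries and appends it to a shared output list."""
    output = []
    lock = threading.Lock()
    chunks = [range(start, min(start + chunk_size, n_events))
              for start in range(0, n_events, chunk_size)]

    def process(index, chunk):
        # Earlier chunks sleep longer, so later chunks tend to finish first.
        time.sleep(0.01 * (len(chunks) - index))
        with lock:
            # Entries are written in completion order, not input order.
            output.extend(chunk)

    # Leaving the "with" block waits for all submitted tasks to finish.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for index, chunk in enumerate(chunks):
            pool.submit(process, index, chunk)
    return output


written = write_chunks_concurrently(n_events=9, chunk_size=3)
print(written)  # all nine entries are present, usually not in input order
```

No entry is lost or duplicated, but a reader that assumes entry i of the output corresponds to entry i of the input would pair the wrong events.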

To prevent silent reads of wrong pairs of events in friend TTrees, the output file of a Snapshot will have a special bit set that says “my events are shuffled with respect to the original TTree I was produced from”. TTrees with this bit set cannot be used as friends of other TTrees unless you manually set this bit to false (you need to set the bit to false when reading back the shuffled TTree). Note that setting the bit to false removes this protection, but does not put the events back in the original order!

A possible workaround is to run the code that writes the new TTree without EnableImplicitMT (at the cost of performance, of course).
Another workaround is to store the BDT scores together with the corresponding entry number in the input TTree (in a TTree or in a different format), and then pre-load the BDT scores into an array that you can index during processing.
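
A minimal sketch of the pre-loading idea, in plain Python so it runs without ROOT. The helper name align_scores and the toy lists are hypothetical. In a real analysis the key and score columns would be read from the files first (for example with RDataFrame's AsNumpy), and the key must identify each event uniquely, like eventNumber in the reproducer above.

```python
def align_scores(main_keys, score_keys, scores):
    """Reorder `scores` so that element i belongs to main_keys[i].

    Hypothetical helper: `score_keys[j]` is the key (e.g. eventNumber or
    input entry number) stored next to `scores[j]` in the shuffled file.
    """
    if len(score_keys) != len(scores):
        raise ValueError("score_keys and scores must have the same length")
    lookup = dict(zip(score_keys, scores))
    if len(lookup) != len(scores):
        raise ValueError("keys are not unique, cannot align")
    # A missing key raises KeyError instead of silently pairing wrong events.
    return [lookup[key] for key in main_keys]


# Toy data: the score file was written in a different order than the input.
main_event_numbers = [0, 1, 2, 3, 4]
shuffled_event_numbers = [3, 4, 0, 1, 2]
shuffled_scores = [0.3, 0.4, 0.0, 0.1, 0.2]

aligned = align_scores(main_event_numbers, shuffled_event_numbers, shuffled_scores)
print(aligned)  # [0.0, 0.1, 0.2, 0.3, 0.4]
```

The aligned list can then be indexed by position while looping over the main TTree, so the order in which the scores were written no longer matters.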

TTreeIndex could also resolve the issue (with a performance penalty) but unfortunately RDataFrame currently does not support TTreeIndex, see https://sft.its.cern.ch/jira/browse/ROOT-9559 and the issue that blocks it, https://sft.its.cern.ch/jira/browse/ROOT-10824 .

Cheers,
Enrico

Hi,

I have the same issue and would like to add weights (not from TMVA but from RooStats).
Regarding the suggested workaround of storing the index in a TTree and processing that: which index would this be, rdfentry_ or some other variable?
My other question is whether we could use a Foreach loop to process the weights TTree with RDataFrame. If so, could you provide a brief code snippet?

Thanks

Hi @jcob,
and welcome to the ROOT forum.

Yes, rdfentry_ would work as an entry index.

In C++, yes (Foreach is not available from Python): you can do anything with a Foreach. I am not sure what kind of code snippet you are asking for. There is a simple usage example in the docs, and another in this tutorial.

Please create a new topic describing your specific use case if you need further help!
