Setting kEntriesReshuffled when using RDatasetSpec


Hi all,

I’m trying to use RDatasetSpec to create an RDataFrame from a main tree and a friend tree in PyROOT. Since I’m passing the RDF between different Python classes and functions, this seems to be the only way to do this safely with respect to the Python garbage collector (please correct me if I’m wrong!).

When I do this I get the following error:

Error in <AddFriend>: Tree 'DecayTree' has the kEntriesReshuffled bit set and cannot have friends nor can be added as a friend unless the main tree has a TTreeIndex on the friend tree 'FriendTree'. You can also unset the bit manually if you know what you are doing; note that you risk associating wrong TTree entries of the friend with those of the main TTree!

If I were loading these with TTree/TChain, I would just reset the bit, since in this case it is safe to do so (the events in the friend line up with those of the main tree). With RDatasetSpec and RSample, however, it’s not clear whether there is a way to do this.

My current configuration is effectively the same as in the examples in the documentation:

import ROOT as R

path = "file.root"
tree = "DecayTree"

friend_path = "friend_file.root"
friend_tree = "FriendTree"

sample = R.RDF.Experimental.RSample(tree, tree, path)
dataspec = R.RDF.Experimental.RDatasetSpec()
dataspec.AddSample(sample)
dataspec.WithGlobalFriends(friend_tree, friend_path)

ROOT Version: 6.30/04
Platform: Linux
Compiler: Not Provided


Hello Jamie,

I’m not following. Where do you suspect the problems? Usually, if you want to ensure that a C++ object built in Python survives, you can keep a reference around somewhere.
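To make "keeping a reference around" concrete, here is a small pure-Python sketch. `Resource` is a hypothetical stand-in for any object created inside a function (it is not a ROOT class), and the immediate clean-up it shows relies on CPython's reference counting.

```python
import weakref


class Resource:
    """Stand-in for an object whose lifetime matters (hypothetical, not a ROOT class)."""

    def __init__(self, name):
        self.name = name


def make_without_keeping():
    # The only strong reference is the local variable, so the object is
    # released as soon as the function returns.
    res = Resource("file")
    return weakref.ref(res)


def make_and_keep(keep_alive):
    # Storing the object in a container owned by the caller keeps it alive.
    res = Resource("file")
    keep_alive.append(res)
    return weakref.ref(res)


lost = make_without_keeping()
holder = []
kept = make_and_keep(holder)

print(lost() is None)  # True on CPython: nothing references the object any more
print(kept() is None)  # False: `holder` still references the object
```

The same idea carries over to objects created through PyROOT: whatever has to outlive the function needs a reference stored somewhere the caller controls.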

As for suppressing the error, RDatasetSpec probably doesn’t have the support for unsetting that bit. If we solve the above, it might not be necessary to investigate whether this needs to be added, though, so let’s do one iteration of checking whether we can win the race against the garbage collector.

Hi Stephan,

This is all within a Python module that has a submodule io containing the function configure_rdf. Simplified to a minimal example, it looks like this:

import ROOT as R

def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    # Safe in this case: the friend entries line up with those of the main tree
    _tree.ResetBit(R.TTree.kEntriesReshuffled)

    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _tree.AddFriend(_friend_tree)

    rdf = R.RDataFrame(_tree)
    for cut in cuts:  # cuts is a list of any length, including 0
        rdf = rdf.Filter(cut)

    return rdf

Then in another submodule I define a class which uses this function

from .io import configure_rdf

class example:
    def __init__(self, path, tree_name, friend_path, friend_tree_name, cuts, *args, **kwargs):
        # ... 
        self.rdf = configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts)

        dataset = self.rdf.AsNumpy(["branch"])  # Seems to crash here, also for other methods such as `GetColumnNames()`
        # ...

Everything is definitely in scope just before the function returns, and it seems that something goes out of scope afterwards.

I also tried attaching the trees to rdf as attributes rdf._tree and rdf._friend_tree but this didn’t solve things.

This is what I would have recommended. Can we have another look as to why the garbage collector seems to have intervened?
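One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: in configure_rdf the TFile objects are local variables too, and a TTree read from a file normally belongs to that file, so holding on to the trees alone may not be enough. Below is a minimal sketch that returns the files and trees together with the data frame. The holder class `RDFHandle` and the local variable names are hypothetical and not part of ROOT.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class RDFHandle:
    """Hypothetical holder: the data frame plus every object it reads from."""
    rdf: Any
    keep_alive: List[Any] = field(default_factory=list)


def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    # Imported here so the holder above can be used without ROOT installed.
    import ROOT as R

    main_file = R.TFile.Open(path)
    main_tree = main_file.Get(tree_name)
    # Only safe when the friend entries are known to line up with the main tree.
    main_tree.ResetBit(R.TTree.kEntriesReshuffled)

    friend_file = R.TFile.Open(friend_path)
    friend_tree = friend_file.Get(friend_tree_name)
    main_tree.AddFriend(friend_tree)

    rdf = R.RDataFrame(main_tree)
    for cut in cuts:  # an empty list leaves the data frame unfiltered
        rdf = rdf.Filter(cut)

    # Hand the files and trees back with the data frame, so that the caller
    # decides how long they live instead of the function's local scope.
    return RDFHandle(rdf, [main_file, main_tree, friend_file, friend_tree])
```

The class would then store the whole handle, for example `self.handle = configure_rdf(...)`, and call `self.handle.rdf.AsNumpy(["branch"])` on it.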