Setting kEntriesReshuffled when using RDatasetSpec


Hi all,

I’m trying to use RDatasetSpec to create an RDataFrame from a main tree and a friend tree in PyROOT. Since I’m passing the RDF between different Python classes/functions, this seems to be the only way to do this safely w.r.t. the Python garbage collector (please correct me if I’m wrong!).

When I do this I get the following error:

Error in <AddFriend>: Tree 'DecayTree' has the kEntriesReshuffled bit set and cannot have friends nor can be added as a friend unless the main tree has a TTreeIndex on the friend tree 'FriendTree'. You can also unset the bit manually if you know what you are doing; note that you risk associating wrong TTree entries of the friend with those of the main TTree!

If I were loading these with TTree/TChain I would just reset the bit, as in this case it is safe to do so (the events in the friend line up with those of the main tree). With RDatasetSpec and RSample, however, it’s not clear whether there is a way to do this.

My current configuration is effectively the same as in the examples in the documentation:

import ROOT as R

path = "file.root"
tree = "DecayTree"

friend_path = "friend_file.root"
friend_tree = "FriendTree"

sample = R.RDF.Experimental.RSample(tree, tree, path)
dataspec = R.RDF.Experimental.RDatasetSpec()
dataspec.AddSample(sample)
dataspec.WithGlobalFriends(friend_tree, friend_path)

ROOT Version: 6.30/04
Platform: Linux
Compiler: Not Provided


Hello Jamie,

I’m not following. Where do you suspect the problems? Usually, if you want to ensure that a C++ object built in Python survives, you can keep a reference around somewhere.

As for suppressing the error, RDatasetSpec probably doesn’t have the support for unsetting that bit. If we solve the above, it might not be necessary to investigate whether this needs to be added, though, so let’s do one iteration of checking whether we can win the race against the garbage collector.
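To illustrate the point about keeping a reference around, here is a minimal pure-Python sketch that needs no ROOT installation. `Resource` and `View` are hypothetical stand-ins (for something like a TFile and an object that reads from it); the sketch only shows how a local object disappears when a function returns unless something still refers to it, and it assumes CPython's reference counting.

```python
import weakref


class Resource:
    """Hypothetical stand-in for a TFile-like object that a view depends on."""


class View:
    """Hypothetical stand-in for an object that reads from a Resource
    without holding a Python reference to it."""


def make_view_without_reference():
    res = Resource()
    probe = weakref.ref(res)  # lets us observe whether res is still alive
    return View(), probe      # res is released when the function returns


def make_view_with_reference():
    res = Resource()
    probe = weakref.ref(res)
    view = View()
    view._res = res           # the view now keeps the resource alive
    return view, probe


view_a, probe_a = make_view_without_reference()
view_b, probe_b = make_view_with_reference()

print("resource gone without reference:", probe_a() is None)
print("resource alive with reference:  ", probe_b() is not None)
```

On CPython both lines should print `True`: the first resource is collected as soon as the function returns, while the second lives as long as `view_b` does.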

Hi Stephan,

This is all within a Python module with a submodule io containing the function configure_rdf, which, simplified to a minimal example, looks like this:

import ROOT as R

def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    _tree.ResetBit(R.TTree.kEntriesReshuffled)

    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _tree.AddFriend(_friend_tree)

    rdf = R.RDataFrame(_tree)
    for cut in cuts: # cuts is a list of any length including 0
        rdf = rdf.Filter(cut) 

    return rdf 

Then in another submodule I define a class which uses this function

from .io import configure_rdf

class example:
    def __init__(self, path, tree_name, friend_path, friend_tree_name, cuts, *args, **kwargs):
        # ... 
        self.rdf = configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts)

        dataset = self.rdf.AsNumpy(["branch"]) # Seems to crash here, also for other methods such as `GetColumnNames()`
        # ...

Everything is definitely in scope just before the return of the function, and it seems that something goes out of scope afterwards.

I also tried attaching the trees to rdf as attributes rdf._tree and rdf._friend_tree but this didn’t solve things.

This is what I would have recommended. Can we have another look as to why the garbage collector seems to have intervened?

Hi Stephan,

I’m trying the following (I added the .SetDirectory(0) and this has at least prevented the segfaults):

def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    _tree.SetDirectory(0)
    _tree.ResetBit(R.TTree.kEntriesReshuffled)

    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _friend_tree.SetDirectory(0)
    _tree.AddFriend(_friend_tree)

    rdf = R.RDataFrame(_tree)
    rdf._tree = _tree
    rdf._friend_tree = _friend_tree

    for cut in cuts: # cuts is a list of any length including 0
        rdf = rdf.Filter(cut) 

    return rdf 

I now hit a new error when I try to run any operation (a simple .Count() for example):

    self.rdf.Count().GetValue()
cppyy.gbl.std.out_of_range: const ULong64_t& ROOT::RDF::RResultPtr<ULong64_t>::GetValue() =>
    out_of_range: RDataFrame: Filter could not retrieve value for column 'B_DTF_Jpsi_MASS' for entry 0. You can use the DefaultValueFor operation to provide a default value, or FilterAvailable/FilterMissing to discard/keep entries with missing values instead.

This is particularly odd, as I print the columns in rdf immediately before and B_DTF_Jpsi_MASS is definitely listed. For additional info, I am running with ROOT.DisableImplicitMT() called explicitly at the start of my script using the example class.

Ah I may have solved this now, the .SetDirectory(0) seemed to cause the issue above. I now have the following, which persists _file and _friend_file

def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    _tree.ResetBit(R.TTree.kEntriesReshuffled)

    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _tree.AddFriend(_friend_tree)

    rdf = R.RDataFrame(_tree)
    for cut in cuts: # cuts is a list of any length including 0
        rdf = rdf.Filter(cut)

    # Attach the references after filtering, so that they sit on the object that is returned
    rdf._file = _file
    rdf._tree = _tree
    rdf._friend_file = _friend_file
    rdf._friend_tree = _friend_tree

    return rdf

Importantly, these then also need to be assigned as attributes of the example instance, as later modifications of rdf with Filter and Define drop these attributes at some point:

class example:
    def __init__(self, path, tree_name, friend_path, friend_tree_name, cuts, *args, **kwargs):
        # ... 
        self.rdf = configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts)
        self._file = self.rdf._file
        self._tree = self.rdf._tree
        self._friend_file = self.rdf._friend_file
        self._friend_tree = self.rdf._friend_tree

        dataset = self.rdf.AsNumpy(["branch"]) # Previously crashed here, now works
        # ...

Which now works as everything is in scope :tada:

Calling .SetDirectory(0) means that the TTree can no longer find its data … (i.e. it is now ‘useless’).

    return rdf 

As you noted, at least the TFile lifetime needs to be extended/ensured to be the same as the rdf.
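One way to tie these lifetimes together is to hand back the data frame and the objects it reads from in a single container, instead of attaching attributes to each new node. The following is only a sketch under assumptions: `RDFBundle` and `open_with_friend` are hypothetical names (not part of ROOT), the helper mirrors the `configure_rdf` function from this thread, and ROOT is imported lazily so that the container itself can be used without it.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class RDFBundle:
    """Hypothetical container: holds a data frame node together with
    the files and trees that must stay alive while it is in use."""
    rdf: Any
    keepalive: List[Any] = field(default_factory=list)

    def filtered(self, cut: str) -> "RDFBundle":
        # Filter returns a new node, so wrap it with the same keep-alive list.
        return RDFBundle(self.rdf.Filter(cut), self.keepalive)


def open_with_friend(path, tree_name, friend_path, friend_tree_name, cuts=()):
    import ROOT as R  # lazy import; requires a ROOT installation

    main_file = R.TFile.Open(path)
    tree = main_file.Get(tree_name)
    # Only safe if the friend entries are known to line up with the main tree.
    tree.ResetBit(R.TTree.kEntriesReshuffled)

    friend_file = R.TFile.Open(friend_path)
    friend = friend_file.Get(friend_tree_name)
    tree.AddFriend(friend)

    bundle = RDFBundle(R.RDataFrame(tree), [main_file, tree, friend_file, friend])
    for cut in cuts:
        bundle = bundle.filtered(cut)
    return bundle
```

A caller would then store the whole bundle (for example `self.bundle = open_with_friend(...)`) and run operations through `self.bundle.rdf`, so the files and trees live exactly as long as the bundle does.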
