I’m trying to use RDatasetSpec to create an RDataFrame from a main tree and a friend tree in PyROOT. Since I’m passing the RDF between different Python classes and functions, this seems to be the only way to do it safely with respect to the Python garbage collector (please correct me if I’m wrong!).
When I do this I get the following error:
Error in <AddFriend>: Tree 'DecayTree' has the kEntriesReshuffled bit set and cannot have friends nor can be added as a friend unless the main tree has a TTreeIndex on the friend tree 'FriendTree'. You can also unset the bit manually if you know what you are doing; note that you risk associating wrong TTree entries of the friend with those of the main TTree!
If I were loading these with TTree/TChain I would just reset the bit, since in this case it is safe to do so (the events in the friend line up with those of the main tree). With RDatasetSpec and RSample, however, it’s not clear whether there is a way to do this.
My current configuration is effectively the same as in the examples in the documentation:
import ROOT as R
path = "file.root"
tree = "DecayTree"
friend_path = "friend_file.root"
friend_tree = "FriendTree"
sample = R.RDF.Experimental.RSample(tree, tree, path)
dataspec = R.RDF.Experimental.RDatasetSpec()
dataspec.AddSample(sample)
dataspec.WithGlobalFriends(friend_tree, friend_path)
ROOT Version: 6.30/04 Platform: Linux Compiler: Not Provided
I’m not following. Where do you suspect the problems? Usually, if you want to ensure that a C++ object built in Python survives, you can keep a reference around somewhere.
As for suppressing the error, RDatasetSpec probably doesn’t have the support for unsetting that bit. If we solve the above, it might not be necessary to investigate whether this needs to be added, though, so let’s do one iteration of checking whether we can win the race against the garbage collector.
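To make the "keep a reference around" suggestion concrete, here is a minimal pure-Python sketch. `Resource`, `make_view_unsafe` and `make_view_safe` are hypothetical stand-ins, not ROOT classes or functions; the sketch only illustrates Python-side object lifetime (as in CPython) and says nothing about what ROOT does internally.

```python
import gc
import weakref


class Resource:
    """Hypothetical stand-in for an owner object such as an open file."""

    def __init__(self, name):
        self.name = name


def make_view_unsafe():
    # The owner is only a local variable: once the function returns,
    # nothing refers to it any more and Python is free to destroy it.
    owner = Resource("file.root")
    return weakref.ref(owner)


def make_view_safe(keep_alive):
    # Same as above, but the owner is stored in a caller-provided list,
    # so a strong reference survives the function call.
    owner = Resource("file.root")
    keep_alive.append(owner)
    return weakref.ref(owner)


dangling = make_view_unsafe()
gc.collect()

owners = []
alive = make_view_safe(owners)
gc.collect()

print(dangling() is None)   # whether the unreferenced owner was collected
print(alive() is not None)  # whether the owner kept in `owners` survived
```

The weak references play the role of anything that depends on the owner without keeping it alive; whichever container holds the strong references (a list, a class attribute, a module-level variable) has to live at least as long as the dependent object.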
This is all within a Python module with a submodule io, which contains the function configure_rdf. Simplified to a minimal example, it looks like this:
import ROOT as R
def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    _tree.ResetBit(R.TTree.kEntriesReshuffled)
    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _tree.AddFriend(_friend_tree)
    rdf = R.RDataFrame(_tree)
    for cut in cuts:  # cuts is a list of any length, including 0
        rdf = rdf.Filter(cut)
    return rdf
Then in another submodule I define a class which uses this function
from .io import configure_rdf
class example:
    def __init__(self, path, tree_name, friend_path, friend_tree_name, cuts, *args, **kwargs):
        # ...
        self.rdf = configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts)
        dataset = self.rdf.AsNumpy(["branch"])  # Seems to crash here, also for other methods such as `GetColumnNames()`
        # ...
Everything is definitely in scope just before the function returns, and it seems that something goes out of scope afterwards.
I also tried attaching the trees to rdf as attributes rdf._tree and rdf._friend_tree but this didn’t solve things.
I’m trying the following (I added the .SetDirectory(0) and this has at least prevented the segfaults):
def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    _tree.SetDirectory(0)
    _tree.ResetBit(R.TTree.kEntriesReshuffled)
    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _friend_tree.SetDirectory(0)
    _tree.AddFriend(_friend_tree)
    rdf = R.RDataFrame(_tree)
    # Attach the trees to the RDataFrame object to keep Python references to them
    rdf._tree = _tree
    rdf._friend_tree = _friend_tree
    for cut in cuts:  # cuts is a list of any length, including 0
        rdf = rdf.Filter(cut)
    return rdf
I am now met with a new error when I try to run any operation (a simple .Count(), for example):
self.rdf.Count().GetValue()
cppyy.gbl.std.out_of_range: const ULong64_t& ROOT::RDF::RResultPtr<ULong64_t>::GetValue() =>
out_of_range: RDataFrame: Filter could not retrieve value for column 'B_DTF_Jpsi_MASS' for entry 0. You can use the DefaultValueFor operation to provide a default value, or FilterAvailable/FilterMissing to discard/keep entries with missing values instead.
This is particularly odd, as I print the columns in rdf immediately before and B_DTF_Jpsi_MASS is definitely listed. For additional info, I am running with ROOT.DisableImplicitMT() called explicitly at the start of my script using the example class.
Ah, I may have solved this now: the .SetDirectory(0) seemed to cause the issue above. I now have the following, which persists _file and _friend_file:
def configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts):
    _file = R.TFile(path)
    _tree = _file.Get(tree_name)
    _tree.ResetBit(R.TTree.kEntriesReshuffled)
    _friend_file = R.TFile(friend_path)
    _friend_tree = _friend_file.Get(friend_tree_name)
    _tree.AddFriend(_friend_tree)
    rdf = R.RDataFrame(_tree)
    for cut in cuts:  # cuts is a list of any length, including 0
        rdf = rdf.Filter(cut)
    # Attach the files and trees to the node that is actually returned,
    # since each Filter call produces a new object without these attributes
    rdf._file = _file
    rdf._tree = _tree
    rdf._friend_file = _friend_file
    rdf._friend_tree = _friend_tree
    return rdf
Importantly, these also need to be assigned as attributes of example, as modifying rdf with Filter and Define drops these attributes at some point:
class example:
    def __init__(self, path, tree_name, friend_path, friend_tree_name, cuts, *args, **kwargs):
        # ...
        self.rdf = configure_rdf(path, tree_name, friend_path, friend_tree_name, cuts)
        # Keep the files and trees alive for as long as this instance exists
        self._file = self.rdf._file
        self._tree = self.rdf._tree
        self._friend_file = self.rdf._friend_file
        self._friend_tree = self.rdf._friend_tree
        dataset = self.rdf.AsNumpy(["branch"])  # Previously crashed here, as did other methods such as `GetColumnNames()`
        # ...
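As an alternative to copying the attributes by hand, one could bundle the node and the objects it depends on in a small wrapper, so the references travel with every derived node. The following is only a sketch under that assumption: `RDFHandle` and `apply_cuts` are hypothetical names, not part of ROOT, and `FakeNode` is a stand-in used here so the example runs without ROOT. The only thing assumed about a real node is that it has a `Filter` method returning a new node.

```python
from dataclasses import dataclass


@dataclass
class RDFHandle:
    """Hypothetical wrapper: a dataframe node plus the objects it depends on."""

    node: object         # e.g. an RDataFrame or a filtered node
    owners: tuple = ()   # e.g. (file, tree, friend_file, friend_tree)

    def filter(self, cut):
        # Filter returns a new node; wrap it with the same owners so the
        # references are never lost along the chain.
        return RDFHandle(self.node.Filter(cut), self.owners)


def apply_cuts(handle, cuts):
    # cuts may be empty, in which case the handle is returned unchanged.
    for cut in cuts:
        handle = handle.filter(cut)
    return handle


class FakeNode:
    """Stand-in for a dataframe node, used only to exercise the wrapper."""

    def __init__(self, cuts=()):
        self.cuts = tuple(cuts)

    def Filter(self, cut):
        return FakeNode(self.cuts + (cut,))


owners = (object(), object())
start = RDFHandle(FakeNode(), owners)
handle = apply_cuts(start, ["x > 0", "y < 1"])
unchanged = apply_cuts(start, [])
```

With this layout, configure_rdf would return the handle instead of the bare node, and the class would store only that one object.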