Modifying an existing column in RDataFrame

apetukho · July 7, 2019, 9:43am

ROOT Version: 6.16.00
Platform: lxplus

Dear ROOT experts,

I’ve run into a problem while converting my analysis code to RDataFrame. In my code I have a branch named “weight” in both the input and the output tree, the latter being a modified version of the former:

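from array import array
from ROOT import TTree

# inputTree, treeName and datasetWeight are defined elsewhere in the analysis code
outputTree = TTree(treeName, '')
weight = array('d', [0])  # one-element buffer that the branch reads from
outputTree.Branch("weight", weight, "weight/D")

for event in inputTree:
    weight[0] = event.weight * datasetWeight
    outputTree.Fill()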

Is there a way to achieve the same result with RDataFrame? I thought of making an intermediate column for the modified weights, deleting the original “weight” column and then copying the intermediate values to a new “weight” column, but I didn’t find a way to delete an existing column.

Thanks in advance.

Pnine · July 7, 2019, 2:59pm

Hi,

I think RDF implements a zero-copy policy. Modifying your weights in place, reading them through an RVec or through a string expression (which uses RVecs internally), could well do the job, even if I dislike the solution and am sure RDF will allow something better soon.

p

eguiraud · July 7, 2019, 3:41pm

Hi,
this is unfortunately not possible; the relevant ticket is ROOT-10165.

apetukho · July 10, 2019, 12:30pm

I’ve found a workaround for changing the name of the branch: make a column “weightModified”, snapshot it, open the file and then use TTree::SetAlias:

tree.SetAlias('weight', 'weightModified')
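
A minimal sketch of this workaround, assuming the output should keep every column except the original “weight”. The function names `columns_without` and `snapshot_with_alias` and their parameters are hypothetical, and `expression` must be a valid C++ string such as 'weight * 100'. The alias is resolved by expression-based interfaces such as TTree::Draw; it does not create a real branch.

```python
def columns_without(all_columns, dropped):
    """Return the column names to snapshot, leaving out those in `dropped`."""
    dropped = set(dropped)
    return [name for name in all_columns if name not in dropped]


def snapshot_with_alias(in_file, tree_name, out_file, expression):
    """Snapshot 'weightModified' instead of 'weight', then alias it to 'weight'."""
    import ROOT  # imported lazily so the helper above is usable without ROOT

    rdf = ROOT.RDataFrame(tree_name, in_file)
    rdf = rdf.Define("weightModified", expression)

    # Keep every column except the original 'weight' (assumes a flat tree).
    names = [str(c) for c in rdf.GetColumnNames()]
    keep = ROOT.std.vector("string")()
    for name in columns_without(names, ["weight"]):
        keep.push_back(name)
    rdf.Snapshot(tree_name, out_file, keep)

    # Reopen the output and store the alias together with the tree.
    out = ROOT.TFile.Open(out_file, "UPDATE")
    tree = out.Get(tree_name)
    tree.SetAlias("weight", "weightModified")
    tree.Write("", ROOT.TObject.kOverwrite)
    out.Close()
```

Leaving the original “weight” out of the snapshot keeps the alias from clashing with a branch of the same name.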

But now the problem is how to use a Python variable to multiply the weight in the ‘Define’ statement. The following code

datasetWeight = 100
rdf.Define('weightModified','weight * datasetWeight')

fails to run and outputs the following error:

input_line_82:2:15: error: use of undeclared identifier 'datasetWeight'
return weight*datasetWeight
              ^
Traceback (most recent call last):
  File "ConvertTreeRDFElab.py", line 105, in <module>
    outputTree = ConvertTree(inputTree, outputFileName, outputTreeName)
  File "ConvertTreeRDFElab.py", line 60, in ConvertTree
    .Define('weightModified', 'weight*datasetWeight')
Exception: ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Define(experimental::basic_string_view<char,char_traits<char> > name, experimental::basic_string_view<char,char_traits<char> > expression) =>
    Cannot interpret the following expression:
weight*datasetWeight

I’ve found these two threads, but it’s still not clear how to adapt their solutions to my problem.
https://root-forum.cern.ch/t/add-new-column-to-rdataframe/
https://root-forum.cern.ch/t/rdataframe-define-column-of-same-constant-value/

How can I use an already defined Python variable to multiply the values in the existing RDF columns?

Thanks in advance.

eguiraud · July 10, 2019, 1:07pm

Hi,
work is in progress towards better integration of RDF with Python. Until then, Define expressions must be valid C++.

In your case, that could be either

datasetWeight = 100
rdf.Define('weightModified','weight * {}'.format(datasetWeight))

or

datasetWeight = 100
rdf.Define('weightModified', 'weight * int(TPython::Eval("datasetWeight"))')

Hope this helps!
Enrico
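
The first suggestion above can be wrapped in a small helper, sketched here under the assumption that the factor is a plain Python int or float. The names `cpp_number`, `scaled` and `global_declaration` are hypothetical and not part of ROOT. The last helper only builds a string that could be passed to `ROOT.gInterpreter.Declare`, which would make the name visible to the C++ interpreter so that the original expression 'weight * datasetWeight' compiles.

```python
import math


def cpp_number(value):
    """Format a Python int or float as a C++ numeric literal."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("expected an int or a float, got {!r}".format(value))
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError("inf and nan have no C++ literal form")
    # repr() keeps full float precision; on Python 2, str() and '{}'
    # round floats to 12 significant digits.
    return repr(value)


def scaled(column, factor):
    """Build a Define expression that multiplies `column` by `factor`."""
    return "{} * ({})".format(column, cpp_number(factor))


def global_declaration(name, value):
    """Build a C++ declaration string for ROOT.gInterpreter.Declare()."""
    return "const double {} = {};".format(name, cpp_number(value))


datasetWeight = 100
expression = scaled("weight", datasetWeight)
# With ROOT available, one of:
#   rdf.Define('weightModified', expression)
#   ROOT.gInterpreter.Declare(global_declaration('datasetWeight', datasetWeight))
#   rdf.Define('weightModified', 'weight * datasetWeight')
```

The parentheses around the literal keep negative factors unambiguous in the generated C++.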
