RDataFrame Python string formatting

Dear All,

I came across a curious issue just now. :confused: While running some tests at the interactive Python prompt, I tried to execute the following:

Python 2.7.15 (default, Jan  6 2020, 03:17:10) 
[GCC 8.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> from xAODDataSource.Helpers import MakexAODDataFrame
>>> ROOT.xAOD.Init().ignore()
xAOD::Init                INFO    Environment initialised for data access
>>> df = MakexAODDataFrame( '${ASG_TEST_FILE_DATA}' )
>>> df2 = df.Define( 'run_number', 'return EventInfo.runNumber();' )
input_line_309:2:8: error: expected expression
return return EventInfo.runNumber();
       ^
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, basic_string_view<char,char_traits<char> > expression) =>
    Cannot interpret the following expression:
return EventInfo.runNumber();

Make sure it is valid C++. (C++ exception of type runtime_error)
>>>

This confused me a lot, so I wrote it up as a small Python script, and found that the following works just fine:

import ROOT
from xAODDataSource.Helpers import MakexAODDataFrame

ROOT.xAOD.Init().ignore()

df = MakexAODDataFrame( '${ASG_TEST_FILE_DATA}' )
df2 = df.Define( 'run_number',
                 '''
                 return EventInfo.runNumber();
                 ''' )
hist = df2.Histo1D( 'run_number' )

hist.Print()

Singularity> python test.py
xAOD::Init                INFO    Environment initialised for data access
TH1.Print Name  = run_number, Entries= 2157, Total sum= 2157
Singularity>

But the single-line form that I tried at the Python prompt doesn’t work in a script either.

import ROOT
from xAODDataSource.Helpers import MakexAODDataFrame

ROOT.xAOD.Init().ignore()

df = MakexAODDataFrame( '${ASG_TEST_FILE_DATA}' )
df2 = df.Define( 'run_number',
                 'return EventInfo.runNumber();' )
hist = df2.Histo1D( 'run_number' )

hist.Print()

Singularity> python test.py
xAOD::Init                INFO    Environment initialised for data access
input_line_308:2:8: error: expected expression
return return EventInfo.runNumber();
       ^
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    'return EventInfo.runNumber();' )
Exception: ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, basic_string_view<char,char_traits<char> > expression) =>
    Cannot interpret the following expression:
return EventInfo.runNumber();

Make sure it is valid C++. (C++ exception of type runtime_error)
xAOD::TFileAccessTracer   INFO    Sending file access statistics to http://rucio-lb-prod.cern.ch:18762/traces/
Singularity>

Is this by design? Should one really not be able to use single-line short expressions like that in the Define(...) function? :confused:

Cheers,
Attila



ROOT Version: 6.16/00
Platform: x86_64-centos7-gcc8-opt
Compiler: GCC 8


:frowning:

Never mind… Apparently I should start reading the error messages more carefully…

The following single-line expression does work:

df2 = df.Define( 'run_number', 'EventInfo.runNumber();' )

I guess I understand where the confusion comes from, but it’s still a bit of a hidden feature that single-line and multi-line strings would be handled differently like this…

Cheers,
Attila
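For anyone stuck on an affected ROOT version, the workaround above can be applied mechanically. The following is only a minimal sketch: `bare_expression` is a hypothetical helper (not part of ROOT or xAODDataSource), and it assumes the expression is a single statement that may start with `return`.

```python
import re

# Matches optional leading whitespace followed by the keyword "return".
# The \b keeps identifiers such as "returnValue" from being touched.
_LEADING_RETURN = re.compile(r"^\s*return\b\s*")


def bare_expression(expression):
    """Drop a leading 'return' so that only the bare C++ expression remains.

    Hypothetical helper, meant for single-statement expressions only.
    """
    return _LEADING_RETURN.sub("", expression, count=1).strip()


# Intended use (needs ROOT, so it is only shown as a comment here):
#   df2 = df.Define('run_number',
#                   bare_expression('return EventInfo.runNumber();'))
print(bare_expression("return EventInfo.runNumber();"))
```

The single-line and the triple-quoted multi-line spelling both end up as the same bare expression, which is the form that worked in the post above.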

Hi Attila,
that behavior is not intentional. Ideally your first snippet would work just fine.
It’s probably a limitation of the regex we use to check whether there is a return in the string or not, see https://github.com/root-project/root/blob/e192a589a46cf97721b20652363c73b817bfa19a/tree/dataframe/src/RDFInterfaceUtils.cxx#L668-L670

PRs are welcome :smile: I’m currently not at CERN.

Cheers,
Enrico
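To illustrate the kind of check mentioned above, here is a rough Python sketch of "add a return unless the string already has one". It is not ROOT's code: the function names are made up for this example, and the real check is the C++ code behind the link.

```python
import re

# "return" as a whole word; identifiers like "returnValue" do not count.
_RETURN_KEYWORD = re.compile(r"\breturn\b")


def has_return_statement(expression):
    """True if the expression contains the keyword 'return'."""
    return _RETURN_KEYWORD.search(expression) is not None


def to_function_body(expression):
    """Prepend 'return ' only when the expression lacks one (hypothetical)."""
    if has_return_statement(expression):
        return expression
    return "return " + expression


for expr in ("EventInfo.runNumber();", "return EventInfo.runNumber();"):
    print(to_function_body(expr))
```

When the detection works as intended, both spellings give the same function body. The "return return" in the error message suggests that somewhere the existing return was not recognised for the single-line string.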

Hi Enrico,

I don’t think the TRegexp expression is at fault. I just ran some tests of my own to see how that expression behaves on differently formatted Python strings, and it seems to behave correctly.

I suspect that it’s the ColumnTypesAsString(...) function that messes things up somehow. But I don’t know yet how. :confused:

Cheers,
Attila

Well… Apparently we (in ATLAS) should be more proactive with updating the ROOT version that the ATLAS analysis releases use… :confused:

As I was gearing up to propose a fix for the ROOT master branch, I realised that I couldn’t reproduce the issue on the current master. What I did was to:

  • Write a small ROOT file with a TTree, that has one std::vector<float> variable;
  • Write a script that does:
import ROOT
df = ROOT.ROOT.RDataFrame( 'test', 'testfile.root' )
df2 = df.Define( 'testVectorSize', 'return testVector.size();' )

When I try to use this script with 6.16/00 (the version that we still use with our analysis releases), I get the same failure.

Singularity> lsetup "root 6.16.00-x86_64-centos7-gcc8-opt"
************************************************************************
Requested:  root ... 
 Setting up root 6.16.00-x86_64-centos7-gcc8-opt ... 
  ROOT is from lcgenv -p LCG_95 x86_64-centos7-gcc8-opt ROOT
>>>>>>>>>>>>>>>>>>>>>>>>> Information for user <<<<<<<<<<<<<<<<<<<<<<<<<
 root:
   Tip for _this_ standalone ROOT and grid (ie prun) submission:
    avoid --athenaTag if you do not need athena
    use --rootVer=6.16/00 --cmtConfig=x86_64-centos7-gcc8-opt
************************************************************************
Singularity> python fileProcessor.py
input_line_53:2:8: error: expected expression
return return testVector.size();
       ^
Traceback (most recent call last):
  File "fileProcessor.py", line 9, in <module>
    df2 = df.Define( 'testVectorSize', 'return testVector.size();' )
Exception: ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, basic_string_view<char,char_traits<char> > expression) =>
    Cannot interpret the following expression:
return testVector.size();

Make sure it is valid C++. (C++ exception of type runtime_error)
Singularity>

But as soon as I switch to at least 6.18/00, the problem is gone. So I guess it’s clear, we need to update to 6.18/04… :stuck_out_tongue:

Cheers,
Attila
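For code that has to run on both old and new releases, one option is to look at the ROOT version string before choosing how to spell the expression. This is only a sketch: the function names are hypothetical, and the 6.18/00 threshold is an assumption taken from the observation above (6.16/00 fails, 6.18/00 works), not a statement about which exact release contains the fix.

```python
import re


def parse_root_version(version):
    """Turn a version string such as '6.16/00' into a tuple like (6, 16, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def needs_return_workaround(version):
    """Assumption: releases before 6.18/00 choke on single-line 'return ...'."""
    return parse_root_version(version) < (6, 18, 0)


# In a real script the string would come from ROOT.gROOT.GetVersion();
# literal values are used here so that the example runs without ROOT.
for version in ("6.16/00", "6.18/00", "6.18/04"):
    print(version, needs_return_workaround(version))
```

A script could then pass 'testVector.size();' instead of 'return testVector.size();' to Define(...) whenever needs_return_workaround(...) is true.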

