RDataFrame use of undeclared identifier

Hi,

My Filter statement using a logical string is failing. The variables
used in the string are not recognized although they seem to use
the column names of the dataframe.

root [0] .L mytest2.C
root [1] mytest2()
sum counts in csv: 607
fSYear
fEYear
fCountry
fProduct
fTicket
fZipCode
fOrder
fUserId
fPrice
fCount
select: fCountry=="CH"&&fSYear==2016&&fOrder="b2b"
input_line_13:1:46: error: use of undeclared identifier 'fCountry'
namespace __rdf_0{ auto rdf_f = []() {return fCountry=="CH"&&fSYear==2016&&fOrder="b2b"
                                             ^
input_line_13:1:62: error: use of undeclared identifier 'fSYear'
namespace __rdf_0{ auto rdf_f = []() {return fCountry=="CH"&&fSYear==2016&&fOrder="b2b"
                                                             ^
Error in <TRint::HandleTermInput()>: std::runtime_error caught: Cannot interpret the following expression:
fCountry=="CH"&&fSYear==2016&&fOrder="b2b"

Make sure it is valid C++.

Please find attached the code
and a necessary data file.

-Eddy



ROOT Version: 6.20.02
Platform: MacOSX
Compiler: Not Provided


mytest2.C (4.1 KB) ticket2.txt (974 Bytes)

I guess @eguiraud can help you.

Hi,
the dataframe object on which you call Filter(select), at line 110, is empty because of line 98: auto tdf_cons = tdf_empty. The lines after 98 add the columns you want, but you need to save the result of that chain of calls and call Filter(select) on that modified RDF object:

auto tdf_cons_with_columns = tdf_cons
    .DefineSlotEntry("data", [&data](unsigned int /*slot*/, ULong64_t entry) { return data[entry]; })
    .Define("fUserId",  [](TData &d) { return d.fUserId;  }, {"data"})
    .Define("fCountry", [](TData &d) { return d.fCountry; }, {"data"})
    .Define("fZipCode", [](TData &d) { return d.fZipCode; }, {"data"})
    .Define("fOrder",   [](TData &d) { return d.fOrder;   }, {"data"})
    .Define("fCount",   [](TData &d) { return d.fCount;   }, {"data"})
    .Define("fPrice",   [](TData &d) { return d.fPrice;   }, {"data"})
    .Define("fSYear",   [](TData &d) { return d.fSYear;   }, {"data"})
    .Define("fEYear",   [](TData &d) { return d.fEYear;   }, {"data"});

auto tdf_cut = tdf_cons_with_columns.Filter(select);

This should fix that problem.
Also note that fOrder=\"b2b\" is missing an = sign at line 134: the comparison should read fOrder==\"b2b\".

Cheers,
Enrico

Hi Enrico,

Thank you for your answer but I am confused. I thought that my
Filter statement would be the action statement that would modify the RDF object tdf_cons.

Is there a command that lets me easily check that the data linked to a column name exists? The current error message is ambiguous, to say the least: the variable exists (the column name is there) but its associated columns are not allocated yet.

-Eddy

Hi Eddy,
you can ask an RDF object what columns it knows about (and their types) with GetColumnNames() and GetColumnType(colName).

If you call GetColumnNames on tdf_cons at line 110, you’ll see it returns an empty vector. If you assign the result of the sequence of Defines to a new variable, like in the snippet in my last message, and you call GetColumnNames() on that tdf_cons_with_columns variable, you’ll see it’s filled with all your Defines.

I guess the confusion comes from the fact that df.Define("x", "42") does not modify df. You need to define auto df2 = df.Define("x", "42") and then df2 will have the definition. Just like Filter returns a new dataframe that filters out some entries, Define returns a new dataframe that adds a new column.
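To make the point concrete without needing ROOT at hand, here is a small toy model of the behaviour. ToyNode and the two helper functions are made up for illustration and are not how RDataFrame is implemented; only the method names Define and GetColumnNames mirror the real interface. The sketch just shows that a Define whose result is discarded leaves the original object without the column:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a computation-graph node. This is NOT ROOT's implementation;
// it only mimics the "transformations return a new node" behaviour.
class ToyNode {
public:
    ToyNode() = default;

    // Returns a NEW node that knows about one more column; *this is unchanged.
    ToyNode Define(const std::string &name) const {
        assert(!name.empty());   // a column needs a name
        ToyNode next(*this);     // copy the columns known so far
        next.fColumns.push_back(name);
        return next;
    }

    // Columns visible from this node.
    const std::vector<std::string> &GetColumnNames() const { return fColumns; }

private:
    std::vector<std::string> fColumns;
};

// Mirrors the mistake in the thread: the result of Define is thrown away.
inline std::size_t columnsAfterDiscardedDefine() {
    ToyNode df;
    df.Define("x");              // result discarded, df is unchanged
    return df.GetColumnNames().size();
}

// Mirrors the fix: keep the node returned by Define.
inline std::size_t columnsAfterKeptDefine() {
    ToyNode df;
    auto df2 = df.Define("x");   // df2 knows "x", df still does not
    return df2.GetColumnNames().size();
}
```

In this toy, the first helper sees no columns and the second sees one, which is the same pattern as tdf_cons versus tdf_cons_with_columns above.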

Cheers,
Enrico

Hi Enrico,

I am slow; there was a fundamental misunderstanding on my side about the implementation of RDataFrame. I thought that in the example below d1 and d2 were the same.

ROOT::RDataFrame d1(10);
auto d2 = d1.Define("x","42").Filter("x==42");
auto d3 = d1;

-Eddy

Yes, I tried to clarify the misunderstanding in the last paragraph of my previous reply: did I succeed? :smile:

If yes, I believe we can mark the thread as resolved?

Cheers,
Enrico

Hi Enrico,

I understand now, but using RDataFrame this way has many more subtleties than I expected. In my code example I lived on borrowed time in routine MakeDataFrame, defining new columns followed by a summation on an already existing column (fCount).

-Eddy

MakeDataFrame looks ok at first glance.

I would argue it’s actually easier to reason about immutable nodes of a computation graph, and Define/Filter transformations that do not have side-effects but return the product of the transformation, the next node of the computation graph.

We have seen analysts make use of this to e.g. create a filtered_df with some high-level filters applied, which then they pass to several routines that produce branches of the computation graph for different systematics. If you had to worry about each of these functions modifying the dataframe, or if you had to worry about Define naming conflicts between these functions, it would make everything more complicated.
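A rough sketch of that pattern, again with a toy model rather than real ROOT code: ToyNode, nominalBranch and shiftedBranch are hypothetical names invented for this example. Both routines define a column called "weight" starting from the same shared node, and neither can affect the other or the shared node:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy immutable node: NOT ROOT's RDataFrame, just a model of the usage pattern.
struct ToyNode {
    std::vector<std::string> columns;  // columns defined up to this node
    std::vector<std::string> filters;  // filters applied up to this node

    // Each transformation copies the current state and returns the next node.
    ToyNode Define(const std::string &name) const {
        ToyNode next = *this;
        next.columns.push_back(name);
        return next;
    }
    ToyNode Filter(const std::string &expr) const {
        ToyNode next = *this;
        next.filters.push_back(expr);
        return next;
    }
    bool HasColumn(const std::string &name) const {
        return std::find(columns.begin(), columns.end(), name) != columns.end();
    }
};

// Two hypothetical systematics routines. Both define a column called "weight",
// which is fine because each one works on its own branch of the graph.
inline ToyNode nominalBranch(const ToyNode &base) {
    return base.Define("weight");
}
inline ToyNode shiftedBranch(const ToyNode &base) {
    return base.Define("weight").Define("shift");
}
```

Because the shared node is taken by const reference and every call returns a fresh node, a caller can hand the same filtered_df to any number of such routines without checking what they define internally.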

But I might be missing your point :sweat_smile:

I agree, my MakeDataFrame internally forgot this crucial step of passing the result of the transformation on to the next node of the graph. It then happily applied an operation (summing the counts) on the old node, where it succeeded because it only depended on info from the old node. :sweat_smile:

