Is it possible to get the types of the columns in a TDataFrame? I am interested in the types of all columns, both those from the source TTree and those added with Define or Alias. I would like to be able to determine the types before executing actions (e.g. just after calling Define).
Hi,
this is the first time this has been requested. Internally we manage all this information, but we never exposed it in the interface. Two questions to better understand the use case:
Why is this a need for your task?
Do you mean getting the type names or typeinfo?
Cheers,
Danilo
Hi,
note that TDF internals are all C++, so the complete type information is compile-time only.
There is no single object that holds this information, as types cannot be passed around like values.
Runtime type information (C++ RTTI, e.g. typeinfo objects) is an easier but maybe less useful kind of information to provide.
I’m also curious what exactly it is that you need.
Cheers,
Enrico
Hi Danilo, Enrico,
Essentially, I want to write bindings for a dynamically typed language (R) and I need to handle the types at run-time as I don’t know what data I will get. (And I don’t want everything to be treated as double.) I imagine I need to do something like the commented section in this mock example:
#include <iostream>
#include <typeinfo>
#include "ROOT/TDataFrame.hxx"
void fill_tree(const char *treeName, const char *fileName) {
   ROOT::Experimental::TDataFrame d(10);
   int i(0);
   d.Define("b1", [&i]() { return (double)i; })
    .Define("b2", [&i]() { auto j = i * i; ++i; return j; })
    .Snapshot(treeName, fileName);
}
int tdf_types() {
   auto fileName = "tdf_types.root";
   auto treeName = "myTree";
   fill_tree(treeName, fileName);
   ROOT::Experimental::TDataFrame d(treeName, fileName, {"b1"});
   // max1 is double
   auto max1 = d.Max("b2");
   std::cout << "max1 (" << typeid(*max1).name() << ") = " << *max1 << std::endl;
   // static type
   auto max2 = d.Max<int>("b2");
   std::cout << "max2 (" << typeid(*max2).name() << ") = " << *max2 << std::endl;
   // wrong static type -> root knows the mismatch at run time and complains
   auto max3 = d.Max<unsigned int>("b2");
   std::cout << "max3 (" << typeid(*max3).name() << ") = " << *max3 << std::endl;
   // I want to handle run-time dispatch like this:
   /*
   std::string column = "b2";
   if (d.GetTypeId(column) == typeid(double)) {
      do_something_with(d.Max<double>(column));
   } else if (d.GetTypeId(column) == typeid(int)) {
      do_something_with(d.Max<int>(column));
   } else {
      // ...
   }
   */
   return 0;
}
cheers,
Rosen
Hi,
when you write d.Max("b2"), "b2" might be a column of integers but TDataFrame will anyway return a double (because it needs to decide what to return at compile time, and double is a reasonable default in absence of an explicit template parameter).
So the type of the column and the type of the result of Max can be different. I don’t understand if you would like to have the typeid of the column or the typeid of the result.
In the first case, I’m afraid TDataFrame does not (and in general cannot) provide that information: for just-in-time-compiled actions (e.g. d.Max("b2") without a template parameter) the information on the actual column type is only handled in an opaque manner from the just-in-time-compiled code.
However, you can directly query your TTree/TChain for that information using the same facilities that TDataFrame uses: $ROOTSYS/tree/treeplayer/inc/ROOT/TDFUtils.hxx contains the free functions ColumnName2ColumnTypeName and TypeName2TypeID for this purpose.
For user-defined columns (the ones created with Define) you can use ROOT::TypeTraits::CallableTraits to query the return type of the expression, e.g. with
#include "ROOT/TypeTraits.hxx"
template <typename F>
using CT = ROOT::TypeTraits::CallableTraits<F>;
typeid(CT<decltype(lambda)>::ret_type);
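To make the idea concrete without depending on ROOT, here is a minimal standard-library sketch of the same trick: deducing a lambda's return type from the signature of its operator(). The names ReturnTypeOf and ReturnTypeId are hypothetical and are not part of ROOT; the real CallableTraits may be implemented differently.

```cpp
#include <typeinfo>

// Hypothetical helper (not ROOT's CallableTraits): deduce the return type of
// a non-generic lambda or functor from the signature of its operator().
template <typename F>
struct ReturnTypeOf : ReturnTypeOf<decltype(&F::operator())> {};

// Const call operator (the default for lambdas).
template <typename C, typename R, typename... Args>
struct ReturnTypeOf<R (C::*)(Args...) const> {
   using type = R;
};

// Non-const call operator (mutable lambdas).
template <typename C, typename R, typename... Args>
struct ReturnTypeOf<R (C::*)(Args...)> {
   using type = R;
};

// Run-time type_info of the value the callable returns.
template <typename F>
const std::type_info &ReturnTypeId(const F &)
{
   return typeid(typename ReturnTypeOf<F>::type);
}
```

Passing the two lambdas from the mock example above to ReturnTypeId should yield typeid(double) for b1 and typeid(int) for b2. Generic lambdas (with auto parameters) are not handled by this sketch.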
If instead what you want is the typeid of a TDF result you can extract it from the type of the result itself:
#include "ROOT/TypeTraits.hxx"
template <typename T>
using ArgType = ROOT::TypeTraits::TakeFirstParameter_t<T>;
auto result = d.Max("c");
typeid(ArgType<decltype(result)>);
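As an illustration of what "take the first template parameter" means, here is a small standard-library sketch. ResultPtr is a hypothetical stand-in for the handle returned by a TDF action, and FirstParameter is an illustrative trait, not ROOT's own implementation.

```cpp
#include <type_traits>

// Hypothetical stand-in for a TDF result handle such as the one returned by Max.
template <typename T>
struct ResultPtr {
   T value;
};

// Primary template is left undefined: the trait only makes sense for
// instantiations of class templates.
template <typename T>
struct FirstParameter;

// Match Template<First, Rest...> and expose First.
template <template <typename...> class Template, typename First, typename... Rest>
struct FirstParameter<Template<First, Rest...>> {
   using type = First;
};

template <typename T>
using FirstParameter_t = typename FirstParameter<T>::type;

// Compile-time check: the first parameter of ResultPtr<double> is double.
static_assert(std::is_same<FirstParameter_t<ResultPtr<double>>, double>::value,
              "first template parameter should be double");
```

With a real result object, one would apply the trait to decltype(result) and wrap it in typeid to obtain run-time type information.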
There might be stupid mistakes in the snippets but they should give you an idea.
Hope this helps!
Cheers,
Enrico
(By the way, in the end I did implement Aggregate)
That’s already very helpful, thanks!
Even after adding another less trivial column
.Define("b3", [&i]() { return std::vector<int>{42, 43}; })
I can get the type name with
using ROOT::Internal::TDF::ColumnName2ColumnTypeName;
TChain ch;
ch.AddFile("tdf_types.root", 0, "myTree");
std::cout << "b3 type is " << ColumnName2ColumnTypeName("b3", &ch, nullptr) << std::endl;
// b3 type is vector<int>
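Once a type name is available as a string, the run-time dispatch from the first example can be written as a plain chain of comparisons that selects a template instantiation. The sketch below uses only the standard library: MaxAs is a hypothetical stand-in for a typed action such as d.Max<T>(column), and the type-name spellings ("int", "double", "unsigned int") are assumptions about what the lookup returns for a given branch.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a typed action such as d.Max<T>(column): it converts every
// value to T before comparing, then reports the maximum as a double.
template <typename T>
double MaxAs(const std::vector<double> &raw)
{
   T best = static_cast<T>(raw.at(0)); // throws std::out_of_range if raw is empty
   for (double v : raw)
      best = std::max(best, static_cast<T>(v));
   return static_cast<double>(best);
}

// Map a type name known only at run time to a compile-time instantiation.
double DispatchMax(const std::string &typeName, const std::vector<double> &raw)
{
   if (typeName == "double") return MaxAs<double>(raw);
   if (typeName == "int") return MaxAs<int>(raw);
   if (typeName == "unsigned int") return MaxAs<unsigned int>(raw);
   throw std::runtime_error("unsupported column type: " + typeName);
}
```

Every supported type has to be listed explicitly, so a type such as vector<int> ends up in the error branch unless a case is added for it.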
Follow up questions:
1. Can I somehow get the TTree to pass to ColumnName2ColumnTypeName from TDF::TInterface<Proxied>?
2. How do I get the type name for Define'd columns where I use a string expression?
3. For my curiosity, how/where in the codebase do you generate the code to be JIT compiled from the type name? I can almost use vector<int> to JIT something myself, but some header will be needed (#include and using std).
Thanks,
Rosen
PS. yes, Aggregate partly inspired me to look more into TDF
Good!
Follow up answers:
1. The TTree that TInterface would return is precisely the input tree, so you certainly can access it from outside TDF.
2.+3. Eh, this is less easy. What happens is that given Filter("x > 0") we pattern-match it to find out which column names are used inside, then we find out the types of those column names using ColumnName2ColumnTypeName, and then we jit something like Filter([](type_of_x x) { return x > 0; }). This is done in TDFInterface.cxx::JitTransformation and the other functions called from there. If you can spare the runtime, you can do this yourself:
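// tree is the input TTree; declare the lambda to cling, then ask for the typeid of its return type
std::string expr = "auto lambda = [](" + ColumnName2ColumnTypeName(column, tree, nullptr) + " " + column + ")"
   + "{ return " + body + "; };";
gInterpreter->Declare(expr.c_str());
auto tid = reinterpret_cast<const std::type_info*>(
   gInterpreter->ProcessLine("&typeid(ROOT::TypeTraits::CallableTraits<decltype(lambda)>::ret_type)"));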
As for the headers, cling takes care of that automatically.
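The pattern-matching step described above (finding which known column names appear in an expression) could look roughly like the following standard-library sketch. FindUsedColumns is a hypothetical name, the actual TDF code may work differently, and the sketch assumes column names are plain identifiers without regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the known column names that occur as whole identifiers in expr.
// Assumption: column names contain only letters, digits and underscores.
std::vector<std::string> FindUsedColumns(const std::string &expr,
                                         const std::vector<std::string> &columns)
{
   std::vector<std::string> used;
   for (const auto &col : columns) {
      // \b (word boundary) keeps "x" from matching inside "max" or "x2".
      const std::regex wholeWord("\\b" + col + "\\b");
      if (std::regex_search(expr, wholeWord))
         used.push_back(col);
   }
   return used;
}
```

The result preserves the order of the columns argument, not the order of appearance in the expression.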
Cheers,
Enrico
1. It seems the canonical way to get the tree is GetDataFrameChecked()->GetTree(), but GetDataFrameChecked is protected. Could you make this public or should I inherit in order to get access?
2. and 3. That’s very interesting, thanks for explaining!
I guess the snippet you suggest can become a part of the code generated in JitTransformation, where you also return the type_info together with the current result in a tuple. The caller TInterface::Define can then save the type_info for me, so I can query it later. Then there is no extra cost (and I don’t have to reimplement a lot of JitTransformation).
cheers,
Rosen
Hi Rosen,
1. Neither: you should get the TTree from the TFile; it will have the exact same result. GetDataFrameChecked and GetTree are implementation details and should stay hidden. None of the TDF federation of classes is designed to support runtime polymorphism, for simplicity and for performance reasons.
2. and 3. It’s more complicated than that for TDataFrame: jitting of actions does not happen until right before the event loop, and jitting of transformations will also be made lazier before the next release. So the type_info will not be available until you have triggered the event loop at least once.
We want to have the freedom to perform these kinds of tricks under the hood: this is the major performance advantage of declarative interfaces. But in order to be able to reshuffle the insides as much as needed, we have to avoid exposing too much of them to the users. I’m sorry!
What we could do is extract a free MakeLambda(typenames_str, columns_str, body_str) that you could also take advantage of. That wouldn’t hurt at all. If you can work with that, I can put it in my todo list.
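For illustration only, such a helper might assemble the lambda source along these lines. This is a sketch under assumptions: the name MakeLambdaSource and the use of vectors for the type and column names are hypothetical, and no such function is confirmed to exist in ROOT with this signature.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: build the source text of a lambda that takes one
// parameter per column and returns the value of the body expression.
std::string MakeLambdaSource(const std::vector<std::string> &typeNames,
                             const std::vector<std::string> &columns,
                             const std::string &body)
{
   if (typeNames.size() != columns.size())
      throw std::invalid_argument("need exactly one type name per column");
   std::string src = "[](";
   for (std::size_t i = 0; i < columns.size(); ++i) {
      if (i > 0)
         src += ", ";
      src += typeNames[i] + " " + columns[i];
   }
   src += ") { return " + body + "; }";
   return src;
}
```

For the type name "int", the column "x" and the body "x > 0", this produces the text [](int x) { return x > 0; }, which could then be handed to the interpreter.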
Cheers,
Enrico