Column types of a TDataFrame

Is it possible to get the types of the columns in a TDataFrame? I am interested in the types of all columns, both those from the source TTree and those added with Define or Alias. I would like to be able to determine the types before executing actions (e.g. just after calling Define).

Hi,

this is the first time this has been requested. Internally we manage all this information, but we never exposed it in the interface. Two questions to better understand the use case:
Why is this a need for your task?
Do you mean getting the type names or typeinfo?

Cheers,
Danilo

Hi,
note that TDF internals are all C++, so the complete type information is compile-time only.
There is no single object that holds this information, as types cannot be passed around like values.

Runtime type information (C++ RTTI, e.g. typeinfo objects) is an easier but maybe less useful kind of information to provide.
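
To illustrate the distinction, here is a minimal sketch in standard C++ (no TDF involved; `ColumnTypes`, `RegisterColumn` and `ColumnHasType` are hypothetical names used only for illustration). A type cannot be stored in a container, but a `std::type_index` built from `typeid` is an ordinary value that can:

```cpp
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>

// Hypothetical registry mapping column names to runtime type information.
// This is not part of TDataFrame; it only shows that std::type_index,
// unlike a type, is an ordinary value that can be stored and compared.
using ColumnTypes = std::map<std::string, std::type_index>;

template <typename T>
void RegisterColumn(ColumnTypes &types, const std::string &name)
{
   // T is known only here, at compile time; typeid(T) turns it into a runtime value.
   types.emplace(name, std::type_index(typeid(T)));
}

// Returns true if the column is registered and its recorded type matches ti.
bool ColumnHasType(const ColumnTypes &types, const std::string &name, const std::type_info &ti)
{
   auto it = types.find(name);
   return it != types.end() && it->second == std::type_index(ti);
}
```

The stored value can be compared against `typeid(double)` and so on, but it cannot be turned back into a template argument, which is one reason RTTI may be less useful than full compile-time type information.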

I’m also curious what exactly is it that you need :slight_smile:

Cheers,
Enrico

Hi Danilo, Enrico,

Essentially, I want to write bindings for a dynamically typed language (R) and I need to handle the types at run-time as I don’t know what data I will get. (And I don’t want everything to be treated as double). I imagine I need to do something like the commented section in this mock example:

#include <iostream>
#include <typeinfo>
#include "ROOT/TDataFrame.hxx"

void fill_tree(const char *treeName, const char *fileName) {
   ROOT::Experimental::TDataFrame d(10);
   int i(0);
   d.Define("b1", [&i]() { return (double)i; })
      .Define("b2", [&i]() { auto j = i * i; ++i; return j; })
      .Snapshot(treeName, fileName);
}

int tdf_types() {
   auto fileName = "tdf_types.root";
   auto treeName = "myTree";
   fill_tree(treeName, fileName);
   ROOT::Experimental::TDataFrame d(treeName, fileName, {"b1"});

   // max1 is double
   auto max1 = d.Max("b2");
   std::cout << "max1 (" << typeid(*max1).name() << ") = " << *max1 << std::endl;

   // static type
   auto max2 = d.Max<int>("b2");
   std::cout << "max2 (" << typeid(*max2).name() << ") = " << *max2 << std::endl;

   // wrong static type -> root knows the mismatch at run time and complains
   auto max3 = d.Max<unsigned int>("b2");
   std::cout << "max3 (" << typeid(*max3).name() << ") = " << *max3 << std::endl;

   // I want to handle run-time dispatch like this:
   /*
   std::string column = "b2";
   if (d.GetTypeId(column) == typeid(double)) {
       do_something_with(d.Max<double>(column));
   } else if (d.GetTypeId(column) == typeid(int)) {
       do_something_with(d.Max<int>(column));
   } else {
       // ...
   }
   */
   return 0;
}

cheers,
Rosen

Hi,
when you write d.Max("b2"), "b2" might be a column of integers but TDataFrame will anyway return a double (because it needs to decide what to return at compile time, and double is a reasonable default in absence of an explicit template parameter).

So the type of the column and the type of the result of Max can be different. I don’t understand if you would like to have the typeid of the column or the typeid of the result.
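
To make the column-type versus result-type distinction concrete, here is a toy sketch in plain C++. `MaxOf` is a hypothetical stand-in and not how TDataFrame implements Max; it only shows a result type that must be fixed at compile time and falls back to double when no template argument is given:

```cpp
#include <algorithm>
#include <type_traits>
#include <vector>

// Toy stand-in for an action such as Max: the result type R must be fixed at
// compile time, so it defaults to double when the caller does not specify it.
// The element type T of the input is deduced independently of R.
template <typename R = double, typename T>
R MaxOf(const std::vector<T> &values)
{
   R result = static_cast<R>(values.front()); // assumes a non-empty input
   for (const auto &v : values)
      result = std::max(result, static_cast<R>(v));
   return result;
}
```

With a `std::vector<int>` as input, `MaxOf(values)` yields a double while `MaxOf<int>(values)` yields an int, even though the "column" type is int in both cases.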

In the first case, I’m afraid TDataFrame does not (and in general cannot) provide that information: for just-in-time-compiled actions (e.g. d.Max("b2") without a template parameter) the information on the actual column type is only handled in an opaque manner from the just-in-time-compiled code.
However, you can directly query your TTree/TChain for that information using the same facilities that TDataFrame uses: $ROOTSYS/tree/treeplayer/inc/ROOT/TDFUtils.hxx contains the free functions ColumnName2ColumnTypeName and TypeName2TypeID for this purpose.
For user-defined columns (the ones created with Define) you can use ROOT::TypeTraits::CallableTraits to query the return type of the expression, e.g. with

#include <typeinfo>
#include "ROOT/TypeTraits.hxx"
template <typename F>
using CT = ROOT::TypeTraits::CallableTraits<F>;
// lambda is the callable that was passed to Define
typeid(typename CT<decltype(lambda)>::ret_type);
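
For readers without ROOT at hand, the idea behind such a callable trait can be sketched in standard C++. `ReturnTypeOf` and `ReturnTypeId` are hypothetical names and a simplified stand-in, not ROOT's implementation; the sketch assumes a non-generic lambda or functor with a single `operator()`:

```cpp
#include <type_traits>
#include <typeinfo>
#include <vector>

// Hypothetical, simplified stand-in for a callable-traits helper (not ROOT's
// implementation): it extracts the return type of a non-generic lambda or
// functor from the signature of its operator().
template <typename F>
struct ReturnTypeOf : ReturnTypeOf<decltype(&F::operator())> {};

// Const call operator: the case for lambdas without `mutable`.
template <typename C, typename R, typename... Args>
struct ReturnTypeOf<R (C::*)(Args...) const> {
   using type = R;
};

// Non-const call operator: the case for `mutable` lambdas.
template <typename C, typename R, typename... Args>
struct ReturnTypeOf<R (C::*)(Args...)> {
   using type = R;
};

// Runtime type information for the return type of a callable.
template <typename F>
const std::type_info &ReturnTypeId(const F &)
{
   return typeid(typename ReturnTypeOf<F>::type);
}
```

Calling `ReturnTypeId` on the lambdas passed to Define in the example above would give `typeid(double)` for b1, without ever invoking the lambda.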

If instead what you want is the typeid of a TDF result you can extract it from the type of the result itself:

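#include <typeinfo>
#include "ROOT/TypeTraits.hxx"
template <typename T>
using ArgType = ROOT::TypeTraits::TakeFirstParameter_t<T>;
auto result = d.Max("c");
// ArgType takes a type, so pass the type of the result rather than the result itself
typeid(ArgType<decltype(result)>);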
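
The same "first template parameter" trick can be sketched without ROOT. Here `ResultHolder` is a hypothetical stand-in for the result wrapper, and `FirstParameter` is an illustrative trait, not ROOT's own code:

```cpp
#include <type_traits>
#include <typeinfo>
#include <vector>

// Hypothetical stand-in for a TDF result wrapper; only its template parameter matters here.
template <typename T>
struct ResultHolder {
   T value;
};

// Primary template intentionally left undefined: only template instances are supported.
template <typename T>
struct FirstParameter;

// Matches any instance Template<First, Rest...> and exposes First.
template <template <typename...> class Template, typename First, typename... Rest>
struct FirstParameter<Template<First, Rest...>> {
   using type = First;
};

template <typename T>
using FirstParameter_t = typename FirstParameter<T>::type;
```

Given `ResultHolder<double> result`, `typeid(FirstParameter_t<decltype(result)>)` compares equal to `typeid(double)`.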

There might be stupid mistakes in the snippets but they should give you an idea.

Hope this helps!
Cheers,
Enrico

(By the way, in the end I did implement Aggregate :slight_smile: )

That’s already very helpful, thanks!

Even after adding another less trivial column

 .Define("b3", [&i]() { return std::vector<int>{42, 43}; })

I can get the type name with

   using ROOT::Internal::TDF::ColumnName2ColumnTypeName;
   TChain ch;
   ch.AddFile("tdf_types.root", 0, "myTree");
   std::cout << "b3 type is " << ColumnName2ColumnTypeName("b3", &ch, nullptr) << std::endl;
   // b3 type is vector<int>
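
Once a type name is available as a string, the run-time dispatch can be sketched roughly as follows. This is plain C++ with hypothetical names: `HandleColumn` stands for whatever the bindings would do per type (for instance calling `d.Max<T>(column)`), and the list of supported type names is an assumption:

```cpp
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical per-type handler. Real bindings would call a typed action here;
// this version only reports which instantiation was selected.
template <typename T>
std::string HandleColumn(const std::string &column)
{
   const char *kind = std::is_floating_point<T>::value ? "floating-point" : "integral";
   return column + " -> " + kind + " handler";
}

// Maps a run-time type name to the matching compile-time instantiation.
std::string DispatchOnTypeName(const std::string &typeName, const std::string &column)
{
   if (typeName == "double")
      return HandleColumn<double>(column);
   if (typeName == "int")
      return HandleColumn<int>(column);
   if (typeName == "unsigned int")
      return HandleColumn<unsigned int>(column);
   throw std::runtime_error("no handler for column type " + typeName);
}
```

Every type the bindings should support needs its own branch, since each branch forces a separate template instantiation at compile time.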

Follow up questions:

  1. Can I somehow get the TTree to pass to ColumnName2ColumnTypeName from TDF::TInterface<Proxied>?
  2. How do I get the type name for Defined columns where I use a string expression?
  3. Out of curiosity, how/where in the codebase do you generate the code to be JIT compiled from the type name? I can almost use vector<int> to JIT something myself, but some header will be needed (#include and using std).

Thanks,
Rosen

PS. yes, Aggregate partly inspired me to look more into TDF :slight_smile:

Good!

Follow up answers:

  1. The TTree that TInterface would return is precisely the input tree, so you can certainly access it from outside TDF.

  2. and 3. Eh, this is less easy. What happens is that, given Filter("x > 0"), we pattern-match it to find out which column names are used inside, then we find out the types of those columns using ColumnName2ColumnTypeName, and then we jit something like Filter([](type_of_x x) { return x > 0; }). This is done in TDFInterface.cxx::JitTransformation and the other functions called from there. If you can spare the runtime, you can do this yourself:

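// Sketch only: column, body and tree stand for the column name, the expression string and a pointer to the input TTree.
std::string lambda = "[](" + ColumnName2ColumnTypeName(column, tree, nullptr) + " " + column + ")"
                     + " { return " + body + "; }";
// Declare the lambda in the interpreter, then ask for the type_info of its return type.
gInterpreter->Declare(("auto lambda = " + lambda + ";").c_str());
auto tid = reinterpret_cast<const std::type_info *>(
   gInterpreter->ProcessLine("&typeid(ROOT::TypeTraits::CallableTraits<decltype(lambda)>::ret_type);"));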

As for the headers, cling takes care of that automatically.
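
The pattern-matching step described above can be sketched in standard C++ as follows. This is a simplified illustration under stated assumptions, not the code in JitTransformation: `FindUsedColumns` is a hypothetical name, and column names are assumed to be plain identifiers:

```cpp
#include <regex>
#include <string>
#include <vector>

// Simplified sketch of the pattern-matching step: return the known column names
// that appear as whole identifiers in the expression, in the order of knownColumns.
std::vector<std::string> FindUsedColumns(const std::string &expression,
                                         const std::vector<std::string> &knownColumns)
{
   std::vector<std::string> used;
   for (const auto &col : knownColumns) {
      // \b word boundaries prevent "x" from matching inside "x2" or "max".
      // Column names are assumed to contain no regex metacharacters.
      const std::regex pattern("\\b" + col + "\\b");
      if (std::regex_search(expression, pattern))
         used.push_back(col);
   }
   return used;
}
```

For the expression "x > 0" and known columns x, y and x2, only x is reported, and its type name can then be looked up to build the lambda signature.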

Cheers,
Enrico

  1. It seems the canonical way to get the tree is GetDataFrameChecked()->GetTree(), but GetDataFrameChecked is protected :confused: . Could you make this public or should I inherit in order to get access?
  2. and 3. That’s very interesting, thanks for explaining!
    I guess the snippet you suggest can become a part of the code generated in JitTransformation, where you also return the type_info together with the current result in a tuple. The caller TInterface::Define can then save the type_info for me, so I can query it later. Then there is no extra cost (and I don’t have to reimplement a lot of JitTransformation) :wink:

cheers,
Rosen

Hi Rosen,

  1. Neither :sweat_smile: you should get the TTree from the TFile: it will give the exact same result. GetDataFrameChecked and GetTree are implementation details and should stay hidden. None of the TDF federation of classes is designed to support runtime polymorphism, for simplicity and for performance reasons.
  2. and 3. It’s more complicated than that for TDataFrame: jitting of actions does not happen until right before the event loop, and jitting of transformations will also be made lazier before the next release. So the type_info information will not be available until you have triggered the event loop at least once.

We want to have the freedom to perform these kinds of tricks under the hood: this is the major performance advantage of declarative interfaces. But in order to be able to reshuffle the insides as much as needed, we have to avoid exposing too much of them to users. I’m sorry!
What we could do is extract a free MakeLambda(typenames_str, columns_str, body_str) that you could also take advantage of. That wouldn’t hurt at all. If you can work with that I can put it in my todo list :slight_smile:
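
As a rough idea of what such a helper might look like, here is a sketch in standard C++. The exact signature is not specified in this thread, so the parameter types below are assumptions, and the function only assembles source text without handing it to the interpreter:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of a MakeLambda-style helper. The parameter types are
// assumptions; the function returns the C++ source of a lambda that takes one
// argument per column and returns the value of the expression body.
std::string MakeLambda(const std::vector<std::string> &typeNames,
                       const std::vector<std::string> &columns,
                       const std::string &body)
{
   if (typeNames.size() != columns.size())
      throw std::invalid_argument("need exactly one type name per column");
   std::string src = "[](";
   for (std::size_t i = 0; i < columns.size(); ++i) {
      if (i > 0)
         src += ", ";
      src += typeNames[i] + " " + columns[i];
   }
   src += ") { return " + body + "; }";
   return src;
}
```

For example, one type name "double", one column "x" and the body "x > 0" produce the text `[](double x) { return x > 0; }`, which could then be passed to the interpreter by the caller.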

Cheers,
Enrico
