TDF and branches with leaflists

behrenhoff · December 5, 2017, 10:47am

Hello everybody,

I’m trying out TDataFrame, but I am running into some problems. Is it possible to do the equivalent of tree->Draw("branch.leaf") using TDF?

In the example program below, I create a simple tree with one branch holding a float leaf and an int leaf. Does TDF support this in any way?

Also, about the available actions: there is tdf.Mean(), but how do I calculate a standard deviation (something like tdf.StdDev)? Do I have to write my own (lazy) action? If so, is there a guide on how to do it? (OK, I could use Histo1D()->GetRMS().)

Is there an “early stop” or “break” in Filter? I would like to know whether any event satisfying a certain condition exists. With Filter().Count() != 0, the count has to loop over the whole tree.

#include <ROOT/TDataFrame.hxx>
#include <TBrowser.h>
#include <TCanvas.h>
#include <TTree.h>
#include <iostream>
#include <random>

using namespace std;

struct BrVal {
    Float_t a;
    Int_t i;
};

void tdf_test() {
    auto t = new TTree("t", "t");
    BrVal val;
    t->Branch("b", &val, "a/F:i/I");
    mt19937 rng;
    normal_distribution<Float_t> nd1(0, 1);
    uniform_int_distribution<Int_t> ud(-10, 10);
    for (size_t n = 1000; n; --n) {
        val.a = nd1(rng);
        val.i = ud(rng);
        t->Fill();
    }

    // now how to draw the histo, get mean, stddev?
    using TDF = ROOT::Experimental::TDataFrame;
    try {
        TDF df(*t);

        df.Histo1D("b.a")->Draw();
        // Doesn't work:
        // Unknown column: b.a

        df.Define("aa", [](BrVal bv) { return bv.a; }, {"b"}).Histo1D("aa")->Draw();
        // Doesn't work:
        // Error in <TTreeReaderValueBase::GetBranchDataType()>: The branch b was created using
        // a leaf list and cannot be represented as a C++ type. Please access one of its
        // siblings using a TTreeReaderArray:
        // Error in <TTreeReaderValueBase::GetBranchDataType()>:    b.a
        // Error in <TTreeReaderValueBase::GetBranchDataType()>:    b.i
        // Error in <TTreeReaderValueBase::CreateProxy()>: The branch b contains data of type
        // {UNDETERMINED TYPE}, which does not have a dictionary.

        // also doesn't work for the same reason
        cout << *df.Mean("b.a") << '\n';
        cout << *df.Mean("b.i") << '\n';
    } catch (const std::runtime_error &e) {
        cout << e.what() << '\n';
    }
    new TCanvas;
    t->Draw("b.a");
    // works!
}

eguiraud · December 5, 2017, 11:40am

Hi Wolf,
thank you for the valuable feedback!

Branches with leaflist
They are currently not supported. We should document this, I’ll take care of it.

Is there a reason why you cannot save “b” as a (split) struct?

StdDev action
tdf.StdDev is a good candidate to be added as a TDF action. If you feel like contributing it, you can just copy what is done for Mean in TDFInterface.hxx, add a StdDevHelper class in TDFActionHelpers.hxx (by copying MeanHelper) and add it as an action type in TDFUtils.hxx. It should be straightforward.
Currently there is no other way to plug in your own action.

“Downstream” solutions would be using Reduce or Foreach to perform the calculation you want.

Early stop of the event-loop
This is another good idea for a missing feature. We already have an early-stopping mechanism implemented, but it is only triggered by Range, and it is only available in single-thread executions. You would want a While transformation or something like that. It is worth thinking about. Could you maybe open a JIRA ticket with the feature request?
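
To make the cost difference concrete, here is a minimal sketch in plain C++ rather than TDF. The function names and the `visited` counter are made up for illustration; they are not part of ROOT. The first function mimics a Filter().Count() != 0 check and always touches every entry, the second stops at the first match, which is the behaviour an early-stop mechanism would give in a single-thread loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Existence check by counting: every entry is visited, even after a match.
// `visited` is illustrative bookkeeping that reports how many entries were read.
bool ExistsByCounting(const std::vector<int> &entries, int threshold, std::size_t &visited) {
    visited = 0;
    const auto nMatches = std::count_if(entries.begin(), entries.end(), [&](int x) {
        ++visited;
        return x > threshold;
    });
    return nMatches != 0;
}

// Existence check with an early stop: the loop ends at the first match.
bool ExistsWithEarlyStop(const std::vector<int> &entries, int threshold, std::size_t &visited) {
    visited = 0;
    for (int x : entries) {
        ++visited;
        if (x > threshold)
            return true; // no need to look at the remaining entries
    }
    return false;
}
```

For entries {1, 5, 2, 9, 3} and threshold 4, both functions report a match, but the counting version reads all five entries while the early-stop version reads only two. With several threads, each working on its own chunk, the other threads would also have to be told to stop, which this sketch does not attempt.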

behrenhoff · December 5, 2017, 12:59pm

I could, and there is nothing speaking against it from a technical point of view! (By split struct, do you mean one branch per variable in the struct? Or is there some other feature I haven’t discovered yet?) However, I often use it as a “grouping” feature. If your tree has many branches and you have a branch for each single element, you basically lose the ability to find important things quickly (it makes a difference whether you have 150 branches or 500 branches: you have to scroll much more in the TBrowser!). Also, sometimes the leaves really belong together, for example when storing a 128-bit integer (as 2 x ULong64_t) or when storing a date (multiple variables, e.g. epoch/year/month/day). In those cases I don’t see the point in having separate branches.
Also, I prefer to have POD values instead of objects. That makes things easier, and a lot of tools I have are designed to work on arbitrary input files (e.g. a plotting tool that plots ALL leaves of all branches versus some target variable), which is much easier to implement if you only have plain values.

eguiraud · December 5, 2017, 1:32pm

I see. How about this? The leaves are “packed together” in TBrowser, but can be accessed individually by TDF:

#include <iostream>
#include <ROOT/TDataFrame.hxx>
#include <TInterpreter.h>
#include <TTree.h>
using namespace ROOT::Experimental;

struct BrVal {
   float a;
   int i;
};

int main()
{
  // quick and not-sure-how-dirty way to get I/O for BrVal
  gInterpreter->Declare("struct BrVal { float a; int i; };");
  // write one BrVal to a TTree 
  TTree t("t", "t");
  BrVal val;
  t.Branch("b", &val);
  val.a = 4.2;
  val.i = 42;
  t.Fill();

  // read the branches with TDF
  TDataFrame d(t);
  std::cout << d.Take<float>("a")->at(0) << std::endl;

  return 0;
}

behrenhoff · December 5, 2017, 1:32pm

I don’t think it is straightforward.

There is more than one way to calculate StdDev. The obvious way is to calculate the mean first, because you need to sum (currentVal - mean)². So do I need two loops here? If I read the code correctly, “nSlots” is for running in parallel and each slot is filled by one thread? Can tdf.Mean, .Count, and .StdDev share counters?

In boost accumulators, variance can be calculated in two ways (lazy / not lazy). The lazy way works with a single loop. But then how does that work with the slot parameter? Also, lazy might be less accurate. So, at least to me, it is not immediately clear how to proceed.

eguiraud · December 5, 2017, 1:47pm

Fair enough, I just proposed a PR in case you had it “hot under the fingers”, but no matter, it’s now on my to-do list.

As for the other questions:
There are various algorithms to evaluate standard deviations in a single pass.
You are correct, each thread fills one slot.
Different action helpers (MeanHelper and StdDevHelper in this case) cannot share counters (what if they have different filters?)
Floating point accuracy issues would mainly arise from the running sums – there are ways to cope with them
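
As an illustration of the points above, here is a minimal sketch of one such single-pass approach in plain C++: Welford's update for each slot, plus the pairwise merge formula by Chan et al. to combine the slots at the end. It is not the TDF implementation. `RunningMoments` and `SlottedStdDev` are hypothetical names, and the slots are emulated by dealing values out round-robin instead of using real threads.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ROOT): running moments for one slot.
struct RunningMoments {
    std::size_t n = 0; // number of values seen so far
    double mean = 0.0; // running mean
    double m2 = 0.0;   // running sum of squared deviations from the mean

    // Welford's update: one value at a time, no second pass over the data.
    void Fill(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);
    }

    // Combine two partial results (pairwise formula by Chan et al.).
    void Merge(const RunningMoments &other) {
        if (other.n == 0) return;
        if (n == 0) { *this = other; return; }
        const double nA = static_cast<double>(n);
        const double nB = static_cast<double>(other.n);
        const double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * nA * nB / (nA + nB);
        mean += delta * nB / (nA + nB);
        n += other.n;
    }

    // Sample standard deviation (n - 1 in the denominator).
    double StdDev() const {
        return n > 1 ? std::sqrt(m2 / static_cast<double>(n - 1)) : 0.0;
    }
};

// Fill one accumulator per slot, then merge them once the loop is done.
// In a multi-thread run each thread would own exactly one slot.
double SlottedStdDev(const std::vector<double> &values, std::size_t nSlots) {
    if (nSlots == 0) nSlots = 1; // guard against an empty slot list
    std::vector<RunningMoments> slots(nSlots);
    for (std::size_t i = 0; i < values.size(); ++i)
        slots[i % nSlots].Fill(values[i]);
    RunningMoments total;
    for (const auto &slot : slots)
        total.Merge(slot);
    return total.StdDev();
}
```

Because each slot keeps its own count, mean and sum of squared deviations, no state is shared between threads while the loop runs, and the update avoids subtracting two large running sums, which is where the naive sum-of-squares formula loses accuracy.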

I am afraid these are very busy weeks and I will not be able to put StdDev in any time soon, but it will certainly be added as soon as possible.
We also decided that in your first example df.Histo1D("b.a")->Draw(); should work – we will fix it.
Early stopping of the event loop is also on my to-do list now but it is a difficult problem to solve in parallel executions.

As always, your feedback is super welcome (and very useful).
Cheers,
Enrico

eguiraud · December 6, 2017, 10:15pm

FYI, I created an issue to track the support of branches with leaflists.


eguiraud · April 4, 2018, 11:32am

Hi Wolf,
I’m resurrecting this old thread to let you know that as of yesterday’s master branch TDataFrame supports branches with leaflists and several other common nested branch topologies, e.g.

tdf.Filter("b1.b2.leaf > 0").Filter([](double x) { return x < 0; }, {"str.member"})

Cheers,
Enrico
