Cannot TDataFrame Define same name in two sibling Filters

kkrizka · November 14, 2017, 12:50am

Hi, I am seeing some unexpected (at least for me) behaviour when trying to Define with the same variable name on two dataframes (df1 and df2) that were derived from a single TDataFrame (df) using the Filter function. The second Define call fails with the error “Redefinition of column “y””. Code demonstrating it is as follows:

rootdftest.py (571 Bytes)

Is this expected behaviour? The documentation says that Define will “Create a temporary branch that will be visible from all subsequent nodes of the functional chain”. The word “subsequent” suggests to me that this should work. Also, df2.Histo1D(“y”) crashes, which seems to indicate that the variable is indeed not defined in the second Filter’ed dataframe.

I am using the latest dev release, v6-11-02.

eguiraud · November 14, 2017, 8:31am

Hi kkrizka,
yes, this is the expected behavior: the visibility of the defined column is as stated in the docs. What is not stated (and I will see that it is added asap) is that the names of defined columns must be unique at the graph level.
This is an implementation limitation that cannot be lifted without some loss in performance, so I am not sure it will be lifted any time soon.
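
To make the "unique at the graph level" rule concrete, here is a minimal toy model in plain Python. It is not the TDataFrame implementation: the ToyNode class and the filter expressions are invented for illustration (the attached script is not reproduced in this thread). The only assumption is that every node derived from the same root shares one registry of defined names.

```python
class ToyNode:
    """One node of a toy computation graph (NOT the real TDataFrame)."""

    def __init__(self, defined_names=None, parent=None):
        # Every node derived from the same root shares this one set,
        # which is what makes uniqueness a graph-level property.
        self._defined_names = defined_names if defined_names is not None else set()
        self.parent = parent

    def Filter(self, expression):
        # A filter adds a node but defines no new column.
        return ToyNode(self._defined_names, parent=self)

    def Define(self, name, expression):
        if name in self._defined_names:
            raise RuntimeError(f'Redefinition of column "{name}"')
        self._defined_names.add(name)
        return ToyNode(self._defined_names, parent=self)


df = ToyNode()
df1 = df.Filter("x > 0")   # hypothetical cuts, for illustration only
df2 = df.Filter("x < 0")
df1.Define("y", "x * 2")   # first definition of "y" succeeds
try:
    df2.Define("y", "x * 3")   # sibling branch, same name
    error = None
except RuntimeError as exc:
    error = str(exc)
print(error)
```

In this toy the second Define fails even though df2 never sees the column defined on df1, which mirrors the behaviour reported in the first post.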

Cheers,
Enrico

kkrizka · November 14, 2017, 9:28pm

Hi Enrico,

I think that the current behaviour takes away some of the usability. Consider the case where I want to try different definitions of an object and then apply the same selection. An example being trying different working points for electron ID. Would be nice if I could just run a Define to select the electron type and then use common code to apply a common kinematic selection.

But I understand the desire for performance and I guess the extra book keeping in the current setting (tracking the definition as an extra setting during subsequent selection) is not too bad.

Speaking of performance, are there any plans in the future to support batch systems?

–
Karol Krizka

eguiraud · November 14, 2017, 11:20pm

Hi,
I am not convinced the usability penalty is that bad (although I am ready to be convinced).
You can do what you said and programmatically produce different names for different defined columns:

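// pseudo-code: Test, UniqueFilter and nTests are placeholders,
// and "x" stands for whichever input column Test needs
TDataFrame d("tree", "f.root");
std::vector<TDF::TResultProxy<TH1D>> histos;
for (int i = 0; i < nTests; ++i) {
  const auto column = "test_" + std::to_string(i);
  // capture i by value so that each lambda keeps its own test index
  auto h = d.Define(column, [i] (double x) { return Test(i, x); }, {"x"}) // different definition
          .Filter(UniqueFilter, {column}) // same filtering
          .Histo1D(column);
  histos.emplace_back(h);
}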

What am I missing?

Regarding batch systems: this is a very broad question; the answer could be “we already do”, “yes” or “no” depending on what exactly you are referring to. For example, it’s possible to run a TDataFrame analysis as a condor/slurm job. My guess is that you are referring to executing the analysis on multiple nodes of a computing cluster, in which case the answer is “if we can pull it off”. @Danilo might be able to comment more in depth on this topic.

kkrizka · November 16, 2017, 12:11am

Hi Enrico,

Maybe I should mention the way I am trying to use the TDataFrame classes. I am currently using an eventloop-like analysis where I hard-code all of my selections and histograms into a program that gets submitted to the local batch system. I am finding this a bit annoying as it requires a lot of bookkeeping of every change I make and isn’t very efficient (I tend to submit many more histograms than I actually look at, just in case I need them…). I would like to switch to a more interactive model using Jupyter notebooks. I think data frames are ideal for this. I see them as a TTree::Draw, but much more powerful. For example, I can define a basic selection and then extend the graph with an extra Filter to see what additional cuts would do. I can quickly try out different things without having to rerun the cached basic selections. The key requirement in this approach is that I want to change the dataframe graph on-the-fly.

I agree that the usability penalty in the above example is not too bad. I would have preferred the following usage though, as it means I don’t have to track my “settings” in two places (dataframe variable and the column name). It is a bit more annoying if you do plan to try out many things interactively as you tend to hardcode more things. At the same time, fast performance is very important if you want to work interactively.

def common_selection(df):
  return df.Filter('muon_pt>10').Filter('abs(muon_eta)<2.4')

df=ROOT.TDataFrame('tree', 'f.root')
df_tight=df.Define('muon_pt','muon_tight_pt').Define('muon_eta','muon_tight_eta')
df_loose=df.Define('muon_pt','muon_loose_pt').Define('muon_eta','muon_loose_eta')
common_selection(df_tight).Histo1D('muon_pt')
common_selection(df_loose).Histo1D('muon_pt')

This is just a simple example. I have more complicated use cases, such as trying different heuristics to select the correct large-R jet for an analysis looking for hadronically decaying particles boosted by an additional jet. The Define specifies the index of the large-R jet and then common code calculates/plots any common variables (invariant mass of the large-R jet, pT…) for “fatjet[candidateIdx]”.
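
For the muon example, one way to keep a single piece of common selection code under the current limitation is to pass the column prefix as a parameter instead of re-using one column name. The sketch below is only an illustration: book_pt_histograms and RecordingFrame are hypothetical names, RecordingFrame is a stand-in that merely records calls so the snippet runs without ROOT, and the muon_loose_pt branch is assumed to exist alongside the branches named in the post.

```python
def common_selection(df, prefix):
    """Apply the shared kinematic cuts to the columns named after `prefix`."""
    return (df.Filter(f"{prefix}_pt>10")
              .Filter(f"abs({prefix}_eta)<2.4"))


def book_pt_histograms(df, prefixes):
    """Book one pT histogram per prefix; no Define, so no name clash."""
    return {p: common_selection(df, p).Histo1D(f"{p}_pt") for p in prefixes}


class RecordingFrame:
    """Stand-in for a data frame: it only records the calls made on it."""

    def __init__(self, calls=()):
        self.calls = tuple(calls)

    def Filter(self, expression):
        return RecordingFrame(self.calls + (("Filter", expression),))

    def Histo1D(self, column):
        # Return the full call chain instead of a histogram.
        return self.calls + (("Histo1D", column),)


booked = book_pt_histograms(RecordingFrame(), ["muon_tight", "muon_loose"])
for prefix, chain in booked.items():
    print(prefix, chain)
```

With a real data frame in place of RecordingFrame, the same helper would book both histograms from one graph. The "setting" then lives only in the prefix string, at the cost of formatting column names inside the selection code.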

However, today I found a much more severe penalty with the way Define’ed columns are treated. If I would like to change a defined column (i.e. I found a bug in my definition), I will have to reset my entire notebook to update the definition and rerun things that don’t need rerunning… Maybe allowing users to “redefine” a Define would be good?

As for batch system support, I would like a way for TDataFrame to use the cluster in the background when the graph has to be evaluated, rather than coded scripts that have to be rerun in their entirety any time I make a simple change. I am not sure how easy this is technically.

–
Karol Krizka

eguiraud · November 16, 2017, 1:50pm

Hi,
thanks for the detailed reply. Summary of how I understood it, and replies, follow.

1. You would like to define columns with the same name in different branches of the same TDF graph. As I said, this is currently not possible because of “accidental” implementation choices: it doesn’t have to be this way, although there are performance reasons why it is so. Nevertheless, it might be possible and useful to lift this limitation in the future. Now that I am seriously considering the possibility, I think there might be a way to do it without significant runtime penalty. Some doubts remain about the subtle logical errors that users might make if we let them define the same name differently in different chains.
2. You would like to re-define an already-defined column. Interesting idea, it might definitely be useful in interactive environments such as Jupyter notebooks. Again, I’m unsure whether this would bring more confusion than benefits: consider auto h1 = tdf.Define("x", myx).Histo1D("x"); auto h2 = tdf.Redefine("x", myotherx).Histo1D("x"); …users will certainly do this and be surprised when both histograms come out with the same data (because of TDF’s lazy evaluation policy). But thanks a lot for bringing up the idea and your use-case for it. I think it’s worth considering carefully and I will definitely bring it up in discussions with the other people in the team involved in the design of TDF.
3. You would like TDataFrame to run the event loop on a remote cluster rather than on the local machine when you code in a Jupyter notebook. Now this is more than a feature request: it requires major developments in several areas. It is definitely something that has been discussed and is even suggested by the community white paper on data analysis of the HEP Software Foundation, so I agree with you that it’s something desirable.
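
The surprise described for Redefine can be reproduced with a small toy model of lazy evaluation. This is plain Python, not TDF: ToyFrame is invented for illustration, its Redefine simply overwrites the stored definition (one possible semantics, assumed here), and Take stands in for Histo1D by returning a list that is only filled when the loop runs.

```python
class ToyFrame:
    """Toy lazy frame: results are only computed when run() is called."""

    def __init__(self, values):
        self.values = values
        self.definitions = {}   # column name -> function of the input value
        self.booked = []        # (column name, output list) pairs

    def Define(self, name, func):
        self.definitions[name] = func
        return self

    # Assumed semantics: Redefine overwrites the existing definition.
    Redefine = Define

    def Take(self, name):
        out = []
        self.booked.append((name, out))
        return out              # still empty: nothing has run yet

    def run(self):
        # Single event loop: every booked result uses the definitions that
        # exist now, not the ones that existed when it was booked.
        for name, out in self.booked:
            out.extend(self.definitions[name](v) for v in self.values)


tdf = ToyFrame([1, 2, 3])
h1 = tdf.Define("x", lambda v: v * 10).Take("x")
h2 = tdf.Redefine("x", lambda v: v + 1).Take("x")
tdf.run()
print(h1, h2)   # both lists hold the values of the second definition
```

Because nothing is computed until the loop runs, h1 never sees the first definition of "x" in this toy.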

Thanks for the great input!
Enrico

kkrizka · November 16, 2017, 7:44pm

Hi,

I’m not sure I understand why the example in 2 will fail. Can’t the call to Redefine flush any existing evaluation cache or set a dirty bit?
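
The dirty-bit idea can be sketched as follows. This is a hypothetical illustration in plain Python (CachedColumnResult is not a ROOT class): a cached result is recomputed on the next access after its definition changes. It does not settle whether a result booked before the redefinition should keep the old definition, which is the concern raised above.

```python
class CachedColumnResult:
    """Toy result that recomputes itself when its definition changes."""

    def __init__(self, values, func):
        self.values = values
        self.func = func
        self._cache = None
        self._dirty = True
        self.n_evaluations = 0

    def redefine(self, func):
        self.func = func
        self._dirty = True      # invalidate whatever was cached

    def get(self):
        if self._dirty:
            # Re-run the (expensive) evaluation only when needed.
            self._cache = [self.func(v) for v in self.values]
            self._dirty = False
            self.n_evaluations += 1
        return self._cache


result = CachedColumnResult([1, 2, 3], lambda v: v * 10)
first = result.get()             # evaluates
again = result.get()             # served from the cache
result.redefine(lambda v: v + 1)
updated = result.get()           # dirty bit forces a re-evaluation
print(first, updated, result.n_evaluations)
```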

Just to clarify, I would prefer if Define could redefine a variable instead of introducing a new Redefine function. Again, this comes from my use-case of using TDataFrame in an interactive environment. I don’t want two separate versions of the code depending on whether I am restarting a Jupyter kernel or just updating existing code.

Thank you for considering my feedback!

–
Karol Krizka

Danilo · November 17, 2017, 7:47am

Hi Karol,

about interactive usage of cluster resources, we are making progress with some R&D here at CERN. For example, you can have a look at https://github.com/etejedor/root-spark.

TDF is a tool for declarative analysis which is inspired by functional principles. In this kind of context, speaking about re-definitions probably does not fit the overall picture. Nevertheless, your interactive use case will probably become more and more common and we take your feedback seriously: it might be useful to arrive at some kind of compromise in the future.

For my education, how is your Jupyter Notebook setup structured? Is it running on your machine? on the local cluster of your university/lab?

Cheers,
Danilo

kkrizka · November 20, 2017, 7:53pm

Hi Danilo,

Yes, I’m aware of root-spark and have been playing with it a bit. It’s part of my long term plan to switch away from ROOT to something that’s more commonly used by the data science community (aka receives more dedicated support from computing professionals, has more transferable skills,…). Apache Spark is at the top of my list of alternatives to ROOT. The root-spark project is awesome and made testing Spark quite easy. But I’m still stuck at the step of how to make histograms. I don’t think it is easy(/possible?) to get TH1 in Scala and I don’t have experience with too many other plotting frameworks. That’s why I am trying out TDataFrames for now. It has similar concepts, but comes with ROOT.

I would say that a declarative analysis and an interactive environment are a very natural combination. The lack of necessary logic statements makes it much easier to write code interactively and quickly perform studies. Also I’m not requesting redefinition to be part of a finalized analysis flow. I would like to use them to permanently fix bugs in existing definitions without having to rerun everything.

I’ve been using Jupyter Hub installed on my lab’s cluster to look at plots. This way I don’t have to manually transfer any of the input data files. I haven’t been able to get TDataFrames working there yet. There is a problem compiling the latest ROOT on the cluster, but the computing guys are looking into this. Something about builtin_lz4 trying to use the fancy intel cluster compiler instead of gcc…

–
Karol Krizka

Danilo · November 26, 2017, 9:19pm

Karol,

I think there is some confusion about the nature of the link I sent. The repo contains a tiny wrapper on top of PyROOT which allows one to issue map-reduce queries on Spark resources, linking ROOT to Spark via PyROOT and PySpark. It works quite well and, as you point out, with ROOT present, it’s very natural to create histograms and the like.
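
For readers unfamiliar with the pattern, a map-reduce histogram query can be sketched in plain Python. This is not the root-spark API: the function names and the sample data are invented, lists of bin counts stand in for ROOT histograms, and the built-in map and functools.reduce stand in for the distributed Spark steps.

```python
from functools import reduce


def fill_partial(values, n_bins, low, high):
    """Map step: histogram one partition into a list of bin counts."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if low <= v < high:
            # Clamp to guard against floating-point rounding at the edge.
            counts[min(int((v - low) / width), n_bins - 1)] += 1
    return counts


def merge(a, b):
    """Reduce step: add two partial histograms bin by bin."""
    return [x + y for x, y in zip(a, b)]


# Three made-up partitions, as if the data were split across workers.
partitions = [[0.5, 1.5, 2.5], [1.2, 1.7], [3.9, 0.1]]
partials = map(lambda part: fill_partial(part, 4, 0.0, 4.0), partitions)
total = reduce(merge, partials)
print(total)
```

Each partition is histogrammed independently and only the small partial histograms are combined, which is why this pattern suits cluster resources.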

Thanks a lot for the description of your cluster setup!

Cheers,
D
