TDataFrame: Mapping vector entries to single values

Hi,

I’m currently playing with the new TDataFrame and I really like the way it works. I am wondering whether there is any option to map the entries of a vector to single values. As far as I understand, you can define new variables on the fly with the Define(…) function, but this returns only a single value per event, not the individual entries of the vector as separate values.

The main issue is that I have a vector for each observable, because several detectors can fire per event. Sometimes you do not want to reject the whole event just because one detector failed to pass a cut.

Cheers,
Thomas

Hi tquante,
thanks for checking out TDataFrame!

The answer to your question strongly depends on what you are actually trying to accomplish.
TDataFrame, at least in its current implementation, is very much oriented to per-event processing (parallel processing is based on chunking of events, all actions act on one event at a time, etc…), so you cannot take a vector in an event and treat it as if it was multiple events – it would not play well with the rest of the framework.

If you don’t want to reject a whole event in case one detector failed to pass a cut, just don’t! For example you might Define a new boolean column that signals whether a certain detector has passed a cut or not – you keep the event and the cut information, and can use that information in further processing.
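To make the idea concrete, here is a minimal sketch in plain C++ with no ROOT dependency. The Event struct, the energies field, and the helpers PassFlags and SelectPassing are hypothetical names invented for illustration. In a real analysis, a function like PassFlags would presumably be the callable passed to Define, producing one flag per detector while the event itself is kept.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-event record: one energy reading per detector that fired.
struct Event {
    std::vector<double> energies;
};

// Returns one flag per detector: true if that detector's reading passes the cut.
// The event is not rejected; the cut result is stored next to the data.
std::vector<bool> PassFlags(const Event& ev, double minEnergy) {
    std::vector<bool> flags;
    flags.reserve(ev.energies.size());
    for (double e : ev.energies) {
        flags.push_back(e > minEnergy);
    }
    return flags;
}

// Later processing step: keep only the readings whose flag is set.
std::vector<double> SelectPassing(const Event& ev, const std::vector<bool>& flags) {
    std::vector<double> kept;
    for (std::size_t i = 0; i < ev.energies.size() && i < flags.size(); ++i) {
        if (flags[i]) {
            kept.push_back(ev.energies[i]);
        }
    }
    return kept;
}
```

With readings {1.0, 5.0, 3.0} and a cut at 2.0, the flags come out as {false, true, true}, and the later step sees only the two passing readings while the event stays in the dataset.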

Bottom line is: as far as I know, there is no way around per-event processing with TDataFrame, but probably there are ways to do exactly what you want anyway. The how depends on what “exactly what you want” is.

Hi,
thanks for your reply. I have played around a bit more and tried defining a new column for each detector that contains the whole event information for that detector. It works quite nicely, but as I increase the number of columns, the performance drops quite significantly:

1 detector: approx. 6 seconds
12 detectors: 56 seconds

I assume that most of this stems from creating structs on the fly, so I need to play around a bit more to find the optimal setup. I also need to get rid of the typical TTree way of thinking, which is quite different, and adapt to the new workflow.

Hi,
can you maybe share all or part of the TDF analysis?
I might be able to give some suggestions on how to improve the performance, especially if you are writing C++. If you are writing Python code, we’ll introduce some performance improvements for the next release.

Hi,

sorry for the late reply. I ran some tests and TDF behaves somewhat unexpectedly to me.

In my current analysis I define a set of preselections which are chained filters (a filter of a filter, etc.), like in the examples. If I do the analysis this way, everything works fine and fast. The next stage defines filters like the following:

lastprefilter.Filter(firstsubbranch);
lastprefilter.Filter(secondsubbranch);
.
.
.

Even when I ensure that no event can pass the preselection, the execution time grows (linearly?) with each filter added. I would expect the additional filters not to contribute to the execution time, because no event is left.

Is there any way around this effect, besides defining everything in one big lambda and defining several new variables?

Thanks for your patience and your help.

Cheers
Thomas

Hi,
first of all thank you for trying out and profiling TDataFrame, we can definitely use more user feedback especially now that everything is so fresh.

Regarding your latest issue:

tdf.Filter(f1).Filter(f2).Histo1D("x");

has a very different meaning than

tdf.Filter(f1);
tdf.Filter(f2);

In the first case f2 is executed only if f1 passes, in the second case both filters are executed every time. This becomes clear if you add an action at the end of the functional chain:

auto h = tdf.Filter(f1).Filter(f2).Histo1D("x");
// vs
auto h1 = tdf.Filter(f1).Histo1D("x");
auto h2 = tdf.Filter(f2).Histo1D("x");

In the first case h is filled with entries that pass both f1 and f2, in the second case h1 is filled with entries that pass f1, h2 is filled with entries that pass f2.

TDF does not execute “downstream” filters for events that do not pass “upstream” filters.
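The difference can be mimicked in plain C++ without ROOT. The sketch below is only a model of the semantics described above, not TDF's actual implementation; FilterStats, RunChained and RunBranched are hypothetical names, and each entry is reduced to a single int for brevity. It counts how often the second predicate is evaluated in each layout.

```cpp
#include <functional>
#include <vector>

using Predicate = std::function<bool(int)>;

// Bookkeeping for the second filter in each layout.
struct FilterStats {
    int secondFilterCalls = 0;  // how many times f2 was evaluated
    int passedSecond = 0;       // how many entries reached the action after f2
};

// Models tdf.Filter(f1).Filter(f2).Histo1D("x"):
// f2 is evaluated only for entries that already passed f1.
FilterStats RunChained(const std::vector<int>& entries,
                       const Predicate& f1, const Predicate& f2) {
    FilterStats stats;
    for (int x : entries) {
        if (!f1(x)) {
            continue;  // the downstream filter is skipped for this entry
        }
        ++stats.secondFilterCalls;
        if (f2(x)) {
            ++stats.passedSecond;
        }
    }
    return stats;
}

// Models two independent branches, tdf.Filter(f1).Histo1D("x") and
// tdf.Filter(f2).Histo1D("x"): f2 is evaluated for every entry,
// whatever f1 returned.
FilterStats RunBranched(const std::vector<int>& entries,
                        const Predicate& f1, const Predicate& f2) {
    FilterStats stats;
    for (int x : entries) {
        (void)f1(x);  // first branch runs, but does not gate the second one
        ++stats.secondFilterCalls;
        if (f2(x)) {
            ++stats.passedSecond;
        }
    }
    return stats;
}
```

If f1 rejects every entry, the chained layout never evaluates f2, while the branched layout evaluates it once per entry. In this model, that is where a cost that grows with the number of sibling filters comes from.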

Anyway, now we are straying further and further from the topic of this thread. If you encounter a specific issue with TDataFrame or you feel that TDF is doing something suboptimal under the hood, please open a new thread in the forum. If you feel the documentation is lacking in some places, either open a new thread here or, even better, open a JIRA issue or, better still, suggest a fix and open a PR on our GitHub repo.

Looking forward to more feedback,
Enrico
