Pandas dataframe to TH1

yosse_andrean · March 3, 2020, 8:52am

Dear experts,

I am using uproot to convert a TTree into a pandas DataFrame. The structure of the dataframe can be seen below. Note that ‘met’ is an entry-level variable, while the ‘mu_cells_*’ columns are sub-entry-level variables.

How can one make a TH1 histogram of ‘met’? Is there a function for this, or do I have to loop over the dataframe and call .Fill()? The loop has to run over entries (not sub-entries) to avoid multiple counting.

Similarly, how do I make a TH1 of ‘mu_cells_e’, given that it has to loop over sub-entries?

Best,
Yosse

                         met  mu_cells_e  mu_cells_side  mu_cells_tower
entry subentry                                                         
0     0         71755.648438  179.995682             -1               6
      1         71755.648438 -308.388519             -1               7
      2         71755.648438   15.558195             -1               8
      3         71755.648438  252.033691             -1               6
      4         71755.648438  459.172119             -1               7
...                      ...         ...            ...             ...
7107  22        26328.087891  611.708374              1               4
      23        26328.087891  -13.317616              1               6
      24        26328.087891   12.681366              1               2
      25        26328.087891   -4.776075              1               4
      26        26328.087891  -17.860764              1               6

[173410 rows x 4 columns]

ROOT Version: v6.18.04
Platform: lxplus CentOS7

eguiraud · March 3, 2020, 8:59am

Hi,
ROOT has no notion of pandas dataframes so it does not provide functions to work with them. The easiest thing is indeed to loop over the dataframe and Fill the histogram.

If there are issues with data representation (e.g. entry vs sub-entry), they must be solved on the pandas dataframe side. Awkward arrays as returned by uproot should provide a convenient interface for this kind of operation (but please note that the ROOT team is not involved in the development of uproot or awkward arrays, and the project has its own help channels, e.g. GitHub issues).
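A minimal sketch of handling the entry vs sub-entry distinction on the pandas side, assuming the MultiIndex levels are named ‘entry’ and ‘subentry’ as in the printout above. The toy frame and the binning are made up for illustration, and NumPy's histogram stands in for a TH1:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the uproot output: 'met' repeats per subentry, 'mu_cells_e' varies.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["entry", "subentry"]
)
frame = pd.DataFrame(
    {
        "met": [10.0, 10.0, 10.0, 20.0, 20.0],
        "mu_cells_e": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    index=index,
)

# Entry-level variable: keep one value per entry so each event is counted once.
met_per_entry = frame["met"].groupby(level="entry").first().to_numpy()

# Sub-entry-level variable: every row is already one cell, so use the column as is.
cell_e = frame["mu_cells_e"].to_numpy()

# Histogram the de-duplicated values (illustrative binning only).
met_counts, met_edges = np.histogram(met_per_entry, bins=2, range=(0.0, 40.0))
```

The resulting arrays can then be passed to whichever histogramming tool is in use, with no Python-level loop over rows.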

Depending on your needs, ROOT might provide a simpler, although maybe less pythonic, solution in RDataFrame:

import ROOT
df = ROOT.RDataFrame("treename", "filename.root")
met_histo = df.Histo1D("met")  # correctly loops over entries

Cheers,
Enrico

yosse_andrean · March 3, 2020, 9:29am

Hi Enrico,

Thanks for the quick response!

The reason I am not using RDataFrame (though I have tried) is that in my application I want to select sub-entries based on the sub-entry variables. For example, I need to plot ‘mu_cells_e’ for every cell with the same index that passes “mu_cells_side == 1 && mu_cells_tower == 6”. I find it difficult to achieve this in RDataFrame.

Thanks for pointing out that I need to loop over the dataframe. I guess I will have to look elsewhere for how to loop over entries and sub-entries.

Best,
Yosse

eguiraud · March 3, 2020, 9:40am

I need to plot ‘mu_cells_e’ for every cell with the same index that passes “mu_cells_side == 1 && mu_cells_tower == 6”

As you are not filtering a whole event, in RDataFrame you would need to Define a new column. RDF reads all arrays/vectors as RVec, a C++ object with some nice helpers:

df = df.Define("good_mu", "mu_cells_e[mu_cells_side == 1 && mu_cells_tower == 6]")
mu_cells_h = df.Histo1D("good_mu")

As usual, the Define expression needs to be valid C++, even when working in Python. That’s the price of performance with RDF (although loading everything in RAM once, like pandas does, might be faster overall when it is possible, depending on the application).
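For comparison, the same per-cell selection can be sketched on the pandas side as a boolean mask over rows. This assumes the column names from the printout in the first post; the toy values are invented for illustration:

```python
import pandas as pd

# Toy frame with the same column names as in the original post.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1)], names=["entry", "subentry"]
)
frame = pd.DataFrame(
    {
        "mu_cells_e": [1.5, 2.5, 3.5, 4.5],
        "mu_cells_side": [1, -1, 1, 1],
        "mu_cells_tower": [6, 6, 7, 6],
    },
    index=index,
)

# Row-wise mask; pandas needs & with parentheses where the C++ Define string uses &&.
mask = (frame["mu_cells_side"] == 1) & (frame["mu_cells_tower"] == 6)

# Energies of the cells that pass the selection, as a plain NumPy array.
good_mu = frame.loc[mask, "mu_cells_e"].to_numpy()
```

Because each row is one cell, the mask selects individual cells rather than whole events, which mirrors what the RVec expression does inside Define.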

Cheers,
Enrico

henryiii · March 3, 2020, 3:15pm

You’ll need to pull out a Series first for any further computation, because ROOT or any other tool will not know about Pandas sub-indexing. That can be done like this:

mu_cells_side = frame.mu_cells_side.xs(0, level='subentry')  # keeps subentry 0 of every entry, i.e. one value per entry

Now you can use TH1’s .FillN(len(mu_cells_side), mu_cells_side, ROOT.nullptr), boost-histogram’s .fill, or NumPy, as it is a normal array at this point. Feel free to call mu_cells_side = np.asarray(mu_cells_side) if any of those need a true NumPy array, though I don’t think they do. This will be much faster than looping in Python.
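A hedged sketch of the FillN route: fill_from_series is a hypothetical helper, not part of ROOT or pandas, and the histogram name and binning in the usage comment are assumptions. It converts the Series to a contiguous float64 array first, since FillN reads raw double buffers, and passes explicit unit weights so the sketch does not depend on how a given PyROOT version handles a null pointer argument:

```python
import numpy as np
import pandas as pd


def fill_from_series(hist, series):
    """Fill `hist` in a single FillN call from a pandas Series (hypothetical helper)."""
    # Force a contiguous float64 buffer; integer columns are converted here.
    values = np.ascontiguousarray(series.to_numpy(), dtype=np.float64)
    # One unit weight per value, matching an unweighted Fill.
    weights = np.ones_like(values)
    hist.FillN(len(values), values, weights)
    return values


# Usage with ROOT installed (not executed here; name and binning are assumptions):
#   import ROOT
#   h_met = ROOT.TH1D("h_met", "met", 50, 0.0, 200000.0)
#   fill_from_series(h_met, frame.met.xs(0, level="subentry"))
```

The helper only needs an object with a FillN(n, x, w) method, so it can be tried out without ROOT by passing a stand-in object.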
