Working with rows and columns of an ntuple

Hi,

I am learning to use ROOT. I would like to do some statistical analysis as well as time-series processing and fitting. I am turning to ROOT instead of the usual tools in my field of application (some R and a lot of Python with NumPy, pandas and scikit-learn) because I need integration into a C++ environment.

The data I need to process corresponds to a set of sample sites where several variables are measured over time. So for a particular site, we have, for each time stamp, the values v1, v2, …, vn. The variables are the same for every site, but the time stamps can differ from site to site. Each site belongs to a category, and several sites can belong to the same category. We usually store these data in an SQLite database using a single table, so we end up with tabular data like this (for 2 variables and 3 possible time stamps):

| site | category | v1_t1 | v2_t1 | v1_t2 | v2_t2 | v1_t3 | v2_t3 |
|------+----------+-------+-------+-------+-------+-------+-------|
|    1 |        3 | ...   | ...   | ...   | ...   | NaN   | NaN   |
|    2 |        3 | ...   | ...   | ...   | ...   | ...   | ...   |
|    3 |        1 | NaN   | NaN   | ...   | ...   | ...   | ...   |
|    4 |        2 | ...   | ...   | NaN   | NaN   | ...   | ...   |
|    5 |        3 | ...   | ...   | ...   | ...   | ...   | ...   |
|    6 |        4 | ...   | ...   | ...   | ...   | ...   | ...   |

The dots represent floating point values and the NaNs signal that no data is available for that time stamp for the given site.

My question is about how to represent the data in ROOT. The difficulty I have is that ROOT offers too many possibilities :wink: Searching the forums I found some partial answers, but since I am not familiar with some of ROOT’s terminology, I am a bit overwhelmed.

I understand that the best representation will depend on the type of analysis I need to do. In my case, I would like to:

  1. Fit analytical models to the time series of a variable, so I would use subsets of a row of the table above: v1_t1, v1_t2, v1_t3. These are usually long time series and I would also like to be able to do some spectral analysis (FFT, etc.).
  2. Do statistical analysis on one or several variables, so I would use several columns of the table.
  3. Do some classification or clustering, for example predicting the category of a site using the variables v1_t1, v1_t2, etc. as classification features or predictors; the same with regression, in order to predict one variable from the others.

When I do this with other tools, my data is just a matrix that I slice or subset by rows and/or columns; I put the data in an std::vector and run the number crunching.

Reading the ROOT manuals and examples, I get the feeling that my data could be stored in an ntuple (or a tree) if I want to work with columns. But then I don’t know how to extract rows to build a time series for a site (working with rows).

I guess my message is too long and not precise enough, but any suggestion will be helpful.

Thank you.

Garjola.

Yes, ideally you would write your data either as a TTree or a TNtuple. A TTree can hold any C++ class, while a TNtuple can only hold numbers in a table, which is what you have. ROOT is also capable of reading data in SQLite format. For processing the data once it’s in ROOT format, I recommend that you take a look at TDataFrame here.
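
Just to illustrate, here is a minimal sketch of what writing such a table as a TNtuple could look like. The file name, tuple name and example values are made up; the column names simply mirror your table:

// write_ntuple.C -- minimal sketch, assuming 2 variables x 3 time stamps as in the table above
#include "TFile.h"
#include "TNtuple.h"
#include <cmath> // std::nanf

void write_ntuple()
{
   TFile f("sites.root", "RECREATE");
   // One column per table field; a TNtuple stores floats only
   TNtuple nt("sites", "site measurements",
              "site:category:v1_t1:v2_t1:v1_t2:v2_t2:v1_t3:v2_t3");

   // One Fill call per row; NaN marks a missing measurement
   nt.Fill(1, 3, 0.1f, 0.2f, 0.3f, 0.4f, std::nanf(""), std::nanf(""));
   nt.Fill(2, 3, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f);

   nt.Write();
}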

Hi Garjola,

from what I can read, ROOT seems to be a perfect match for you. @amadio is right: you can always rely on ROOT’s columnar storage, be it a TTree or a TNtuple. One way, which is the one I would suggest, is to use TDataFrame, as I proposed in this other post: Convert SQLite Database into ROOT TTree

While in that thread we discuss converting from SQL to CSV and then analysing with ROOT, one could even think, thanks to the ROOT TSQLFile class, of writing a TDataSource to read directly from SQL files.
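
As an alternative to the TDataSource route, a rough sketch of reading the SQLite file directly through ROOT’s TSQLServer/TSQLStatement interface (the file name "sites.db" and the table name "measurements" are hypothetical):

// sqlite_read.C -- rough sketch, not the TDataSource approach mentioned above
#include "TSQLServer.h"
#include "TSQLStatement.h"
#include <iostream>

void sqlite_read()
{
   // "sqlite://" URIs are understood by TSQLServer::Connect
   TSQLServer *db = TSQLServer::Connect("sqlite://sites.db", "", "");
   if (!db) return;

   TSQLStatement *st = db->Statement("SELECT site, category, v1_t1 FROM measurements");
   if (st && st->Process()) {
      st->StoreResult();
      while (st->NextResultRow()) {
         int site     = st->GetInt(0);
         int category = st->GetInt(1);
         double v1_t1 = st->GetDouble(2);
         std::cout << site << " " << category << " " << v1_t1 << "\n";
      }
   }
   delete st;
   delete db;
}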

Cheers,
Danilo

Hi,

Thanks for your replies. I have answered about reading the SQLite file in the other thread.

Using TCsvDS is indeed a very nice way to convert the CSV into a data frame, and TDataFrame’s functional and lazy approach is very elegant.

So

 auto fileName = "../data/input_samples.csv";
 auto tdf = ROOT::Experimental::TDF::MakeCsvDataFrame(fileName);
 auto filteredEvents = tdf.Filter("code == 211").Count();
 std::cout << "Count = " << *filteredEvents << '\n';

works with my data.

However, I have the feeling that I can only work with columns (which is nice for histograms and correlations between variables). I still don’t see how to easily extract a subset of the variables for a row (a slice of the time series for a site, like v1_t1, v1_t2, …, v1_tn in my table above) in order to plot it or perform a polynomial fit.

Is there an example of how to do this?

Thanks.

Garjola

Hi Garjola,

great that this works for you!
You cannot extract a row given a certain set of columns and an entry number in one go at the moment, but you can probably achieve that with Take: https://root.cern.ch/doc/master/classROOT_1_1Experimental_1_1TDF_1_1TInterface.html#aedf4a1baa151db9813639aed3b2afbc9
This allows you to extract a column as a vector. So what one could do is the following:

auto d = mytdf.Filter([...]);
// Each Take lazily books the extraction of one full column as a std::vector
auto col0 = d.Take<col0Type>("col0");
auto col1 = d.Take<col1Type>("col1");
[...]
auto colN = d.Take<colNType>("colN");
// Dereference the Take results (this triggers the event loop) to get the vectors
auto myRow = std::make_tuple((*col0)[23], (*col1)[23], [...], (*colN)[23]);

Do you think this could work for you?

Cheers,
D

Hi,
Thanks for the suggestion. This is a bit tedious since N=50 for me, but since the columns I want to extract all have the same type, I think I can do this in a loop and put everything in a std::vector instead of a tuple. I think I can figure it out with the help you just gave me.

The next step is fitting polynomials to the time series that results from this extraction. Which is the best class to use for this: TF1, TGraph, TH1D?

Thanks.

Garjola

Hi,

having the same types greatly simplifies things indeed. At that point it’s just a loop to fill a collection of vectors, right?
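
For instance, a rough sketch of such a loop, continuing from the Take snippet above (I’m assuming the columns are all doubles; the v1_t1 … v1_t50 names and the entry index 23 are just made-up examples):

// Book one Take per time-stamp column, then pick the same entry in every
// result to obtain one site's time series.
std::vector<ROOT::Experimental::TDF::TResultProxy<std::vector<double>>> columns;
for (int t = 1; t <= 50; ++t)
   columns.emplace_back(d.Take<double>("v1_t" + std::to_string(t)));

// The event loop runs once, when the first result is accessed
const std::size_t entry = 23;
std::vector<double> timeSeries;
for (auto &c : columns)
   timeSeries.push_back((*c)[entry]);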

I think that for the time series the best class is TGraph. I can point you to this example that shows how to build a time series from a CSV: https://root.cern/doc/master/timeSeriesFromCSV__TDF_8C.html

Then the TGraph::Fit method should do the job.
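
For example, a small sketch of that last step, assuming the time stamps and values have already been extracted into two std::vector<double>s (the 3rd-order polynomial is just an example):

// Build a TGraph from one site's extracted time series and fit a polynomial.
// `times` and `values` are assumed to be the vectors filled in the previous step.
TGraph g(static_cast<int>(times.size()), times.data(), values.data());
g.Fit("pol3");   // built-in 3rd-order polynomial; choose the order you need
g.Draw("APL");   // optional: draw the points and the fitted curve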

Cheers,
D

Hi,

Great! This seems to be what I need.

Thanks again for your help.

Garjola

Hi,
it’s true that TDF is very column-oriented – this means that you need to create an extra column that stores all your variables in a vector, as you say. A relatively painless way to do it is the following:

tdf.Define("vectorOfColumns", "std::vector<double>({a,b,c,d,e})");

where a,b,c,d,e are the names of the columns that you want in the vector. You can build that string programmatically, so you can avoid typing the list of 50 column names explicitly.

This costs an extra copy of those 50 numbers but I hope it’s not a problem.
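
For instance, a rough sketch of building that expression programmatically (the v1_t1 … v1_t50 column names are again an assumption):

// Assemble the Define() expression for many columns in a loop
std::string expr = "std::vector<double>({";
for (int t = 1; t <= 50; ++t) {
   if (t > 1) expr += ",";
   expr += "v1_t" + std::to_string(t);
}
expr += "})";
// Keep the returned node: it is the one that carries the new column
auto tdfWithVec = tdf.Define("vectorOfColumns", expr);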

Cheers,
Enrico
