Hi,
I am learning to use ROOT. I would like to do some statistical analysis as well as time series processing and fitting. I am turning to ROOT instead of the usual tools in my field (some R and a lot of Python with numpy, pandas and scikit-learn) because I need integration into a C++ environment.
The data I need to process corresponds to a set of sample sites where several variables are measured over time. So for a particular site we have, for each time stamp, the values v1, v2, …, vn. The variables are the same for every site, but the time stamps can differ from site to site. Each site belongs to one category, and several sites can belong to the same category. We usually store these data in an SQLite database using a single table. So we actually have tabular data like this (for 2 variables and 3 possible time stamps):
| site | category | v1_t1 | v2_t1 | v1_t2 | v2_t2 | v1_t3 | v2_t3 |
|------+----------+-------+-------+-------+-------+-------+-------|
| 1 | 3 | ... | ... | ... | ... | NaN | NaN |
| 2 | 3 | ... | ... | ... | ... | ... | ... |
| 3 | 1 | NaN | NaN | ... | ... | ... | ... |
| 4 | 2 | ... | ... | NaN | NaN | ... | ... |
| 5 | 3 | ... | ... | ... | ... | ... | ... |
| 6 | 4 | ... | ... | ... | ... | ... | ... |
The dots represent floating point values and the NaNs signal that no data is available for that time stamp for the given site.
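To make the layout concrete, here is a small sketch of how a (variable, time stamp) pair maps to a column of this table. The helper names `columnIndex` and `columnName` are made up for this post, and I am assuming 0-based indices with site and category as the two leading columns.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: they only describe the table layout shown above.
// Columns 0 and 1 hold site and category; after that the variables are
// interleaved per time stamp: v1_t1, v2_t1, v1_t2, v2_t2, ...
std::size_t columnIndex(std::size_t var, std::size_t stamp, std::size_t nVars)
{
    return 2 + stamp * nVars + var;   // var and stamp are 0-based
}

std::string columnName(std::size_t var, std::size_t stamp)
{
    // Names in the table are 1-based, e.g. var = 0, stamp = 2 gives "v1_t3".
    return "v" + std::to_string(var + 1) + "_t" + std::to_string(stamp + 1);
}
```

With 2 variables, the time series of v1 therefore lives in columns 2, 4 and 6, and that of v2 in columns 3, 5 and 7.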
My question is about how to represent the data in ROOT. The difficulty I have is that ROOT offers too many possibilities. Searching the forums I found some partial answers, but since I am not familiar with some of ROOT’s terminology, I am a bit overwhelmed.
I understand that the best representation will depend on the type of analysis I need to do. In my case, I would like to:
- Fit analytical models to the time series of a variable, so I am using subsets of a row of the table above: v1_t1, v1_t2, v1_t3. These are usually long time series, and I would also like to be able to do some spectral analysis (FFT, etc.).
- Do statistical analysis on one or several variables, so I will use several columns of the table.
- Do some classification or clustering, for example predicting the category of a site using the variables v1_t1, v1_t2, etc. as classification features or predictors. The same goes for regression, in order to predict one variable from the others.
When I do this using other tools, my data is just a matrix: I slice or subset it by rows and/or columns, put the data in a std::vector and run the number crunching.
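To show the kind of row and column slicing I mean, here is a minimal sketch in plain C++. The `Table` struct and the `timeSeries` and `column` functions are hypothetical names made up for this post, not part of ROOT or any library, and I am assuming a dense row-major layout with NaN marking missing values.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container: a dense row-major matrix, one row per site.
struct Table {
    std::size_t nRows;
    std::size_t nCols;
    std::vector<double> values;   // size nRows * nCols, row-major

    double at(std::size_t r, std::size_t c) const { return values[r * nCols + c]; }
};

// Row slice: the time series of one variable for one site.
// Reads columns firstCol, firstCol + stride, ... and drops NaN entries.
// stride must be greater than zero (it is the number of variables).
std::vector<double> timeSeries(const Table& t, std::size_t row,
                               std::size_t firstCol, std::size_t stride)
{
    std::vector<double> series;
    for (std::size_t c = firstCol; c < t.nCols; c += stride) {
        const double v = t.at(row, c);
        if (!std::isnan(v)) series.push_back(v);
    }
    return series;
}

// Column slice: one feature across all sites.
// NaN values are kept so that entries stay aligned with the site rows.
std::vector<double> column(const Table& t, std::size_t col)
{
    std::vector<double> out;
    out.reserve(t.nRows);
    for (std::size_t r = 0; r < t.nRows; ++r) out.push_back(t.at(r, col));
    return out;
}
```

For the table above, `timeSeries(t, 0, 2, 2)` would give the available v1 values for the first site. Dropping the NaNs also loses which time stamps were missing, so for a real fit I would keep the time stamps next to the values.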
Reading the ROOT manuals and examples, I feel that my data could be stored in an ntuple (or a tree) if I want to work with columns. But then I don’t know how to extract rows to build a time series for a site (working with rows).
I guess my message is too long and not precise enough, but any suggestion will be helpful.
Thank you.
Garjola.