ROOT file vs SQLite

I am not a ROOT user and have a fairly generic question: how does a ROOT file compare to an SQLite file? Advantages and disadvantages, etc.

Context: I am doing my PhD in radiation detectors. I use Python a lot but have practically never used ROOT. I am running measurements that produce large amounts of data, e.g. 1-4 GB per run, on an ordinary lab computer that controls the measuring setup. I need to scale up a bit and am currently looking for a way to store this data as it grows while the measurement is ongoing. I have used SQL in the past (well, actually SQLite, which integrates very well with Pandas; a minimal sketch of that workflow follows the list below). I could go for this option, but recently a colleague told me that he uses ROOT files for this and that they are very fast to read (but slow to write?). So now I am curious to see whether ROOT may be the better option; I guess in that case I would use PyROOT. From my limited point of view:

  • SQLite:
    • Can handle huge databases.
    • Very widespread in any industry, has been there for about 20 years.
    • Lots of help in StackOverflow.
    • Cross platform and trivial to install (if not shipped by default with the OS).
  • ROOT file: Convince me :slight_smile: .
    • File size?
    • Read speed as compared to SQLite?
    • Python friendly?
    • What kind of data/format is it best suited for?
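
For reference, this is the kind of Pandas/SQLite integration I mean; a minimal sketch with made-up file and table names:

import pandas as pd
import sqlite3

# SQLite needs no server; the file is created on first use.
con = sqlite3.connect("measurements.sqlite")

df = pd.DataFrame({"Time (s)": [7.0000e-08, 7.0025e-08],
                   "Amplitude (V)": [-0.2558, -0.2558]})
df.to_sql("waveforms", con, if_exists="append", index=False)  # write (or append)

df_back = pd.read_sql("SELECT * FROM waveforms", con)         # read back
con.close()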

Welcome to the ROOT Forum! I think this is a question for @pcanal, @Axel, or even @eguiraud.

ROOT is optimized for streaming large amounts of data, possibly selecting only a subset of columns when reading. File size will be smaller than with SQLite, and reading will be significantly faster, at least for the kind of data and data processing that HEP uses: many repetitions of a similar data structure (schema), too much to fit into memory, with possibly varying-size (possibly nested) collections.
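
As a rough illustration from Python (a sketch with made-up tree, file, and branch names):

import ROOT

# Opening is lazy: nothing is read from disk yet.
df = ROOT.RDataFrame("waveforms", "waveforms.root")

# Only the requested branch is read and decompressed;
# every other column in the file is skipped entirely.
amplitudes = df.AsNumpy(columns=["amplitude"])["amplitude"]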

Oh, and schema evolution is built in. You will be convinced that you have developed the perfect data format for your detector; but if you ever come up with something slightly more perfect, you would have to migrate your SQLite data. Not so with ROOT: we cover a very large set of schema changes transparently, and that has been tested by production use for decades.

I hope that helps. I don’t know your data format (i.e. the schema for your measurements) - that should be the deciding factor.

With RDataFrame it should also be significantly simpler to write ROOT TTrees; with Conda it should be significantly simpler to set ROOT up. Let us know if we can help with anything!
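
For example, writing a tree from numpy arrays is just a few lines. This sketch assumes ROOT >= 6.26, where the helper is called ROOT.RDF.FromNumpy (older releases name it ROOT.RDF.MakeNumpyDataFrame):

import numpy as np
import ROOT

# Build an in-memory RDataFrame from numpy arrays.
data = {
    "time": np.arange(0.0, 1.0e-07, 2.5e-11),            # 4000 samples
    "amplitude": np.random.normal(-0.25, 0.01, 4000),
}
df = ROOT.RDF.FromNumpy(data)

# Write everything out as a TTree in one call.
df.Snapshot("waveforms", "waveforms.root")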


Thanks. I am probably dealing with an almost “prototype use case”, but here is a more detailed description. At the moment I have two different schemas (I hope I have properly understood the meaning of the term): one for the raw waveforms I measure with the oscilloscope out of the detectors, and one for the “parsed waveforms” after I extract the features of each. This is how they look for one example measurement (which still fits in memory):

Waveforms (waveforms_df):

                            Time (s)  Amplitude (V)
n_trigger device_name                              
0         HPK-4         7.000000e-08        -0.2558
          HPK-4         7.002500e-08        -0.2558
          HPK-4         7.005000e-08        -0.2558
          HPK-4         7.007500e-08        -0.2558
          HPK-4         7.010000e-08        -0.2558
...                              ...            ...
2221      Photonis PMT  1.699250e-07        -0.2530
          Photonis PMT  1.699500e-07        -0.2530
          Photonis PMT  1.699750e-07        -0.2530
          Photonis PMT  1.700000e-07        -0.2530
          Photonis PMT  1.700250e-07        -0.2530

[17784888 rows x 2 columns]

So then, e.g., waveforms_df.loc[(0,'HPK-4')] returns a single waveform, i.e. Time (s) vs Amplitude (V), which you can plot.

After “parsing” the previous data frame (i.e. iterating over each individual waveform) I produce a structure that looks like this (measured_data_df):

                        Amplitude (V)  Bias current (A)  Bias voltage (V)  ...      t_70 (s)      t_80 (s)      t_90 (s)
n_trigger device_name                                                      ...                                          
0         HPK-4              0.085514          0.000000              77.0  ...  1.227209e-07  1.228223e-07  1.229736e-07
          Photonis PMT       0.071317          0.000095            2700.0  ...  1.198947e-07  1.199238e-07  1.199613e-07
1         HPK-4              0.077032          0.000000              77.0  ...  1.228937e-07  1.230124e-07  1.231312e-07
          Photonis PMT       0.062531          0.000095            2700.0  ...  1.198288e-07  1.198567e-07  1.198942e-07
2         HPK-4              0.007026          0.000000              77.0  ...  1.343812e-07  1.343875e-07  1.343937e-07
...                               ...               ...               ...  ...           ...           ...           ...
2219      Photonis PMT       0.157963          0.000095            2700.0  ...  1.200711e-07  1.201036e-07  1.201363e-07
2220      HPK-4              0.006759          0.000000              77.0  ...  9.925689e-08  9.926293e-08  9.926896e-08
          Photonis PMT       0.065698          0.000095            2700.0  ...  1.198620e-07  1.198913e-07  1.199207e-07
2221      HPK-4              0.006115          0.000000              77.0  ...  7.533362e-08  7.533908e-08  7.534454e-08
          Photonis PMT       0.068961          0.000095            2700.0  ...  1.198718e-07  1.199009e-07  1.199317e-07

[4444 rows x 30 columns]

(Note that Amplitude (V) here is a single number representing the amplitude of the signal; it is no longer the amplitude-vs-time column of waveforms_df.)

The waveforms_df data frame is created on the fly while the measurement is ongoing; it is the more demanding of the two in terms of resources and the one I am more interested in porting to either ROOT or SQL. For this one I would like to regularly store the waveforms on disk, appending the new data to a file.
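
Concretely, something like this is what I have in mind (acquire_waveform is a made-up placeholder for the real oscilloscope readout):

import numpy as np
import pandas as pd
import sqlite3

def acquire_waveform(n_trigger):
    """Placeholder for the real DAQ readout: returns one fake waveform."""
    t = np.arange(7.00e-08, 1.70e-07, 2.5e-11)
    return pd.DataFrame({
        "n_trigger": n_trigger,
        "device_name": "HPK-4",
        "Time (s)": t,
        "Amplitude (V)": np.random.normal(-0.2558, 0.001, len(t)),
    })

con = sqlite3.connect("waveforms.sqlite")
for n_trigger in range(10):  # the real loop runs over thousands of triggers
    # if_exists="append" adds the new rows without touching what is on disk
    acquire_waveform(n_trigger).to_sql("waveforms", con,
                                       if_exists="append", index=False)
con.close()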

Once the data acquisition is finished I “parse” the waveforms. For this I have to read the waveforms file, probably in chunks as it is too big to fit in memory, and extract the features of each individual waveform to create measured_data_df, which for the moment I think is still small enough to handle with “plain Pandas”. But it could also be migrated to another format in the same way as the waveforms.
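
With SQLite the chunked reading I have in mind would look roughly like this (file, table, and chunk size are just examples):

import pandas as pd
import sqlite3

con = sqlite3.connect("waveforms.sqlite")

features = []
# chunksize turns read_sql into an iterator of DataFrames, so the full
# table never has to fit in memory. Chunks split on rows, so either pick
# chunksize as a multiple of the samples per waveform or handle waveforms
# that straddle two chunks.
for chunk in pd.read_sql("SELECT * FROM waveforms", con, chunksize=1_000_000):
    for (n_trigger, device), wf in chunk.groupby(["n_trigger", "device_name"]):
        features.append((n_trigger, device,
                         wf["Amplitude (V)"].abs().max()))  # one example "feature"
con.close()

measured_data_df = pd.DataFrame(
    features, columns=["n_trigger", "device_name", "Amplitude (V)"])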

Currently I am using the feather¹ format, with which I am very happy as it is very fast and lightweight. But it has the limitation that you cannot append new data or read the file back in chunks, and that is my limiting factor now.

¹ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html
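
For completeness, my current feather round trip is essentially the following (note that feather cannot store a pandas MultiIndex, so it has to be flattened into ordinary columns first):

import pandas as pd

waveforms_df = pd.DataFrame(
    {"Time (s)": [7.0000e-08, 7.0025e-08], "Amplitude (V)": [-0.2558, -0.2558]},
    index=pd.MultiIndex.from_tuples([(0, "HPK-4"), (0, "HPK-4")],
                                    names=["n_trigger", "device_name"]),
)

# The index levels become ordinary columns before writing...
waveforms_df.reset_index().to_feather("waveforms.feather")

# ...and reading always loads the whole file; there is no append/chunk mode.
waveforms_df = (pd.read_feather("waveforms.feather")
                  .set_index(["n_trigger", "device_name"]))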

Do you control the way your “raw data” (“waveforms”) files are stored or is it “predefined” by your “data acquisition” system (or the oscilloscope’s “firmware”)?
(In any case, for “backup” purposes, I would keep them in some easily readable, well-defined format, NOT ROOT.)

BTW. I guess the best idea would be to talk to your local colleagues who are familiar with ROOT. You could discuss with them how to “import” your “raw data” into ROOT.

Our experience confirms your concerns: things look nice initially, but then they don’t scale, and you need to re-invent everything or come up with workarounds that have to be maintained and make things far less efficient.

If you have structured data (:white_check_mark: - you do) and if the data won’t fit in memory (:white_check_mark: - if I understand correctly, at some point it won’t), then ROOT is what you want to use. If you like Pandas dataframes, consider starting directly with RDataFrame: that’s way simpler (and faster) than dealing with TTree yourself, and it totally scales to gigantic volumes of data.
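
Just to give a flavour, a minimal sketch (tree, file, and column names are made up to roughly match your schema):

import ROOT

# All of this is lazy: the tree is streamed through in a single pass,
# never fully loaded into memory, however large the file is.
df = ROOT.RDataFrame("waveforms", "waveforms.root")

n_large = df.Filter("amplitude < -0.5").Count()  # expressions are JIT-compiled C++
h = df.Define("abs_amp", "std::abs(amplitude)").Histo1D("abs_amp")

print(n_large.GetValue())  # triggers the single event loop that fills both results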