Thanks. I am probably dealing with an almost “prototype use case”, but here I provide a more detailed description. At the moment I have two different types of schemas (hope I properly understood the meaning of the term), one for the raw waveforms I measure with the oscilloscope out of the detectors, and then one for the “parsed waveforms” after I extract the features of each. This is how they look like for one example measurement (that still fits in memory):
Time (s) Amplitude (V)
0 HPK-4 7.000000e-08 -0.2558
HPK-4 7.002500e-08 -0.2558
HPK-4 7.005000e-08 -0.2558
HPK-4 7.007500e-08 -0.2558
HPK-4 7.010000e-08 -0.2558
... ... ...
2221 Photonis PMT 1.699250e-07 -0.2530
Photonis PMT 1.699500e-07 -0.2530
Photonis PMT 1.699750e-07 -0.2530
Photonis PMT 1.700000e-07 -0.2530
Photonis PMT 1.700250e-07 -0.2530
[17784888 rows x 2 columns]
So then e.g.
waveforms_df.loc[(0,'HPK-4')] returns a single waveform,
Time (s) and
Amplitude (V) you can plot.
After “parsing” the previous data frame (this means iterate over each individual waveform) I produce a structure that looks like this (
Amplitude (V) Bias current (A) Bias voltage (V) ... t_70 (s) t_80 (s) t_90 (s)
n_trigger device_name ...
0 HPK-4 0.085514 0.000000 77.0 ... 1.227209e-07 1.228223e-07 1.229736e-07
Photonis PMT 0.071317 0.000095 2700.0 ... 1.198947e-07 1.199238e-07 1.199613e-07
1 HPK-4 0.077032 0.000000 77.0 ... 1.228937e-07 1.230124e-07 1.231312e-07
Photonis PMT 0.062531 0.000095 2700.0 ... 1.198288e-07 1.198567e-07 1.198942e-07
2 HPK-4 0.007026 0.000000 77.0 ... 1.343812e-07 1.343875e-07 1.343937e-07
... ... ... ... ... ... ... ...
2219 Photonis PMT 0.157963 0.000095 2700.0 ... 1.200711e-07 1.201036e-07 1.201363e-07
2220 HPK-4 0.006759 0.000000 77.0 ... 9.925689e-08 9.926293e-08 9.926896e-08
Photonis PMT 0.065698 0.000095 2700.0 ... 1.198620e-07 1.198913e-07 1.199207e-07
2221 HPK-4 0.006115 0.000000 77.0 ... 7.533362e-08 7.533908e-08 7.534454e-08
Photonis PMT 0.068961 0.000095 2700.0 ... 1.198718e-07 1.199009e-07 1.199317e-07
[4444 rows x 30 columns]
Amplitude (V) here is a single number representing the amplitude of the signal, it is not anymore the one in the
waveforms_df that represents the amplitude vs time.)
waveforms_df dataframe is created on the fly while the measurement is ongoing and is the more demanding in term of resources, and the one I am more interested in porting either to ROOT or SQL. For this one I would like to regularly store the waveforms in the disk appending the new data to a file.
Once the data acquisition is finished I “parse” the waveforms. For this I have to read the waveforms file, probably in bunches as it is too big to fit in memory, and individually for each waveform extract the features and create the
measured_data_df which for the moment I think I still have enough room to operate with “plain Pandas”. But could also be migrated to other format in the same way as the waveforms.
Currently I am using the feather¹ format with which I am very happy as it is very fast and lightweight. But has the issue you cannot append new data and also cannot read by bunches, which is my limiting factor now.
¹ New users cannot post links, so remove the spaces or just search for pandas feather: https: //pandas. pydata. org /docs/reference/api/pandas.DataFrame.to_feather.html