Best way to store digitizer (or oscilloscope) data for analysis

Massimo_Girola · December 14, 2020, 6:44pm

Hi rooters,

I’m working with a digitizer (or oscilloscope) and I have a python class to read the collected data.
Does it make sense to store the data in a TTree or a TChain for later analysis?
Note that my data consist of “events”, each of which contains many points (from a few thousand to hundreds of thousands) of the form (x, y, ex, ey), where x represents time and y represents voltage. A single data file is therefore quite heavy, because it contains many events and each event represents a full waveform.
I wonder how to use TTrees or TChains to store this kind of data, and whether this approach makes sense. Ideally I would like to end up with a TTree where each entry contains a full waveform, so that later I can process the tree event by event to evaluate quantities from the waveform (e.g. fitting TGraphErrors, plotting numerical derivatives, locating peaks and so on).
Maybe a workaround would be to try to create a TChain where each TTree stores a single event, but I have never worked with these objects so I’m not sure (something similar to this post).
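
As a rough illustration of the layout described above (one entry per event, each entry holding a full waveform), here is a minimal plain-Python sketch. It does not use ROOT; the helper names make_event, derivative and peak_index and the sample values are made up for illustration.

```python
# Toy model of the target layout: one entry per event, each holding a full waveform.
# All names here (make_event, derivative, peak_index) are hypothetical helpers.

def make_event(event_id, samples, sampfreq):
    """Build one entry: the time axis is derived from the sample index."""
    times = [i / float(sampfreq) for i in range(len(samples))]
    return {"id": event_id, "t": times, "v": list(samples)}

def derivative(t, v):
    """Forward-difference derivative dv/dt; has len(v) - 1 points."""
    return [(v[i + 1] - v[i]) / (t[i + 1] - t[i]) for i in range(len(v) - 1)]

def peak_index(v):
    """Index of the largest sample (the first one in case of ties)."""
    return max(range(len(v)), key=lambda i: v[i])

events = [
    make_event(0, [0.0, 1.0, 4.0, 1.0, 0.0], sampfreq=2.0),
    make_event(1, [0.0, 0.5, 0.25], sampfreq=2.0),
]

# Process the dataset event by event, as one would loop over tree entries.
peaks = []
for evt in events:
    k = peak_index(evt["v"])
    peaks.append((evt["id"], evt["t"][k], evt["v"][k]))
```

Each element of peaks is (event id, time of the peak, peak value), so the per-event analysis only ever needs one waveform at a time.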

I would appreciate any help. Thanks in advance.

ROOT Version: 6.18/00
Platform: Ubuntu 20.10
PyROOT python 2.7 or c++

etejedor · December 15, 2020, 9:43am

Hello,

Instead of TTree/TChain, I’d recommend you do this with RDataFrame, which is a higher-level API.

Initially, you would create a new dataset from your collected data, a bit like it’s shown here:

So you would define how many events you want in your dataset, the columns of your dataset and then store that (snapshot) in a ROOT file. Note that you can use that ROOT file to create an RDataFrame too for later analysis (in the docs you can find everything you can do with RDataFrame).

Alternatively, in case you are currently storing the data you collect in a CSV file, you can load it directly from RDataFrame with the CSV data source:

One of the advantages of using RDataFrame is that you get parallelization on your local cores for free.

Please let us know if this model suits your use case and whether we can help you with anything else.

Enric

Massimo_Girola · December 15, 2020, 10:57am

Hello,
I went through the docs you linked and an RDataFrame-type dataset seems suitable for my purposes. It has great functionality, like the filter transformation, and it would be amazing to get parallelization for free.
The only problem is that I am still confused about how to store the data. Each of my “events” consists of many points representing a full waveform, so let’s say that for each “event” I have a two-column file from which I can create an RDataFrame-type dataset. Should I then create one of these datasets for each “event”? Basically I have the same doubts I had when I tried with TTrees: since each “event” contains multiple data points, I was confused about how to associate a full waveform, rather than a simple numeric type, with each entry. May I ask for suggestions on how to handle this kind of data structure with RDataFrame? Maybe I am just missing some functionality.

Anyway thank you a lot for telling me about RDataFrames, even if I change strategy they will be very helpful and definitely more appropriate than TTrees.

etejedor · December 15, 2020, 11:30am

Hi,

It seems then that your dataset will have columns of collection type that, for each event, store the waveform points. You can’t use the CSV data source for that; you would need to define each column to be a collection of elements. The function call that you provide in the Define should return that collection. We have a tutorial which shows how to do that with ROOT’s RVec class, which stores the collection.

Calling @eguiraud in case he wants to comment.

Massimo_Girola · December 15, 2020, 2:27pm

Ok, this seems to be exactly what I want.
Now the problem is that I’m struggling to create the columns and add my data.
The example uses this code to generate the data:

coordDefineCode = '''ROOT::VecOps::RVec<double> {0}(len);
                     std::transform({0}.begin(), {0}.end(), {0}.begin(), [](double){{return gRandom->Uniform(-1.0, 1.0);}});
                     return {0};'''
d = df.Define("len", "gRandom->Uniform(0, 16)")\
      .Define("x", coordDefineCode.format("x"))\
      .Define("y", coordDefineCode.format("y"))

I do not understand this part; in particular, I don’t understand what coordDefineCode is and what it does.
In my case I have an XML data file and I can retrieve the data with an intricate Python class. I can’t understand how to pass this data to the Define method.
I have tried something like this, but it does not work:

    def generate_dataset(self, outdir, digichannel, sampfreq):
        Nevts = 0
        evts = []
        while True:
            evt = self.get()  # get the event info
            if evt is None:
                break
            wvf = evt.get("channels")[digichannel]  # wvf is a list that contains the waveform
            evtid = evt.get("id")                   # get the event id
            digivolt = ROOT.VecOps.RVec("double")(len(wvf))
            digitime = ROOT.VecOps.RVec("double")(len(wvf))
            for i, sample in enumerate(wvf):
                digivolt[i] = sample
                digitime[i] = i/sampfreq
            evts.append((digivolt, digitime))
            Nevts += 1
        df = ROOT.RDataFrame(Nevts)
        d = df.Define("x", evts[0][0])\
              .Define("y", evts[0][1])

        fileName = outdir
        treeName = "myTree"
        d.Snapshot(treeName, fileName)

I think I’m missing the logic behind it: how do I write an expression to pass as the second argument of Define so that it retrieves the data from my code?
Thanks again for all your help.

etejedor · December 16, 2020, 10:04am

Hello,

The second argument that you provide to Define is a string with a C++ expression in it (even if you run this from Python). In the VecOps example, coordDefineCode contains some C++ code that will generate an RVec for an entry of the resulting dataset. The key point here is precisely that: Define will call that C++ code for each entry and store the generated value.

In practice, this means that for your case you need to provide some C++ code to Define that returns the value for a row of the column, i.e. a collection of elements representing the points, which could be an RVec. That code could be a function call that reads the necessary data from an xml file and puts it in an RVec.
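
To make the "called once per entry" idea concrete, here is a toy model in plain Python. It is only a sketch of the semantics and is not ROOT's API: the ToyFrame class, its define and rows methods, and the sample data are all hypothetical. The point it tries to show is that a column is given as a rule that produces a value for each entry, which is presumably why handing over one already-built vector (such as evts[0][0]) does not fit.

```python
# Toy stand-in for the per-entry semantics of Define; this is NOT ROOT's API.
class ToyFrame(object):
    def __init__(self, n_entries):
        self.n_entries = n_entries
        self.columns = []  # list of (name, function of the entry number)

    def define(self, name, func):
        """Register a column; nothing is computed yet."""
        self.columns.append((name, func))
        return self

    def rows(self):
        """Evaluate every column function once per entry, in entry order."""
        for entry in range(self.n_entries):
            yield dict((name, func(entry)) for name, func in self.columns)

# Hypothetical source data: one waveform (list of samples) per event.
waveforms = [[0.1, 0.2, 0.3], [1.0, 2.0]]
sampfreq = 10.0

frame = (ToyFrame(len(waveforms))
         .define("y", lambda entry: waveforms[entry])
         .define("x", lambda entry: [i / sampfreq for i in range(len(waveforms[entry]))]))

table = list(frame.rows())  # the column functions run here, once per entry
```

In the real RDataFrame the rule would be the C++ expression (or callable) passed to Define, and the value it returns for an entry would be the RVec holding that event's waveform.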

@eguiraud you see any option that would be more straightforward?

Massimo_Girola · December 18, 2020, 4:25pm

Hello again,
I tried a lot of different things since your last post but I am still not succeeding.
The problem is that the XML format I have is quite strange and it’s difficult to read it from C++.
It would be great to do what you said by passing the variables I have in Python to the C++ code.
I found this post, which explains a way to do this, but I can’t find a way to return an RVec or an std::vector<double> instead of a simple double. I’ve also tried something similar to this other post, but I got nowhere.
I would appreciate any help with this. In any case, I think I have another way out: I could just have Python write everything to a .txt file and then read it with C++. Still, I would prefer not to do this, because it would double the disk space occupied by my data and it would be more computationally expensive.
Thanks anyway for all your help.

Massimo_Girola · December 19, 2020, 7:57pm

Just for the record, I ended up copying the data into a txt file so that I could read them from C++:

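#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <string>
#include <utility>

using Waveforms = ROOT::VecOps::RVec<ROOT::VecOps::RVec<double>>;

// read_pyoutput (not shown) parses the txt file and returns {times, ADC values}, one RVec per event
int main()
{
    std::string fileName = "provach0.txt";
    std::pair<Waveforms, Waveforms> events = read_pyoutput(fileName);
    int i = 0, Nevts = events.first.size();
    ROOT::RDataFrame d(Nevts);
    // create the RDataFrame columns (the counter assumes entries are processed in order, single-threaded)
    auto d0 = d.Define("event",       [&]()      { return i++; })
               .Define("t",           [&](int j) { return events.first[j]; },  {"event"})
               .Define("ADC_channel", [&](int j) { return events.second[j]; }, {"event"});
    // the first Snapshot argument is the tree name, here the same string as the input file name
    d0.Snapshot(fileName.c_str(), "provach0.root", {"event", "t", "ADC_channel"});
    return 0;
}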

Even if it is computationally heavy to copy and re-read all the data, it works.
I just wonder how the TTrees in the output files are created and why some of them have multiple branches. Anyway, I am pretty happy with this result and I think I’m ready to start the analysis.
Thanks again for all your help.

etejedor · January 4, 2021, 10:24am

Hello,

Many thanks for your efforts in turning your analysis into an RDataFrame shape!

It’s a pity that you have to do this extra conversion first. Also, the current approach needs all your data to fit in memory. Ideally, the functions that define your columns should read the column entries one at a time from a file to remove this restriction (but again, you would need to read your XML from C++ and that seems to be tricky).
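
As a sketch of the "one entry at a time" idea, the plain-Python generator below reads a text file lazily, so only the current event is held in memory. It does not involve ROOT, and the file format (one event per line: an integer id followed by the samples) as well as the name iter_events are assumptions made for the example, not the format actually used in this thread.

```python
import os
import tempfile

def iter_events(path):
    """Yield (event_id, samples) one line at a time; only one event is in memory."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            yield int(fields[0]), [float(x) for x in fields[1:]]

# Write a tiny example file in the assumed format: "<event id> <sample> <sample> ...".
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "waveforms.txt")
with open(path, "w") as handle:
    handle.write("0 0.0 1.5 0.5\n")
    handle.write("1 2.0 2.5\n")

# Consume the events one by one; here we only keep a per-event summary.
maxima = [(event_id, max(samples)) for event_id, samples in iter_events(path)]

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmpdir)
```

The same pattern on the C++ side would mean that the function behind each column reads just the requested event from the file instead of indexing into a container that already holds everything.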

Regarding your question about TTrees, they store data in a column-based format to make it more efficient to read particular columns (note that column and branch here are the same). RDataFrame uses TTrees underneath, so the columns you define in RDataFrame end up being branches of a TTree.
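
A small plain-Python sketch of what "column-based" means in practice: the same values regrouped so that each column is stored together. This is only a conceptual picture, not how TTree is implemented; the column names are borrowed from the snippet earlier in the thread, and to_columns and the numbers are made up.

```python
# Row-wise layout: each entry keeps all of its values together.
rows = [
    {"event": 0, "t": [0.0, 0.1], "ADC_channel": [5.0, 7.0]},
    {"event": 1, "t": [0.0, 0.1, 0.2], "ADC_channel": [6.0, 9.0, 4.0]},
]

def to_columns(rows):
    """Regroup the same values column by column (one list per column name)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# An analysis that needs only one column touches only that list.
n_samples = [len(waveform) for waveform in columns["ADC_channel"]]
```

Here columns has one list per column name ("event", "t", "ADC_channel"), each with one element per entry, which mirrors the branches one sees in the output tree.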

Massimo_Girola · January 16, 2021, 3:43pm

Thanks a lot!
Yes, unfortunately it is too much work to read the XML file from C++.
I ended up splitting my analysis into multiple scripts (three C++ scripts, the first of which calls a Python script):

The first one is C++ code that calls the Python script, which copies the data from XML into a .txt file (the bottleneck of the analysis); it then reads that file and creates the RDataFrame, where each entry of the underlying TTree contains a full waveform (an RVec for each column).
The second one reads the RDataFrame with the waveforms and extracts simple quantities (each entry is now a single number for each column of the TTree).
The third one reads these simple quantities and performs the actual analysis (fitting histograms and so on).
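
For the second step, the reduction from one waveform per entry to one number per column could look roughly like the plain-Python sketch below. The chosen quantities (baseline, amplitude, peak time, area), the function name summarize and the sample values are assumptions for illustration, not the quantities actually computed in my scripts.

```python
def summarize(t, v, n_baseline=2):
    """Reduce one waveform to a few numbers (hypothetical choice of quantities)."""
    # Baseline: mean of the first n_baseline samples.
    baseline = sum(v[:n_baseline]) / float(n_baseline)
    # Index of the largest sample.
    k = max(range(len(v)), key=lambda i: v[i])
    # Trapezoidal integral of the baseline-subtracted signal.
    area = sum(0.5 * ((v[i] - baseline) + (v[i + 1] - baseline)) * (t[i + 1] - t[i])
               for i in range(len(v) - 1))
    return {"baseline": baseline, "amplitude": v[k] - baseline,
            "t_peak": t[k], "area": area}

waveform_entries = [
    {"t": [0.0, 1.0, 2.0, 3.0, 4.0], "v": [1.0, 1.0, 3.0, 5.0, 1.0]},
    {"t": [0.0, 1.0, 2.0, 3.0], "v": [2.0, 2.0, 2.0, 4.0]},
]

# One small record per event: this is what the third step would read.
summary = [summarize(e["t"], e["v"]) for e in waveform_entries]
```

Each record in summary holds only scalars, so the dataset passed to the third step is much smaller than the waveform dataset.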

It seems OK so far, but I am still confused about how to benefit from EnableImplicitMT() for the 2nd and 3rd steps of the analysis: with MT enabled it seems slower to me.
Thanks a lot for all your help. I can post the code if you think it could be useful for posterity.
I’ll open a new topic if I can’t figure out how to make the most of MT and of the RDataFrame approach in general. Thanks a lot!

P.S. it is great that RDataFrame functionalities are also present in the Windows version of ROOT, it is an amazing tool!

etejedor · January 18, 2021, 8:41am

Hello,

Glad to hear you like RDF! Yes please, open a new topic on the IMT discussion.

Cheers,
Enric
