Creating an RDataFrame with columns of type RVec by reading data from an external file

Hi,
I am trying to learn how to fill an RDataFrame from a text file, and I keep getting stumped.

My data are organized in huge text files (more than 2 GB) with 2 columns (in the future there could be 3 or 4). The data are waveforms collected with a digitizer: each of the 2 columns holds one full waveform (event) every N lines, so the total number of lines in the file equals the number of events times the number of samples in each event (NTotLines = nEvents*nSamplesInEachEvent).
What I want to do is read the data from the txt file and save them in an RDataFrame where the first column is the event number and the second (and third) column holds, in each entry, an RVec<int> of size nSamplesInEachEvent containing the full waveform for that event.

My first attempt used a function that first loaded all the events into a std::vector<RVec<int>> and then copied each element of the std::vector into the RDF. Unfortunately this approach uses a lot of memory: it was OK with smaller data files, but now that they have become bigger it is not feasible anymore.

So I tried writing a function to define the RDF columns by reading the file one event at a time (nSamplesInEachEvent lines in total) and returning a single RVec to store in that entry of the column. Unfortunately, without success.
I paste my code:

	ifstream inFile(txtOut);
    if( !inFile.is_open() )
		throw std::invalid_argument( "impossible to open file" );
	
	ROOT::VecOps::RVec<int> WVFofThisEvent; //declare RVec where to store data
	int aux = 0, voltage = 0;
	auto getch0WVF = [&](){//lambda function to get WVF, returns a WVF contained in RVec of length sampInOneEvent, the next time it is called it creates a new RVec with the next event
		for(unsigned int i = 0; i<sampInOneEvent; i++ )
		{
			if (nExistingChs == 1) //nExistingChs is the number of columns in the original txt file
				inFile>>voltage;
			else if (nExistingChs == 2)
				inFile>>voltage>>aux;
			else
				throw invalid_argument( "nExistingChs is not in the possible values" );
			//cout<<voltage<<aux<<endl;
			WVFofThisEvent.push_back(voltage);
		}
		//cout<<WVFofThisEvent<<endl;
		return WVFofThisEvent;
	};

	auto trivial_DF_fill = [](){ //lambda function to fill a column with the number of the entry
		static ULong64_t i = 0;
    	return i++;
   	};
   	
	ROOT::RDataFrame d(nEvents); //create RDataFrame with right number of events
	std::cout<<"Starting to fill the DF by reading data in the txt file..."<<endl;
	auto d1 =  d.Define("event",   trivial_DF_fill )
	            .Define("ch0"  ,   getch0WVF );
	inFile.close();

    d1.Snapshot( "WVF", rootOut.c_str(), {"event","ch0"} );//save dataframe in external file

If I uncomment the first printout line, it turns out that the values stored in the variables aux and voltage are 0.

Does someone have suggestions on how to do this?
I would appreciate any help.
Thanks in advance.

ROOT Version: 6.18/00
Platform: Ubuntu 20.10
Compiler: GCC


Hi,
as far as I can tell from your report, inFile>>voltage>>aux; is not doing its job (i.e. this should be unrelated to RDataFrame). If you just execute

    ifstream inFile(txtOut);
    if( !inFile.is_open() )
		throw std::invalid_argument( "impossible to open file" );
   	int aux = 0, voltage = 0;
    inFile>>voltage>>aux;
    cout<<voltage<<aux<<endl;

does that print correct voltage and aux values?

Note that you are also missing a WVFofThisEvent.clear() at the beginning of getch0WVF, otherwise each event will include all the previous ones.
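The effect of the missing clear() can be sketched in plain C++, without ROOT. This is only an illustration under assumptions: std::vector<int> stands in for RVec<int>, a std::istringstream stands in for the text file, and readEvent and sizeAfterTwoEvents are hypothetical helper names that do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads up to `samplesPerEvent` integers from `in` into `buffer`
// and returns the resulting buffer size.
// With clearFirst == false it mimics the original lambda, which only appended.
std::size_t readEvent(std::istream& in, std::vector<int>& buffer,
                      std::size_t samplesPerEvent, bool clearFirst)
{
    assert(samplesPerEvent > 0); // an event must contain at least one sample
    if (clearFirst)
        buffer.clear(); // drop the samples of the previous event
    int voltage = 0;
    for (std::size_t i = 0; i < samplesPerEvent && (in >> voltage); ++i)
        buffer.push_back(voltage);
    return buffer.size();
}

// Size of the shared buffer after reading two 3-sample events back to back.
std::size_t sizeAfterTwoEvents(bool clearFirst)
{
    std::istringstream in("1 2 3 4 5 6");
    std::vector<int> buffer; // shared across calls, like WVFofThisEvent
    readEvent(in, buffer, 3, clearFirst);
    return readEvent(in, buffer, 3, clearFirst);
}
```

Without the clear, the second "event" holds all 6 samples; with it, only its own 3.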

Cheers,
Enrico

Yes, after some other attempts I concluded that for some reason the lambda function is not able to update the values of voltage and aux.
In fact, if I do this:

   	int aux = 0, voltage = 0;
    inFile>>voltage>>aux;
    cout<<voltage<<aux<<endl;

	ROOT::VecOps::RVec<int> WVFofThisEvent; //declare RVec where to store data
	//int aux = 0, voltage = 0;
	auto getch0WVF = [&](){//lambda function to get WVF, returns a WVF contained in RVec of length sampInOneEvent, the next time it is called it creates a new RVec with the next event
		WVFofThisEvent.clear();
		for(unsigned int i = 0; i<sampInOneEvent; i++ )
		{
			if (nExistingChs == 1)
				inFile>>voltage;
			else if (nExistingChs == 2)
			{
				inFile>>voltage>>aux;
				cout<<voltage<<aux<<endl;
			}
			else
				throw runtime_error( "nExistingChs is not in the possible values" );
			
			WVFofThisEvent.push_back(voltage);
		}
		//cout<<WVFofThisEvent<<endl;
		return WVFofThisEvent;
	};

the first cout works and it prints the first two values in the txt file. But then the second cout inside the lambda function just prints the same two values again and again, without reading the other ones.

Also I have tried to read the whole file with:

void read_file(string txtFile)
{
	int voltage=0, aux=0;
	ifstream file(txtFile.c_str());
	while( file.eof() == false )
	{
		file >> voltage >> aux;
		cout << voltage << aux << endl;
	}
	file.close();
}

and it prints the correct values.
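A side remark on the loop above: testing eof() before extracting usually runs the body one extra time when the file ends with a newline, so the last pair of values is printed twice. A minimal sketch of the difference, assuming a std::istringstream in place of the file; the two function names are hypothetical:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Loop shaped like read_file: tests eof() before extracting.
int iterationsWithEofTest(const std::string& text)
{
    std::istringstream in(text);
    int voltage = 0, aux = 0, n = 0;
    while (!in.eof()) {
        in >> voltage >> aux; // on the last pass this fails and leaves the old values
        assert(in.eof() || !in.fail()); // malformed input would otherwise loop forever
        ++n;
    }
    return n;
}

// Loop that uses the extraction itself as the condition.
int iterationsWithExtractionTest(const std::string& text)
{
    std::istringstream in(text);
    int voltage = 0, aux = 0, n = 0;
    while (in >> voltage >> aux) // the body runs only after a successful read
        ++n;
    return n;
}
```

For a two-line input ending in a newline, the first loop runs three times and the second runs twice.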

Running that logic in the lambda should be exactly the same as running it outside (I assume you are not using ROOT::EnableImplicitMT(), because that code is not thread-safe and cannot be run in a multi-threaded event loop).

Ah, wait! You are calling inFile.close() too soon. RDataFrame is lazy: it doesn't actually compute the Defines until you call Snapshot!
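The same pitfall can be reproduced without ROOT. The following is a plain C++ analogy and not RDataFrame code: a std::function stands in for the Define'd lambda, calling it stands in for Snapshot triggering the event loop, and the scratch file name lazy_demo.txt and the function name deferredRead are made up for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>

// Returns the value produced by a deferred lambda that reads from a file.
// If closeBeforeRun is true, the stream is closed after the lambda is defined
// but before it runs, as in the original code.
int deferredRead(bool closeBeforeRun)
{
    const std::string path = "lazy_demo.txt"; // hypothetical scratch file
    {
        std::ofstream out(path);
        out << "42 7\n";
    }

    std::ifstream inFile(path);
    assert(inFile.is_open()); // the scratch file must be readable
    int voltage = 0;

    // Defining the lambda reads nothing; it only captures inFile by reference.
    std::function<int()> deferred = [&]() {
        inFile >> voltage; // fails on a closed stream and leaves voltage untouched
        return voltage;
    };

    if (closeBeforeRun)
        inFile.close();

    const int result = deferred(); // the read happens only here

    if (inFile.is_open())
        inFile.close();
    std::remove(path.c_str()); // clean up the scratch file
    return result;
}
```

With the early close, the deferred read returns the initial 0, which matches the zeros seen in the first post; keeping the stream open until after the call gives 42.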

Wonderful, that was the problem! Thanks a lot, now it works fine.
Here is my final code:

	ifstream inFile(txtIn); //open input file
	ROOT::VecOps::RVec<int> WVFofThisEvent; //declare RVec where to store data
	int aux = 0, voltage = 0; //aux is just an auxiliary variable
	auto getch0WVF = [&](){//lambda function to get WVF, returns a WVF contained in RVec of length sampInOneEvent, the next time it is called it creates a new RVec with the next event
		WVFofThisEvent.clear();
		for(unsigned int i = 0; i<sampInOneEvent; i++ )
		{
			if( !inFile.is_open() )
				throw std::runtime_error( "impossible to open file" );
			if (nExistingChs == 1)
				inFile>>voltage;
			else if (nExistingChs == 2)
				inFile>>voltage>>aux;
			else if (nExistingChs == 3)
				inFile>>voltage>>aux>>aux;
			else if (nExistingChs == 4)
				inFile>>voltage>>aux>>aux>>aux;
			else
				throw runtime_error( "nExistingChs is not in the possible values" );
			
			WVFofThisEvent.push_back(voltage);

		}
		if( inFile.eof() )
			throw std::runtime_error( "reached end of file while reading data" );
		return WVFofThisEvent;
	};
	
	auto trivial_DF_fill = [](){ //lambda function to fill a column with the number of the entry
		static ULong64_t i = 0;
    	return i++;
   	};

   	ROOT::RDataFrame d(nEvents); //create RDataFrame with right number of events
	std::cout<<"Preparing RDF columns: it is a lazy action and it will be performed only when explicitly requested (e.g. by Snapshot)"<<endl;
	auto d1 =  d.Define("event",   trivial_DF_fill )
				.Define("ch0",     getch0WVF       );

	std::cout<<"Saving a Snapshot of the DF on disk...\n"<<endl;
    d1.Snapshot( "WVF", rootOut.c_str(), {"event","ch0"} );//save dataframe in external file
	inFile>>aux;
	if( inFile.is_open() and !inFile.eof() )
		std::cout<<"\n!!!!!!!!!WARNING: you probably haven't read all the data!!!!!!!!!!!!\n"<<endl;

Just a side note: I tried to plot ch0 vs event and I get some strange errors when it tries to create the TGraph. I suspect it has to do with the fact that the first column is an integer while the second is of container type. The error I get is: call to member function 'Exec' is ambiguous.

	auto c1 = new TCanvas("mygraph","mygraph",200,10,700,500);
	auto myGraph1 = d1.Graph("event", "ch0");
	myGraph1->Draw("AP");
	c1->SaveAs((outDir+"myGraph.pdf").c_str());

The strangest thing is that if I open the file in a TBrowser and I use the Tree Viewer to produce the same plot it works just fine and it plots exactly what I expected (not that I really needed that plot, it was just a sanity check).
Thanks again for your help.

Glad that’s solved. If you have time, please do open a GitHub issue about the TGraph problem at https://github.com/root-project/root/issues.

Cheers,
Enrico
