How to read an NDJSON file, create a dataset (TTree) and store it in a ROOT file (TFile)?



ROOT Version: 6.30.08
Platform: Rocky Linux 8
Compiler: g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)


Hello! I want to read a file that consists of newline-delimited JSON rows.

Sample rows:
["foo","bar",10,20,30]

So, each row is an array that has a few string elements, and the rest of them are uint64. My intention is to write a macro that creates a tree with branches of type string and uint64, then reads the file row by row (efficiently), parses each row, and stores the values in the respective branches. Finally, it should save the tree in a ROOT file. I am new to C++, so could you provide me with some code to work with?

Thnx

Dear dimitris,

Thanks for the post and welcome to the ROOT Community!

This is a pretty specific problem. Given the formatting of your lines (they seem to be lists), I would lean towards using PyROOT; however, this is your call.

Let us know how this goes.

Cheers,
D

Hi,

Since the input files are big, around 4-5 GB, I would prefer streaming the rows instead of reading the whole file into memory, and doing it in C++ instead of Python for better performance.

Could you point me to a tutorial that explains how to read and parse JSON rows? I have seen several tutorials for reading rows that contain only integers/floats, but nothing for strings. How could I create a branch for a string, i.e. an arbitrary number of chars?

You could preprocess the file outside ROOT first: in the simplest case you could remove the brackets and the quotes from all lines (with a simple text editor, shell commands, or a script) and save the result as a new file, then use TTree::ReadFile() to read it into a tree.
For example, suppose that after removing brackets and quotes, "a.txt" contains these 2 columns:

dan d,10
will w,15
zak z,20

Then you can do:

{
  TTree *T = new TTree("T","My tree");
  T->ReadFile("a.txt","n1/C:x/F",',');  // add branch descriptions (names & types) for each column
  T->Scan("*");
}

to get:

************************************
*    Row   *     n1.n1 *       x.x *
************************************
*        0 *     dan d *        10 *
*        1 *    will w *        15 *
*        2 *     zak z *        20 *
************************************

Note that I used a comma as the field separator (since you have that in your example); but if the text fields contain commas, you'll have extra work to deal with that. As Danilo said, it all depends on your specific case. In a more general case, just google how to read/parse files in C++ and then use that in ROOT.
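
For the more general case, here is a minimal sketch in plain C++ (no ROOT calls) of parsing one row of the form shown in the question. The names Row and parseRow are made up for illustration, and the parser is deliberately simplified: it only accepts a flat array of double-quoted strings and unsigned integers, and it keeps escaped characters verbatim instead of decoding them. Because quotes are tracked, commas inside text fields do not break it.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container for one parsed row: text fields and integer fields.
struct Row {
  std::vector<std::string> strings;
  std::vector<std::uint64_t> numbers;
};

// Parse a line such as ["foo","bar",10,20,30].
// Returns false if the line is not a flat array of quoted strings and
// unsigned integers. Simplification: a backslash just keeps the next
// character as-is (enough for \" and \\, not for \n or \uXXXX).
bool parseRow(const std::string &line, Row &row) {
  row.strings.clear();
  row.numbers.clear();
  std::size_t i = line.find('[');
  if (i == std::string::npos) return false;
  ++i;
  while (i < line.size()) {
    const char c = line[i];
    if (c == ' ' || c == '\t' || c == ',') { ++i; continue; }
    if (c == ']') return true;  // end of the array
    if (c == '"') {
      std::string s;
      ++i;  // skip the opening quote
      while (i < line.size() && line[i] != '"') {
        if (line[i] == '\\' && i + 1 < line.size()) ++i;  // skip the backslash
        s += line[i++];
      }
      if (i >= line.size()) return false;  // unterminated string
      ++i;  // skip the closing quote
      row.strings.push_back(s);
    } else if (c >= '0' && c <= '9') {
      std::uint64_t v = 0;  // note: no overflow check in this sketch
      while (i < line.size() && line[i] >= '0' && line[i] <= '9') {
        v = v * 10 + static_cast<std::uint64_t>(line[i] - '0');
        ++i;
      }
      row.numbers.push_back(v);
    } else {
      return false;  // unexpected character
    }
  }
  return false;  // no closing bracket found
}
```

To stream a large file, one would typically open it with std::ifstream and call parseRow on each line returned by std::getline, so that only one row is in memory at a time. In a ROOT macro, the parsed values could then be copied into variables bound to branches before each TTree::Fill(). As far as I know, a std::string can be attached with tree.Branch("s0", &myString), and a ULong64_t with the leaf descriptor "n0/l", but please check the TTree documentation for your ROOT version (the branch and variable names here are placeholders).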

Thanks for your response; this is a hack, but it would work. Is there a way to ignore some columns (for example the first 2) and apply some function to a column during the read, e.g. remove the "]" bracket from the last column?
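
One possible way to do this, if the lines are read manually instead of through TTree::ReadFile(), is to clean each line in C++ before using its fields. The following is only a sketch under assumptions: cleanFields is a made-up helper name, and the split is naive, so it assumes that the text fields contain no commas.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a raw line on commas, strip brackets and double
// quotes from every field, and drop the first `skip` fields.
// Naive on purpose: a comma inside a quoted text field would be split too.
std::vector<std::string> cleanFields(const std::string &line, std::size_t skip) {
  std::vector<std::string> out;
  std::string field;
  std::size_t index = 0;  // index of the field currently being built

  // Store the current field (unless it is one of the skipped ones) and reset.
  auto flush = [&]() {
    if (index >= skip) out.push_back(field);
    field.clear();
    ++index;
  };

  for (char c : line) {
    if (c == ',') {
      flush();
    } else if (c != '[' && c != ']' && c != '"' && c != '\r') {
      field += c;  // keep everything except brackets, quotes and CR
    }
  }
  flush();  // the last field has no trailing comma
  return out;
}
```

For the sample row in the first post, cleanFields(line, 2) would return the three fields "10", "20" and "30", with the closing bracket already removed. Each field could then be converted with std::stoull before being stored in a branch variable.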