Help with code to read delimited files into ROOT

Hi All,
I am writing some code to read an arbitrarily delimited data file into ROOT, and I ran into some issues. Maybe someone here can easily identify what the problem is?

First, the goals of this code:
(1) It will read an arbitrarily delimited file into ROOT (including missing values in the file, e.g., two consecutive delimiters with nothing or whitespace between them).
(2) By reading the first few lines of the file, it will guess the type (numeric or character) of each column of data.
(3) Based on the information above, it will book an appropriate tree and fill it (missing numeric values will be coded to a user-defined value, e.g., -99).
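Goal (2) can be sketched in plain C++ without any ROOT dependency. This is only a minimal illustration, not the code in ReadData.C: the helper names `IsBlank`, `IsNumeric` and `GuessColumnTypes` are hypothetical, and the rule "a column is character if any non-blank sampled cell fails to parse as a number" is an assumption about what "guessing" should mean.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: true if the token is empty or contains only whitespace.
bool IsBlank(const std::string& token) {
    return token.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Hypothetical helper: true if the whole token parses as a floating-point number.
// Note: strtod also accepts forms such as "1e5", "inf", "nan" and hex floats.
bool IsNumeric(const std::string& token) {
    if (IsBlank(token)) return false;
    char* end = nullptr;
    std::strtod(token.c_str(), &end);
    if (end == token.c_str()) return false;  // nothing could be parsed
    // Only trailing whitespace may follow the number.
    return std::string(end).find_first_not_of(" \t\r\n") == std::string::npos;
}

// Guess one type code per column from a few sample rows:
// 'F' for numeric, 'C' for character. Blank cells count as missing
// and do not vote, so an all-blank column defaults to 'F'.
std::vector<char> GuessColumnTypes(
    const std::vector<std::vector<std::string>>& rows, std::size_t ncols) {
    std::vector<char> types(ncols, 'F');
    for (const auto& row : rows) {
        for (std::size_t i = 0; i < ncols && i < row.size(); ++i) {
            if (!IsBlank(row[i]) && !IsNumeric(row[i])) types[i] = 'C';
        }
    }
    return types;
}
```

The resulting type codes could then drive the choice between a "/F" and a "/C" branch when booking the tree.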

By the way, interfaces for reading arbitrarily formatted data exist in most statistical packages. TTree::ReadFile() can only read very strictly formatted files, and a lot of raw data does not come in such a nice format. So I think ROOT's inability to easily read arbitrarily formatted data can deter some people from using ROOT and the power it provides. This is a humble attempt to fill this gap.

Now, let me describe the problem.

The attached file “ReadData.C” is the code I wrote. It reads the attached file “SampleData.txt”, and the resulting ROOT file is the attached “SampleData.root”.

I see two problems at the moment that I cannot figure out:

(1) While Var2 is correctly identified as Character, in the filled tree, the values don’t show up. I am not sure why. The numeric values seem to appear in the right format in the resulting tree.
(2) If you look carefully at line 2 (first line of data, after the header) of SampleData.txt, you will see that the value of Var3 is missing. There are two commas next to each other, with nothing in between. I thought the way I wrote the code, this should result in a value = -99 (for missing value). However, in the resulting tree, no such value is seen for Var3. So I think there is something wrong with the regular expression I am using to parse the data, or something else.
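For problem (2), one possible cause (an assumption, since it depends on the exact pattern used) is that a token pattern requiring at least one character can never match the empty field between two adjacent delimiters. A minimal sketch that avoids regular expressions entirely is shown here; `SplitKeepEmpty` and `ToFloatOrMissing` are hypothetical names, not functions from ReadData.C or from ROOT.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Split a line on a single-character delimiter, keeping empty fields,
// so "a,,b" yields three tokens: "a", "" and "b".
std::vector<std::string> SplitKeepEmpty(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field, may be empty
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Convert a token to float, mapping empty or whitespace-only tokens
// to a user-chosen "missing" value (-99 here, as in the post).
float ToFloatOrMissing(const std::string& token, float missing = -99.0f) {
    if (token.find_first_not_of(" \t\r\n") == std::string::npos) return missing;
    return static_cast<float>(std::atof(token.c_str()));
}
```

With this approach the number of tokens always equals the number of delimiters plus one, so the column index never drifts when a value is missing.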

Any ideas? If this works out, we can clean this up, add some more functionality to it, and hopefully incorporate it into ROOT someday.

Many thanks in advance.
-Arun
SampleData.root (5.65 KB)
SampleData.txt (490 Bytes)
ReadData.C (5.93 KB)

I suggest compiling your code with ACLiC, i.e.

root > .L ReadData.C+
root > ReadData();

You have to move the function ReadData to the end of the file and add a few more includes.
The main request we have had so far is to read a selected subset of the N columns. It would be nice if you could consider this when improving your proposed script :)
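As a rough idea of how column selection could fit in once a line has been split into tokens, here is a minimal sketch. `SelectColumns` is a hypothetical helper, and the choice of 0-based indices and of returning an empty string for an out-of-range index are assumptions, not part of the proposed script.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keep only the requested columns (0-based indices) from one already-split row.
// An index beyond the end of the row yields an empty string, which the
// caller can treat as a missing value.
std::vector<std::string> SelectColumns(const std::vector<std::string>& row,
                                       const std::vector<std::size_t>& wanted) {
    std::vector<std::string> out;
    out.reserve(wanted.size());
    for (std::size_t idx : wanted) {
        out.push_back(idx < row.size() ? row[idx] : std::string());
    }
    return out;
}
```

Only the selected columns would then need a branch in the tree, with the same index list applied to the header line and to every data line.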

Rene

Hi Rene,
Thanks for the hints. And I agree it is a good idea to add the capability to read N selected columns. However, I need to get over one basic hurdle first.

I made all the changes you suggested, and the script now compiles with ACLiC. But I still get empty entries for the character field (Var2 in the attached tree), and I cannot figure out why. The numeric fields appear to be OK.

I am attaching the complete updated code and the resulting ROOT file (the data file used was the same as in the previous post, SampleData.txt). I am also reproducing below what I think are the relevant parts of the code for this issue. Maybe something jumps out?

Thanks.
-Arun

/***** Part of code booking the tree *******/
// Book a tree
TTree *tree = new TTree("tree", " ");

// Vectors to keep the values of variables
unsigned int nvars = fVarNames.size();
std::vector<TString> fCharValues(nvars);
std::vector<Float_t> fFloatValues(nvars);

// Assign branch addresses
TString BranchDescription;

for (unsigned int i = 0; i < nvars; i++) {
  BranchDescription = fVarNames[i];
  if (toupper(fVarTypes[i]) == 'C') {
    BranchDescription += "/C";
    tree->Branch(fVarNames[i].Data(), (void *) fCharValues[i].Data(), BranchDescription.Data());
  } else {
    BranchDescription += "/F";
    tree->Branch(fVarNames[i], &fFloatValues[i], BranchDescription.Data());
  }
} // end of loop over nvalues to create branches

/****** Part of the code filling the tree**********/
while (1) {
  s1.ReadLine(infile);
  // cout << s1 << endl;
  Int_t index = 0;
  Int_t start = 0;
  while (1) {
    s2 = s1(rexp, start);
    s3 = s2.Strip(TString::kTrailing, fDelimiter);

    int len = s2.Length();
    if (len <= 0) break;
    start += len;

    if (toupper(fVarTypes[index]) == 'F') {
      if (s3.IsNull() || s3.IsWhitespace()) {
        fFloatValues[index] = -99.0;
      } else {
        fFloatValues[index] = s3.Atof();
      }
    } else {
      fCharValues[index] = s3;
    }
    index++;
  }
  if (!infile) break;
  tree->Fill();
  LineCount++;
}
SampleData.root (5.65 KB)
ReadData2.C (6.04 KB)

Trying to investigate your problem (on Linux), I see 2 other problems in your code:

- The compiler complains at the line

TString s = "([\(\)+-\\w\\s.:]+)";

with the message:

/home/brun/root/./ReadData2.C:49:15: warning: unknown escape sequence '\)'

- Your logic seems to be wrong when the first data line contains 2 consecutive commas. If I replace

1.99893,Hello,,3.56524,5.7818,5.96995

by

1.99893,Hello,1.11,3.56524,5.7818,5.96995

I get some improvements. Before going further, you should fix these 2 problems first.
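The escape-sequence warning arises because `\(` and `\)` are interpreted by the C++ compiler before the regex engine ever sees them. A sketch of one way around this, using a raw string literal and the standard `std::regex` rather than ROOT's regex classes (so this is a swapped-in engine, and the character set is only an assumed reading of what the original pattern was meant to accept):

```cpp
#include <regex>
#include <string>

// Assumed intent: one token made of word characters, whitespace,
// parentheses, '+', '.', ':' or '-'.
// The raw string literal R"re(...)re" passes backslashes through untouched,
// so the compiler has no escape sequences to complain about.
// Inside a character class, '(' and ')' need no escaping, and '-' is placed
// last so that it is a literal dash rather than a range operator.
// The equivalent ordinary literal would be "([()+\\w\\s.:-]+)".
const std::regex& TokenPattern() {
    static const std::regex pattern(R"re(([()+\w\s.:-]+))re");
    return pattern;
}

// True if the whole text is a single token under the assumed pattern.
bool IsToken(const std::string& text) {
    return std::regex_match(text, TokenPattern());
}
```

Note that in the pattern quoted in the warning, the `+-\\w` sequence places the dash between two class members, which some engines read as a range; moving the dash to the end avoids that ambiguity.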

Rene

Hi Rene,
Thanks. I have updated the code to now use TPMERegexp::Split() to read the tokens (was using TPRegexp class before). I also made changes to the regular expression used to get the tokens, and added a few more set/get methods etc.

The code now compiles without errors or warnings (Ubuntu 8.10 + ROOT 5.21.06). It also parses the data correctly now. For example, it correctly identifies two consecutive delimiters as a missing value. It also does not split on a delimiter that is inside a quoted string. For example, "Hello, how, are, you" will be identified as a single token, not four, in a CSV. I tested these things, and they work.
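For readers without the attachment, the quote-aware behaviour described above can be illustrated with a small hand-written state machine. This is a different technique from the TPMERegexp::Split() approach used in ReadData3.C; `SplitQuoted` is a hypothetical name, and the sketch assumes that quotes are simply dropped and that doubled quotes ("") are not used as an escape.

```cpp
#include <string>
#include <vector>

// Split a line on a delimiter, ignoring delimiters that appear inside quotes.
// Empty fields are kept, so two adjacent delimiters yield an empty token.
std::vector<std::string> SplitQuoted(const std::string& line, char delim,
                                     char quote = '"') {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;
    for (char c : line) {
        if (c == quote) {
            inQuotes = !inQuotes;       // toggle state; the quote itself is dropped
        } else if (c == delim && !inQuotes) {
            fields.push_back(current);  // may be empty: a missing value
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field on the line
    return fields;
}
```

Because the delimiter is a plain `char` argument, the same function covers commas, tabs, semicolons and similar single-character separators.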

But my earlier problem remains. The character variables are not being populated in the resulting tree. I believe there is a problem in either how I am calling the TTree::Branch() method, or how I am filling it, or both. Please let me know if you see a problem.

I am attaching the updated code, the sample data file I used to test the code, and the resulting tree. Note that there are two lines (2nd and 7th) in the sample data file where the value of Var3 is missing (two consecutive commas, with nothing in between). The resulting tree shows two corresponding values of -99, the value assigned for missing.
ReadData3.C (8.48 KB)
SampleData.txt (479 Bytes)
SampleData.root (5.52 KB)

Hi,

The issue is that you have:

std::vector <TString> fCharValues(nvars);
....
BranchDescription += "/C";
tree->Branch(fVarNames[j].Data(), (void *) fCharValues[j].Data(), BranchDescription.Data());

However, the branch has no way to be informed every time the TString object reallocates its memory buffer (i.e. the value 'fCharValues[j].Data()' varies over time).

You can solve the problem by either adding (here just for your example file):

tree->SetBranchAddress(fVarNames[1].Data(),(void*)fCharValues[1].Data());

but, of course, you need to call a similar line for every string branch (and for none of the float branches).

Alternatively (with ROOT v5.21/06 or newer) you can simply store the TString directly:

tree->Branch(fVarNames[j].Data(), &(fCharValues[j]),32000,0);
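The underlying point is that the address registered with the branch must stay valid for the whole fill loop. A third approach, not one of the two suggested above, is to copy each token into a fixed-size character buffer whose storage never moves. The sketch below shows that idea in plain C++ only; `CharSlot` is a hypothetical class, and the capacity is a parameter the user would have to choose.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical fixed-capacity slot for one character column.
// The buffer lives inside the object, so assigning a new value never
// reallocates and an address handed out once stays valid.
template <std::size_t N>
class CharSlot {
    static_assert(N > 0, "capacity must leave room for the terminator");
public:
    // Copy the token into the buffer, truncating to N-1 characters
    // and always null-terminating.
    void Set(const std::string& value) {
        const std::size_t n = value.size() < N - 1 ? value.size() : N - 1;
        std::memcpy(buffer_.data(), value.data(), n);
        buffer_[n] = '\0';
    }
    char* Address() { return buffer_.data(); }
    const char* c_str() const { return buffer_.data(); }
private:
    std::array<char, N> buffer_{};  // zero-initialised, so initially an empty string
};
```

In the tree code, the idea would be to pass the slot's address once when booking the "/C" branch and call `Set()` before each `Fill()`. If the slots are kept in a std::vector, the vector must not be resized after the addresses have been handed out, since that would move the slots themselves.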

Cheers,
Philippe

Thanks a lot, Philippe. It works. I am going to make some changes and do some more testing etc. and add some more functionality to it, and then provide the code here.
-Arun