Problem using TString::Tokenize() for Excel csv-files

Dear Rooters

When you export Excel files as *.csv, the individual columns are enclosed in
quotation marks, e.g.:
"12","13","14","15"

Thus a column can itself contain text with commas, e.g.
"My first","My second","My third,fourth","My seventh, eighth"

Now I want to tokenize the columns and tried the following code:

   TString csv = "\",\"";
   TString str = TString(&nextline[0]);
   TObjArray *strobj = str.Tokenize(csv);
   Int_t numsep = strobj->GetEntries() - 1;

   cout << "numsep = " << numsep << endl;
   cout << "At(2) = " << ((TObjString*)strobj->At(2))->GetString() << endl;

   delete strobj;

Unfortunately, this does not work; for the samples above the output is:

  1. example:
    numsep = 3
    At(2) = 14

  2. example:
    numsep = 5
    At(2) = My third

However, the second output should be:
numsep = 3
At(2) = My third,fourth

Do you know the reason for this and how to solve this problem?

Thank you in advance.

Best regards
Christian

Hi Christian,

The function:

TObjArray *TString::Tokenize(const TString &delim) const
{
   // This function is used to isolate sequential tokens in a TString.
   // These tokens are separated in the string by at least one of the
   // characters in delim. The returned array contains the tokens
   // as TObjString's. The returned array is the owner of the objects,
   // and must be deleted by the user.

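uses each character in "delim" as a possible separator. Your delimiter string therefore splits at every quotation mark and at every comma, not at the three-character sequence as a whole.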
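To illustrate this behaviour outside ROOT, here is a minimal sketch in plain C++. It uses std::string instead of TString, and SplitOnAnyChar is a hypothetical helper written for this illustration, not a ROOT function. It splits at any character of the delimiter set and skips empty tokens, much like strtok():

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of ROOT): split s at every character
// contained in delims, skipping empty tokens, as strtok() would.
std::vector<std::string> SplitOnAnyChar(const std::string &s,
                                        const std::string &delims)
{
   std::vector<std::string> tokens;
   // Skip any leading delimiter characters.
   std::string::size_type pos = s.find_first_not_of(delims);
   while (pos != std::string::npos) {
      // The token ends at the next delimiter character (or at end of string).
      std::string::size_type end = s.find_first_of(delims, pos);
      tokens.push_back(s.substr(pos, end - pos));
      // Skip the run of delimiter characters that follows the token.
      pos = s.find_first_not_of(delims, end);
   }
   return tokens;
}
```

Called with the second sample line and the delimiter set made of the quotation mark and the comma, this sketch yields six tokens, and the token at index 2 is "My third", which matches the numsep = 5 output reported above.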

Eddy

Dear Eddy

Thank you, I misunderstood the method (not looking at the source code).
In principle this is the TString substitute for “strtok()”.

However, I have the problem that sometimes more than one character is used
as separator, another example being " // ", i.e. blank-slash-slash-blank.

Do you or anybody have an idea how to solve this problem (w/o having to
use getc())?

Best regards
Christian

Hi Christian,

It is easy to attack the problem with regular expressions. I know that you do not like regular expressions, but why reinvent string parsing:

TObjArray *GetColumns(const TString &str)
{
   TPRegexp r("\"([\\w\\s,]+)\",?");

   TObjArray *colL = new TObjArray();
   colL->SetOwner();
   Int_t start = 0;
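   while (1) {
     // Next match of the pattern, searching from position 'start'.
     TString subStr = str(r,start);
     const Int_t l = subStr.Length();
     if (l<=0) break;   // no further match: stop before adding an empty entry
     // Drop the separating comma that the pattern may have matched.
     const TString stripStr = subStr.Strip(TString::kTrailing,',');
     colL->Add(new TObjString(stripStr));
     start += l;
   }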

   return colL;
}

void bla()
{
   TObjArray *col1L = GetColumns("\"12\",\"13\",\"14\",\"15\"");
   for (Int_t i = 0; i < col1L->GetLast()+1; i++)
     std::cout << ((TObjString*)col1L->At(i))->GetString() << std::endl;

   TObjArray *col2L = GetColumns("\"My first\",\"My second\",\"My third,fourth\",\"My seventh, eighth\"");
   for (Int_t i = 0; i < col2L->GetLast()+1; i++)
     std::cout << ((TObjString*)col2L->At(i))->GetString() << std::endl;
}
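For the multi-character separator case raised earlier (for example " // "), a regular expression is not strictly required. The following is a minimal sketch in plain C++ that treats the whole separator string as one unit. It uses std::string rather than TString, and SplitOnSeparator is a hypothetical helper name, not a ROOT method:

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of ROOT): split s at every occurrence of the
// complete separator string sep. Empty fields are kept.
std::vector<std::string> SplitOnSeparator(const std::string &s,
                                          const std::string &sep)
{
   std::vector<std::string> fields;
   if (sep.empty()) {          // an empty separator cannot split anything
      fields.push_back(s);
      return fields;
   }
   std::string::size_type start = 0;
   std::string::size_type hit;
   // Search for the whole separator, not for its individual characters.
   while ((hit = s.find(sep, start)) != std::string::npos) {
      fields.push_back(s.substr(start, hit - start));
      start = hit + sep.length();
   }
   fields.push_back(s.substr(start));   // remainder after the last separator
   return fields;
}
```

For the Excel case one could, under the assumption that every column is quoted, remove the first and the last quotation mark of the line and then split on the three-character separator quote-comma-quote; commas inside a column then stay untouched.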

Dear Eddy

Thank you for this nice example. I need to understand the regexp, but it works nicely.

P.S.: For the record: the std::cout line needs to read:
std::cout << ((TObjString*)col1L->At(i))->GetString() << std::endl;

P.P.S.:
As I see from the preview, for some reason the html formatting destroys the correct line (maybe a regexp artifact?)

Best regards
Christian

Hi Christian,

Oops, the html chewed up my code. I will attach it.

Eddy
bla.C (861 Bytes)

Dear Eddy

Thank you, but I had already corrected it and wanted to report that in the P.S., but the html chewed up my code, too :-)

Best regards
Christian