Encoding of ROOT strings

I would like to import a tree into a numpy array using PyROOT and Python 3.6. I managed to do that, except for a branch containing the names of the particles involved (one name per event). I tried with

tree.SetBranchAddress(name, addr)
for j in range(nentries):
    tree.GetEntry(j)
    col.append(addr[0])
array[name] = col

with a fixed-width format for the strings (‘S30’ to be specific), but I obtain results like

proton\x00\x00\x00

I thought that the strings have different lengths and so I tried

for j in range(nentries):
    tree.GetEntry(j)
    for i, (name, t) in enumerate(listOfColumnsAndTypes):
        array[i].append(getattr(tree, name))

but I got UTF-8 encoding errors. I tried to change the encoding (to ASCII) and I got the same results as above. Is there any way to know the encoding used by the strings in the tree entries? I guess that this problem might also explain why my first attempt was unsuccessful.

(If it matters, the root file comes from a GEANT4 output and the tree.Print() method shows that the branch I am interested in has the Char_t type.)

@etejedor can you help?

Hi @Rdat

First I would like to mention that, since ROOT 6.14, we have a feature to read TTrees into numpy arrays. It works for trees with branches of arithmetic types, though:

my_numpy_array = tree.AsMatrix(columns = ['x', 'y'])

Unfortunately that does not apply to your case because you want to read strings as well :slight_smile:
What happens if you read the entries like this:

for entry in tree:
    print(entry.name_of_your_str_branch)

Cheers,
Enric

If I print the values (i.e. I do not store them into an array or a list), I get things like

proton
e-ton
proton
roton
proton
proton
proton
proton
proton
e-ton
e-ton
proton
proton
e-ton
proton

I didn’t notice that earlier because I have several thousand values and I simply didn’t print them: I was content with checking the automatic histogram of the ntuple in the TBrowser (which looks fine to me).

Hi @Rdat

So the names of the particles you read from the TTree are correct, right?

Is the issue then only related to storing Python strings in a numpy array?

Enric

Hi @etejedor,

Actually, the names should be “proton”, “neutron”, “e-”, “e+”…, and not “e-ton”, “roton”… At least, this is what I expect from GEANT4 and what I see in the histogram of the TBrowser.

That is why I thought about an encoding problem: since everything seems fine as long as I stay within ROOT, I guessed that Python expects a somewhat different structure for the strings, so that it reads parts of the memory surrounding the string entries and/or does not decode the bytes correctly. The issue is not simply related to the string length, because even when I store the values into 2-character strings I get things like “e-”, “e+”, “\x00-”, “\x00+” etc.

Hi @Rdat

Could you share with me the input file you are using? I would like to know if this is related to the Python version (did you check whether the same happens with Python 2?).

Enric

Hi @etejedor

Here you go (it is quite a big file): https://drive.google.com/open?id=160Agehi-nr4RnjRWBFqGokWXtOHAfzao

I did not try with Python 2, but I am going to right now. I’ll let you know.

Thank you.

Hi @etejedor

I do have a Python 2.7 installation, but my ROOT is built for Python 3.6 and I am not sure I want to rebuild ROOT just to try that: I did it quite some time ago and I remember it was not easy for me. Moreover, if I break something, I might not be able to get the output out of GEANT4 any more, which is my primary concern… :frowning_face:

Consider using uproot to read the file into NumPy.

Hi @beojan

I will check that out, thank you very much.

Hi @etejedor

After saving the output of tree.Scan() into a file, I noticed some blank entries where the names of the particles were supposed to be, even though the TBrowser looks fine. I am starting to think that the issue is with the file itself and thus Geant4 (unless somebody finds a solution).

If that is so, thank you anyway for your efforts and help.

Hi @Rdat

I believe the e-ton appearances are due to the fact that the same buffer is reused to read protons and electrons. So when you read a proton, that buffer stores proton, but then when you write e- in the same memory address it keeps the old characters that were written there.
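The buffer reuse described above can be mimicked in plain Python, without ROOT. This is only a sketch: the 8-byte `bytearray` and the helper `write_c_string` are hypothetical stand-ins for the branch buffer and for the code that fills it, not actual ROOT or PyROOT internals.

```python
# Hypothetical stand-in for the fixed buffer that backs the Char_t branch.
buffer = bytearray(8)


def write_c_string(buf, text):
    """Copy text plus a terminating NUL into buf, leaving later bytes untouched."""
    data = text.encode("ascii") + b"\x00"
    buf[:len(data)] = data


# First entry: a proton is written into the buffer.
write_c_string(buffer, "proton")
first = bytes(buffer)

# Later entry: an electron overwrites only the first three bytes.
write_c_string(buffer, "e-")
second = bytes(buffer)

print(first)   # b'proton\x00\x00'
print(second)  # b'e-\x00ton\x00\x00'
```

The shorter name only overwrites the start of the buffer, so the tail of the longer name survives after the terminating zero byte.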

For instance, I see this:

>>> import ROOT
>>> f = ROOT.TFile("run0_Cu10MeV_BeThk0.5mm_E78.2MeV_D850mm_r3cm_d280mm_noshield.root")
>>> f.Detector.GetEntry(0)
87
>>> f.Detector.Name
'proton\x00\x00'
>>> f.Detector.GetEntry(136)
83
>>> f.Detector.Name
'e-\x00ton\x00\x00' 

This looks like a bug in PyROOT, I will open the corresponding ticket. A temporary workaround would be to just get a substring when the name starts with e, so f.Detector.Name[:2].

Actually, the same thing happens in C++ if you allocate an array of characters and use SetBranchAddress. The only difference is that the zero character is interpreted as the end of the string in C++ but not in Python.
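The difference between the two readings can be illustrated with the standard `ctypes` module, which exposes both views of a character buffer. This is a sketch under the assumption that the branch buffer holds exactly the eight bytes shown in the session above; it does not involve ROOT itself.

```python
import ctypes

# The eight bytes seen in the session above for entry 136.
raw = b"e-\x00ton\x00\x00"
buf = ctypes.create_string_buffer(raw, len(raw))

c_view = buf.value    # C-style read: stops at the first zero byte
full_view = buf.raw   # every byte in the buffer, zero bytes included

print(c_view)     # b'e-'
print(full_view)  # b'e-\x00ton\x00\x00'
```

The C-style read corresponds to what C++ sees, while the full view corresponds to the string that reaches Python.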

Ticket open here:
https://sft.its.cern.ch/jira/browse/ROOT-9768

For now, I suggest you apply the workaround mentioned above and take a substring with the desired characters.
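A variant of the substring workaround that does not depend on knowing the length of each name is to cut the value at the first zero character. This is a minimal sketch, assuming the branch value arrives as a Python `str` like in the session above; `strip_at_nul` is a hypothetical helper name, not part of PyROOT.

```python
def strip_at_nul(value):
    """Return the part of value that precedes the first NUL character."""
    return value.split("\x00", 1)[0]


print(strip_at_nul("proton\x00\x00"))     # proton
print(strip_at_nul("e-\x00ton\x00\x00"))  # e-
```

In the reading loop this could be applied as `strip_at_nul(entry.Name)` before appending to the list. A value that starts with a zero character comes back as an empty string.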

So the problem is deeper than I thought… @etejedor Thank you so much. I will do as you suggest.

Dear @Rdat ,

Thanks for starting this thread! I find this thread too valuable to be hidden as a topic in the Newbie section. Do you have objections against moving this thread to the ROOT section?

Cheers, Axel.

Hi @Axel,

No objection at all. Please, feel free to move the thread.
