pyROOT loop not filling tree as expected

Hi everyone,

I’m not sure if I stumbled upon a bug, but when attempting to create a branch on an existing tree and then fill that branch with data, something strange happens to the data that is written to the root file. For example, the stat box on a histogram for the original data is:
Entries: 3229843
Mean: 2001
Std Dev: 1914

while the data written to the leaf produces a histogram with the following stats:
Entries: 3229843
Mean: 4.507e-43
Std Dev: 1.319e-42

I followed the examples here: https://wiki.physik.uzh.ch/cms/root:pyroot_ttree, and can run them fine on my machine. Below is the section of problematic code:

file_out = TFile.Open("/path/to/file.root", 'update')
tree1 = file_out.Get("TCluster") #get TCluster tree

#puts data stored in a list of lists back into original ordering
clustMod_orig = sorted(clust, key = lambda x: (x[3]))

#declare a 1D array to be used as a pointer
clustADCs = array('f', [0])

#Create branch to be written
branch = tree1.Branch("clustADCs", clustADCs, 'clustADCs[nclust]/F')
for ii in range(0,nev):
	clustADCs[0] = clustMod_orig[ii][2]
	branch.Fill()

file_out.Write("", TFile.kOverwrite)
file_out.Close()

I have also tried printing the data as it is filling the branch in this loop, and the correct numbers are being assigned to clustADCs[0]. When filling an empty array with the data, the correct data is also stored, so I am not sure why the branch is not filling properly. Does anyone know why this is occurring?


ROOT Version: 6.14/00
Platform: MacOS High Sierra, v.10.13.4
Compiler: gcc version 5.1.0


Hi @sbutalla
Is your branch clustADCs intended to be a single float or an array of floats?

You declare it as an array of floats of size nclust:
branch = tree1.Branch("clustADCs", clustADCs, 'clustADCs[nclust]/F')

but you fill it as if it were just a float:
clustADCs[0] = clustMod_orig[ii][2]

Is nclust another branch of the tree? Wouldn’t you rather declare your new branch as:
branch = tree1.Branch("clustADCs", clustADCs, 'clustADCs/F')

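On the other hand, are you sure you need to write floats rather than doubles? For doubles the branch would be declared as follows (with clustADCs created as array('d', [0]) so that the buffer type matches):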
branch = tree1.Branch("clustADCs", clustADCs, 'clustADCs/D')

Cheers,
Enric
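
A side note on the two suggestions above: the type letter in the leaf list has to agree with the typecode of the Python buffer, because ROOT reads raw bytes starting at the buffer's address. The sketch below uses only the standard array module (no ROOT needed). LEAF_TO_TYPECODE and buffer_matches_leaf are hypothetical names made up for illustration, and the table covers only three leaf types, so treat it as a rough check rather than a complete mapping.

```python
from array import array

# ROOT leaf-list type letter -> Python array typecode (small subset only).
LEAF_TO_TYPECODE = {
    "F": "f",  # 32-bit float
    "D": "d",  # 64-bit float
    "I": "i",  # signed int
}

def buffer_matches_leaf(buf, leaf_type):
    """Return True if `buf` has the typecode expected for `leaf_type`."""
    return buf.typecode == LEAF_TO_TYPECODE[leaf_type]

float_buf = array("f", [0.0])
double_buf = array("d", [0.0])

print(buffer_matches_leaf(float_buf, "F"))   # True
print(buffer_matches_leaf(float_buf, "D"))   # False
print(buffer_matches_leaf(double_buf, "D"))  # True
```

In other words, switching the leaf list to 'clustADCs/D' would also mean declaring the buffer as array('d', [0]).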

Hi Enric,

Thank you for the help! This seems to have fixed the issue. After processing the data, however, some values are missing from the histogram. I realized that the number of entries in the tree does not correspond to the number of entries in the clustADCs leaf of the tree. For instance, while there are 199977 entries in the evtID leaf, there are 3358152 entries in the clustADCs leaf, which should be the proper number to run the loop over. I have been searching but cannot find a way to get the number of entries from a leaf object so that the loop will run over the proper number of entries. Any advice on this would be greatly appreciated!

Thank you,

Stephen

Hi @sbutalla

To iterate over the entries of a tree, you need to do:

for event in tree:
  print(event.clustADCs)

That should print as many clustADCs values as you wrote in your previous script (nev). The number of entries of a tree is obtained with tree.GetEntries().

Hi Enric,

Ultimately the purpose of this program (only a snippet was provided in the original post) is to extract data from another root file, modify the contents of the clustADCs leaf and then write that data to a cloned root file without the original clustADCs leaf. At the beginning of the program, the original file is opened and the command nev = tree.GetEntries() returns the number of events (199977), which I presume is from the evtID leaf in that tree. However, the number of entries stored in the clustADCs leaf in the original file is 3358152, which would be the proper number of entries to loop over. Is there a way to retrieve the number of entries in the original leaf?

Thank you,

Stephen

Hi @sbutalla,

The result of tree.GetEntries() does not correspond to any leaf; it is the number of entries of the tree. All the branches/leaves in a tree have the same number of entries, which is the value returned by tree.GetEntries(): 199977 in your case, as I understand it.

What can happen is what is described in this other post:

If a branch is a variable sized array and you plot a histogram of that branch, the number of entries in the histogram can be different from the number of entries in the tree, but these are two different things. How do you obtain 3358152? Is it what you see in the histogram plot for the clustADCs branch?

Enric
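
To make the distinction concrete, here is a toy illustration in plain Python (no ROOT, and the numbers are made up). Each inner list stands for the clustADCs array of one tree entry; the variable names are hypothetical.

```python
# Each inner list plays the role of one entry's variable-size array.
events = [
    [2001.0, 1850.5],          # entry 0: nclust = 2
    [],                        # entry 1: nclust = 0
    [1999.0, 2100.0, 1700.0],  # entry 2: nclust = 3
]

# One count per entry: this is the kind of number tree.GetEntries() reports.
n_tree_entries = len(events)

# One count per array element: this is what a histogram of the array shows.
n_hist_entries = sum(len(ev) for ev in events)

print(n_tree_entries, n_hist_entries)  # 3 5
```

The histogram count is the sum of nclust over all entries, which is why it can be much larger than the number of entries in the tree.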

Hi Enric,

That clarifies a lot, thank you!

When I run the code

import ROOT
from ROOT import TFile

file = TFile('myFile.root')
tree = file.TCluster

entries = 0
clustEntries = 0

for entry in tree:
   entries += 1
   for jj in entry.clustADCs:
      clustEntries += 1

and then print the two counters, I get entries = 199977 and clustEntries = 3358152. The value of clustEntries matches the number of entries in the histogram. So this is an easy fix then: just run a loop that returns both the number of entries in the tree and the number of values in the leaf of interest. In order to extract the data I followed the tutorial at http://lcgapp.cern.ch/project/pi/Examples/PyAIDAProxy/examples/hippoDemo.html, but it appears that this link is no longer working. The code is very similar to the code reproduced in this post: Reading Values using PyROOT, which was based on code at a similar link (http://lcgapp.cern.ch/project/pi/Examples/PyAIDAProxy/examples/readTree.py, which is also no longer working). From your previous comments, each event does contain a variable-size array with multiple entries, and the number of clusters for each event is specified in the nclust leaf in the tree. I have been looking for a solution but have found no simple way to extract the array of values in the clustADCs branch when running the loop over the events. For instance, the .GetValue() function only returns the first element of the array stored for each event. Is there a way to simply extract the array with a function?

Thank you,

Stephen

Hi @sbutalla

At every iteration, entry.clustADCs is your array of values for that entry. Also, at every iteration, entry.clustADCs is refilled with new values. So if by extracting the array you mean saving its values somewhere, you can do something like this:

import ROOT
from ROOT import TFile

file = TFile('myFile.root')
tree = file.TCluster

entries = 0
clustEntries = 0

for entry in tree:
   mylist = []
   for val in entry.clustADCs:
      mylist.append(val)
   # Here mylist contains all the values of your array

Or you could use an array.array or numpy array instead, whatever you like, to do the copy before the next iteration replaces the previous values.

Enric

Also note that you can access the entries of the array individually, at every iteration you can do e.g. print(entry.clustADCs[0]).
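
The point about copying before the next iteration can be illustrated without ROOT. In the sketch below, iter_entries is a hypothetical stand-in for "for entry in tree" that deliberately reuses one buffer, which is an assumption made to mimic the refilling behaviour described above; the data are made up.

```python
from array import array

def iter_entries(rows):
    """Hypothetical stand-in for iterating a tree: ONE buffer is reused."""
    buf = array("f")
    for row in rows:
        del buf[:]        # drop the previous entry's values
        buf.extend(row)   # refill the same object with the new values
        yield buf

rows = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# Keeping the reference: every element is the same, already overwritten buffer.
aliased = [buf for buf in iter_entries(rows)]

# Copying inside the loop: each entry's values are preserved.
copied = [list(buf) for buf in iter_entries(rows)]

print([list(a) for a in aliased])  # [[4.0, 5.0, 6.0], [4.0, 5.0, 6.0], [4.0, 5.0, 6.0]]
print(copied)                      # [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
```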

Hi @etejedor,

Thank you! I was able to extract these values with no problem. However, there is now an issue with writing these values to the tree. Since this is a variable array that needs to be written, I have tried using the following code, which extracts the number of clusters (to be used as the range for the nested for loop when looping over all of the entries in the tree) and then attempts to pass zeroes to the variable clustADCs declared in the branch, and then write them to the tree (to see if this produces the expected output of a histogram with ~3 million entries with zero):

file_out = TFile.Open("myfile.root", 'update')
tree = file_out.Get("TCluster")
branch = tree.Branch("clustADCs", clustADCs, 'clustADCs[nclust]/F')

#Extract values in nclust leaf
nclust = []
for entry in tree.nclust:
	nclust.append(entry)

index = []
for jj in range(0,199977):
	index = nclust[jj]
	print(index)
	for bb in range(0,index):
		clustADCs[0] = 0
		branch.Fill()

file_out.Write("", TFile.kOverwrite)
file_out.Close()

There is an issue with the nclust branch. I have tried many different ways to write this loop, but I can only get it to work using the method at Reading Values using PyROOT. When running the loop to get the value of nclust, I receive the following error: TypeError: 'int' object is not iterable. However, if the other method is used and the nclust values are extracted, I can run the code:

for jj in range(0,199977):
	index = nclust[jj]
	print(index)
	for bb in range(0,int(index)):
		clustADCs[0] = 0
		branch.Fill()

where nclust is an array of floats cast to integers using nclust = numpy.array(nclustValues, dtype = int). When checking to see if the data were successfully written, a histogram with a mean of ~10^30 is produced which is certainly incorrect. Do you have any ideas or suggestions on how to correct this?

Best regards,

Stephen

Hi @sbutalla,

Isn’t nclust an integer branch of your tree? If you run tree.Print(), what does it tell you about the type of nclust?

That would explain why you get the TypeError you get here:

 for entry in tree.nclust:
	nclust.append(entry)

You should do this instead, to get nclust for every entry:

nclust_list = []
for entry in tree:
	nclust_list.append(entry.nclust)

You can check the length of nclust_list at the end; it should equal the number of entries.

After that, you want to create a new branch of type array whose size for every entry is an integer of nclust_list?
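
For reference, the TypeError can be reproduced without ROOT. In this sketch fake_tree is a hypothetical stand-in for the tree, where each entry exposes an integer attribute nclust (assumed here to mirror an integer branch).

```python
from types import SimpleNamespace

# Hypothetical stand-in for the tree: each entry exposes an integer `nclust`.
fake_tree = [SimpleNamespace(nclust=n) for n in (2, 0, 3)]

# Iterating over the integer itself fails, as in the reported error.
try:
    for value in fake_tree[0].nclust:
        pass
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Iterating over the entries and reading the attribute works.
nclust_list = [entry.nclust for entry in fake_tree]

print(error_message)  # 'int' object is not iterable
print(nclust_list)    # [2, 0, 3]
```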

Hi @etejedor,

That does in fact work to retrieve the integer values and it does return the proper length. I think I understand now why the previous loop did not work. That’s correct; I want to create a branch clustADCs of type float array whose size for every entry is in the nclust_list, and will be filled with the modified data from the original file.

Thank you,

Stephen

Hi @sbutalla

Good to hear it is now working. In order to create your new branch, you can follow this example (it’s in C++, but should be equivalent):

https://root.cern.ch/root/html534/guides/users-guide/Trees.html#adding-a-branch-to-an-existing-tree

In your case the branch will be linked to an array. You can declare an array.array variable in Python with a size greater than or equal to the maximum of nclust.

tree.Branch("your_array_branch", your_array_variable, "your_array_branch[nclust]/F")

Enric
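
The buffer handling described above can be sketched in plain Python. This is not ROOT code: fill_variable_arrays is a hypothetical helper, and the fill callback only stands in for branch.Fill() so that the pattern can be run and checked on its own.

```python
from array import array

def fill_variable_arrays(rows, fill):
    """Copy each row into one preallocated buffer, then call `fill`.

    `fill(buf, n)` is a hypothetical callback standing in for branch.Fill();
    only the first `n` elements of `buf` are meaningful for that entry.
    """
    max_len = max((len(row) for row in rows), default=0)
    # Sized once, up front, to hold the longest row.
    buf = array("f", [0.0] * max(max_len, 1))
    for row in rows:
        for i, value in enumerate(row):
            buf[i] = value          # overwrite in place, no reallocation
        fill(buf, len(row))
    return buf

written = []
rows = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
fill_variable_arrays(rows, lambda buf, n: written.append(list(buf[:n])))
print(written)  # [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
```

The buffer is allocated once before the loop because growing an array.array can move it in memory, while the branch would keep pointing at the address it was given when it was created.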

Hi @etejedor,

I have followed those instructions, using the array module method to declare the nclust array:

nclust = array('i')
for i in range(maxClust):
	nclust.append(0)

and unfortunately I cannot get the branch to fill correctly. So far I have the following code, which attempts to create the variable-length array for each “event,” then write that to the tree using clustADCs[0] as a pointer and then the branch.Fill() method. I am trying to fill with just zeros first to see if I can reproduce a histogram with zero mean and zero standard deviation before writing any actual data to the branch. Below is the code:

for entry in tree:
	nclustValues.append(entry.nclust)
	events += 1

maxClust = max(nclustValues)
nclust = np.zeros(maxClust, dtype = int)
clustADCs = array('f',[0])
branch = tree.Branch("clustADCs", clustADCs, 'clustADCs[nclust]/F')

for jj in range(0,events):
	index = nclustValues[jj]		
	temparray = []
	for bb in range(0,index):
		temparray.append(0)
	clustADCs[0] = temparray
	branch.Fill()

I am receiving the error TypeError: a float is required on the line clustADCs[0] = temparray, which is strange because the same error occurs even after declaring temparray = array('f', [0]). Passing a single float to the only element of clustADCs, on the other hand, works just fine.

Is there a subtlety that I am missing when filling this branch? Any suggestions are greatly appreciated.

Thank you,

Stephen

Hi @sbutalla

In your example, the array that has the right size to host the content that will be written is the numpy array, nclust. Note how clustADCs has size 1, and not max(nclustValues). Therefore, you need to do:

branch = tree.Branch("clustADCs", nclust, 'clustADCs[nclust]/F')

Then inside the loop, at every iteration, you need to assign values to nclust before running branch.Fill().

Also please check the link with the example I shared in my last comment, after the loop you need to Write to the tree to have the changes written.

Hi @etejedor,

I have implemented this and it works! Thank you for the help. I think I was confused about how the data was being written to the branch.

When I write only zeros or ones to the branch, for instance, it produces a correct histogram:


(I think the mean is slightly lower than 1 due to a floating point error).

However, when actual data is written, it produces a histogram with a strange mean and standard deviation (the mean should be around 2000):

I am filling the branch as suggested, using the following code:

nclust = array('f',[0])
for i in range(maxClust):
	nclust.append(0)

branch = tree.Branch("clustADCs", nclust, 'clustADCs[nclust]/F')

for jj in range(0,events):
	index = nclustValues[jj]		
	for aa in range(0,index):
		print(clust[jj][aa])
		nclust[aa] = clust[jj][aa]
		print(nclust[aa])
	branch.Fill()

file_out.Write("", TFile.kOverwrite)
file_out.Close()

Using the print statements in the loop, I can verify that the correct data is being pulled from clust[jj][aa] and is correctly being assigned to nclust[aa], but the branch is not being written correctly for some reason. Is there something I am missing with this loop?

Thank you,

Stephen

The code looks fine. I would try these things:

  • Once you have written the new branch to the file, if you run another Python script that reads the new array branch from the tree:
for entry in tree:
  for elem in entry.clustADCs:
    print(elem)

Does it print wrong values?

  • Try writing doubles instead of floats: nclust = array('d',[0]) and branch = tree.Branch("clustADCs", nclust, 'clustADCs[nclust]/D')
  • Try writing a fixed size array branch:
    branch = tree.Branch("clustADCs", nclust, 'clustADCs[10]/D')
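
When comparing values read back from the file with the originals, it may help to remember that a /F branch stores 32-bit floats, so an exact comparison with Python's 64-bit floats will usually fail even when nothing is wrong. The sketch below uses only the standard library and made-up numbers; it shows the size of that rounding effect, which is far too small to explain a mean that is off by orders of magnitude.

```python
from array import array
import math

original = [2001.3, 1914.7, 0.1]

# Round-trip through 32-bit and 64-bit storage.
as_float32 = list(array("f", original))
as_float64 = list(array("d", original))

print(as_float64 == original)  # True
print(as_float32 == original)  # False

# A tolerance-based comparison is the fairer check for float branches.
close_enough = all(
    math.isclose(a, b, rel_tol=1e-6) for a, b in zip(as_float32, original)
)
print(close_enough)  # True
```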
