Wrong branch type detection

umute97 · May 6, 2024, 1:41pm

Hello,
I am trying to read in a root file containing a structure like this:
wfm

w0 (containing arrays of doubles)
w1 (containing arrays of doubles)
…

meta

pos (containing arrays of ints)

For this purpose, I have the following code:

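// Pointers handed to SetBranchAddress must start out null (or point to a valid object)
std::vector<long> *pos = nullptr;
std::vector<double> *w0 = nullptr;
std::vector<double> *w1 = nullptr;
std::vector<double> *w2 = nullptr;
std::vector<double> *w3 = nullptr;
std::vector<double> *w4 = nullptr;
std::vector<double> *w5 = nullptr;
std::vector<double> *w6 = nullptr;
std::vector<double> *w7 = nullptr;
std::vector<double> *w8 = nullptr;
std::vector<double> *w9 = nullptr;
std::vector<double> *w10 = nullptr;
std::vector<double> *w11 = nullptr;
std::vector<double> *w12 = nullptr;
std::vector<double> *w13 = nullptr;
std::vector<double> *w14 = nullptr;
std::vector<double> *w15 = nullptr;
std::vector<double> *trg0 = nullptr;
std::vector<double> *trg1 = nullptr;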

// List of branches
TBranch *b_bias; //!
TBranch *b_freq; //!
TBranch *b_size; //!
TBranch *b_pos;  //!
TBranch *b_w0;   //!
TBranch *b_w1;   //!
TBranch *b_w2;   //!
TBranch *b_w3;   //!
TBranch *b_w4;   //!
TBranch *b_w5;   //!
TBranch *b_w6;   //!
TBranch *b_w7;   //!
TBranch *b_w8;   //!
TBranch *b_w9;   //!
TBranch *b_w10;  //!
TBranch *b_w11;  //!
TBranch *b_w12;  //!
TBranch *b_w13;  //!
TBranch *b_w14;  //!
TBranch *b_w15;  //!
TBranch *b_trg0; //!
TBranch *b_trg1; //!

TFile *f = new TFile(title.c_str());
TTree *tree = dynamic_cast<TTree *>(f->Get("wfm"));
TTree *metaTree = dynamic_cast<TTree *>(f->Get("meta"));

// variables from input root files
tree->SetBranchAddress("w0", &w0, &b_w0);
tree->SetBranchAddress("w1", &w1, &b_w1);
tree->SetBranchAddress("w2", &w2, &b_w2);
tree->SetBranchAddress("w3", &w3, &b_w3);
tree->SetBranchAddress("w4", &w4, &b_w4);
tree->SetBranchAddress("w5", &w5, &b_w5);
tree->SetBranchAddress("w6", &w6, &b_w6);
tree->SetBranchAddress("w7", &w7, &b_w7);
etc.

The root file has been created using uproot.

When I try this, it gives me an error:
Error in <TTree::SetBranchAddress>: The pointer type given "vector<double>" does not correspond to the type needed "Double_t" (8) by the branch: w0
Same error on the other branches, of course.

So, apparently, ROOT recognizes the branch types as doubles instead of vectors of doubles (which I know is not correct, since I can analyze the file just fine using Python).

What am I doing wrong? Is there some way to suggest the correct branch type to root?

ROOT Version: 6.30/06
Platform: Fedora 38

Danilo · May 6, 2024, 1:58pm

Hi,

Thanks for the interesting post and welcome to the ROOT Community!
This is odd. Are you sure that the type of column w0 is really a vector<double>? I ask because you start the post by calling those arrays.

Have you tried reading the file with RDataFrame? That would be the interface we suggest to use for analysis.

Could you also share the file for us to reproduce?

Cheers,
D

umute97 · May 6, 2024, 2:06pm

Hello Danilo,

thanks for your fast reply. As you can see down below, I can read in the data file just fine using uproot:

At my setup, I am writing awkward arrays to the branches. I am assuming that uproot transforms those into native types (vectors? arrays? I really don’t know because I can’t/don’t know where to look at the types in the root file…).

I would love to share the data file but it is 50GB in size and as such, can’t easily be shared…

vpadulan · May 6, 2024, 2:19pm

Dear @umute97 ,

I am assuming that uproot transforms those into native types

In general this is not true, although for this kind of very simple types I agree it should be the case. Maybe a question for uproot developers.

I really don’t know because I can’t/don’t know where to look at the types in the root file…).

You can do it with

>>> import ROOT
>>> with ROOT.TFile.Open("myfile.root") as f:
...     tree = f.Get("mytree")
...     tree.Print()

Cheers,
Vincenzo

umute97 · May 6, 2024, 2:28pm

Hi,
did that, returned:

The number of entries amounts to what I expected to see. There should be 625000 arrays in that branch, which there are. Assuming that “/D” is the type of the branch, ROOT detects it as a "D"ouble branch type as the error message suggests?

umute97 · May 6, 2024, 2:36pm

Following up on this: just a quick back-of-the-napkin calculation:

Size of double * 1024 doubles in one array * 625000 entries

This is approximately the size that I got with the tree.Print(), so even the size checks out.
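For reference, the estimate above can be written out explicitly. This is only a sketch of the arithmetic: the array length and entry count are the figures quoted in this thread, and the constant names are made up for illustration.

```python
# Back-of-the-envelope size check for one waveform branch (uncompressed).
BYTES_PER_DOUBLE = 8          # a 64-bit floating point number
SAMPLES_PER_WAVEFORM = 1024   # array length quoted in this thread
N_ENTRIES = 625_000           # entry count quoted in this thread

# Expected size if every entry holds a full array of doubles.
size_as_arrays = BYTES_PER_DOUBLE * SAMPLES_PER_WAVEFORM * N_ENTRIES

# Expected size if every entry held a single double instead.
size_as_scalars = BYTES_PER_DOUBLE * N_ENTRIES

print(f"arrays : {size_as_arrays:,} bytes")
print(f"scalars: {size_as_scalars:,} bytes")
```

The two results differ by a factor of 1024, so the size reported for the branch is a useful hint about which layout is actually on disk.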

vpadulan · May 6, 2024, 2:45pm

Dear @umute97 ,

Yep, the D symbol stands for double (see the docs), and that branch is not stored as an array.
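As a quick reference for the type letters that appear after the slash, here is a small lookup sketch. The table covers the common leaflist type codes from the TTree documentation and is not exhaustive; the helper name `describe_type_code` is made up for this example.

```python
# Common leaflist type codes from the TTree documentation (not exhaustive).
LEAF_TYPE_CODES = {
    "C": "null-terminated character string",
    "B": "Char_t (8-bit signed integer)",
    "b": "UChar_t (8-bit unsigned integer)",
    "S": "Short_t (16-bit signed integer)",
    "s": "UShort_t (16-bit unsigned integer)",
    "I": "Int_t (32-bit signed integer)",
    "i": "UInt_t (32-bit unsigned integer)",
    "F": "Float_t (32-bit floating point)",
    "D": "Double_t (64-bit floating point)",
    "L": "Long64_t (64-bit signed integer)",
    "l": "ULong64_t (64-bit unsigned integer)",
    "O": "Bool_t",
}


def describe_type_code(code: str) -> str:
    """Return a human-readable description of a leaflist type code."""
    return LEAF_TYPE_CODES.get(code, f"unknown type code {code!r}")


print(describe_type_code("D"))
```

Note that the codes are case-sensitive: upper case is the signed variant and lower case the unsigned one for the integer types.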

umute97 · May 6, 2024, 2:50pm

But it is, see the 3rd entry in this post. If they weren’t stored as arrays, uproot would not be able to read in the rootfile and print the arrays like that…

vpadulan · May 6, 2024, 2:53pm

Dear @umute97 ,

It depends. uproot might have logic to read back the data from disk and do the necessary manipulation to treat the doubles as arrays, but only in memory (e.g. simply by referencing a range of them to make up the array of one event; this is just an example, I am not sure this is how it actually happens).

I could try to reproduce your problem if you gave me a small ROOT file produced with uproot and possibly a reproducer of how to write that dataset. Unfortunately I can’t promise I would be able to test this immediately but I would come back to you once I try.

Cheers,
Vincenzo
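To make the "referencing a range" idea from the post above concrete, here is a minimal pure-Python sketch. It is an illustration only and not a description of how uproot is implemented; the function name `split_by_counts` and the sample values are made up.

```python
from itertools import accumulate


def split_by_counts(flat, counts):
    """Cut one flat sequence of values into per-event lists.

    counts[i] is the number of values that belong to event i.
    """
    # Running sum of the counts gives the start/stop offset of each event.
    offsets = [0, *accumulate(counts)]
    if offsets[-1] != len(flat):
        raise ValueError("counts do not add up to the number of values")
    return [flat[start:stop] for start, stop in zip(offsets, offsets[1:])]


# Six doubles stored back to back, belonging to three events.
flat_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
counts = [3, 1, 2]

events = split_by_counts(flat_values, counts)
print(events)
```

With a layout like this, the values on disk are plain doubles, and the per-event arrays only exist once a reader combines them with the counts.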

umute97 · May 6, 2024, 2:55pm

I’ll spin the setup up real quick and generate a small test file for you. Thanks for looking into this!

umute97 · May 6, 2024, 3:05pm

A test file can be downloaded here: https://cernbox.cern.ch/s/4mOHVGU509efdAS

(I’m a new user, so I can’t create links.)

vpadulan · May 6, 2024, 3:15pm

Dear @umute97 ,

Thanks a lot for the file! Could you also share the code to create that dataset?

Cheers,
Vincenzo

umute97 · May 6, 2024, 3:28pm

The code is embedded into our measurement framework and as such is a bit crowded, I’ll try to isolate the most important points:

with self.device:
    self.log.info("Reading %d events...", max_num_events)
    while nevts < max_num_events:
        time.sleep(0.05)
        waveforms = self.get_waveforms()
        current_nevts = len(waveforms)
        nevts += current_nevts
        data += waveforms
        self.log.info("Read %d out of %d events...", nevts, max_num_events)

self._chunk += data
self._meta_chunk += [meta] * len(data)

if len(self._chunk) >= self._chunk_size or ignore_chunk_size:
    self.log.info("Chunk full.")
    self.copy_and_save_chunks()

copy_and_save_chunks() creates copies of the chunk lists and starts a thread to save the chunks:

chunk = pd.DataFrame(chunk)
chunk = chunk.applymap(_format_waveform)
chunk = chunk.rename(columns=self._channel_map)

meta_chunk = pd.DataFrame(meta_chunk)

# Transform (meta)data to suitable format for saving
formatted = {"wfm": {}, "meta": {}}
formatted["wfm"] = {column: chunk[column] for column in chunk.columns}
formatted["meta"] = {
       column: ak.Array(meta_chunk[column]) for column in meta_chunk.columns
}

# Note: Empty string next to zip function is necessary
if self._output_file is None:
    # No rootfile created yet, so create it
    self._output_file = ur.recreate(self._output_path)
    for tree, data in formatted.items():
        self._output_file[tree] = {"": ak.zip(data)}
else:
    # Rootfile already exists, so extend it
    for tree, data in formatted.items():
        data["n"] = ak.Array([len(data)])
        self._output_file[tree].extend({"": ak.zip(data)})

If you have any questions, just ask.

Danilo · May 6, 2024, 6:22pm

Hi,

We need to reproduce the issue with a standalone script: could you condense the code into a single standalone file?

Best,
D

pcanal · May 6, 2024, 6:31pm

The TTree::Print output confirms that the data for that branch is not stored as a collection; there is a single value per entry. Uproot must have a feature that allows loading single elements into a collection.

umute97 · May 6, 2024, 10:13pm

TTree::Print also confirms that 625000 entries amount to ~ 5 120 000 000 bytes. This cannot be, if the entries are plain doubles.

pcanal · May 6, 2024, 10:35pm

Right … something happened to this file:

root [6] wfm->GetBranch("w0")->GetListOfLeaves()->ls()
OBJ: TObjArray	TObjArray	An array of objects : 0
 OBJ: TLeafD	w0	w0[n] : 0 at: 0x6000008f0360

So the branch is incorrectly created, but the leaf is correct. This is a problem, and the automatic tools will get confused.

We need to figure out how and why they are badly produced.

How were they created?

umute97 · May 6, 2024, 10:39pm

@pcanal I refer you to my previous replies. I generate the branches in my code posted above, using uproot.

umute97 · May 7, 2024, 9:36am

Hello again,

I have written a quick standalone script to reproduce my problem:

import uproot as ur
import pandas as pd
import awkward as ak
import numpy as np

channel_map = {
    "CH0": "w0",
    "CH1": "w1",
}

output_path = "test.root"

def generate_random_chunk(chunk_size, length):
    """Generate a chunk of random floating point numbers of a specific length and chunk size
    on all channels specified in the channel_map."""
    chunk = []
    for _ in range(chunk_size):
        channel_data = {channel: list(np.random.rand(length)) for channel in channel_map.keys()}
        chunk.append(channel_data)
    return chunk

chunk = generate_random_chunk(10, 1024)
chunk = pd.DataFrame(chunk)
chunk = chunk.rename(columns=channel_map)

# Transform (meta)data to suitable format for saving
formatted = {"wfm": {}}
formatted["wfm"] = {column: chunk[column] for column in chunk.columns}

output_file = ur.recreate(output_path)
for tree, data in formatted.items():
    output_file[tree] = {"": ak.zip(data)}

Checking the generated data with the methods described in your previous posts, I verified that the problem occurs here, too.

pcanal · May 7, 2024, 2:19pm

@jpivarski This issue needs to be addressed in uproot; it is producing an incorrect file. The correct leaflist needs to be recorded in the title of the branch.
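To illustrate the difference between a scalar and a variable-length leaflist entry, here is a small parsing sketch. It handles only the simple `name[dim]/T` form, and the helper names `parse_leaf` and `is_variable_length` are made up. The strings `"w0/D"` and `"w0[n]/D"` are hypothetical examples built from the leaf title `w0[n]` and the `/D` type discussed above; they are not copied from the actual file.

```python
import re

# Simple form only: name, optional [dim] groups, optional /T type code.
_LEAF_RE = re.compile(
    r"^(?P<name>\w+)(?P<dims>(?:\[\w*\])*)(?:/(?P<code>[A-Za-z]))?$"
)


def parse_leaf(spec):
    """Split a single leaflist entry into name, dimensions and type code."""
    match = _LEAF_RE.match(spec)
    if match is None:
        raise ValueError(f"not a simple leaflist entry: {spec!r}")
    dims = re.findall(r"\[(\w*)\]", match.group("dims"))
    return {"name": match.group("name"), "dims": dims, "code": match.group("code")}


def is_variable_length(spec):
    """True if any dimension is given by a counter name instead of a number."""
    return any(not dim.isdigit() for dim in parse_leaf(spec)["dims"])


print(parse_leaf("w0/D"), is_variable_length("w0/D"))
print(parse_leaf("w0[n]/D"), is_variable_length("w0[n]/D"))
```

A title without a dimension describes one double per entry, which matches the type named in the `SetBranchAddress` error message, while the `[n]` form describes an array whose length is taken from a counter named `n`.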