Reducing TTree/ntuple output size

My analysis code saves a number of variables to a TTree, and when I run over large amounts of data the output ntuples grow as large as 20 GB. In my current working setup (lxplus), these are simply too large to work with. For example, a macro using TTreeReader to access the data takes many hours to perform a simple task.
The ntuples are created following this guide:
https://atlassoftwaredocs.web.cern.ch/ABtutorial/basic_trees/
What can I do to make them more manageable without losing data? Is there a compression scheme available?

Hi @abunka,
ROOT compresses data by default. You can change the compression algorithm: the more you compress, the smaller the file, but also the slower the I/O (algorithms that compress more typically also take more time to compress/decompress).
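
For example, here is a minimal sketch of writing a tree with a stronger compression setting; the file, tree, and branch names are made up, and ROOT::CompressionSettings packs the algorithm and level into the single integer the TFile constructor takes:

#include <TFile.h>
#include <TTree.h>
#include <Compression.h>

void write_compressed() {
   // 4th TFile argument is the compression setting (100*algorithm + level),
   // here LZMA at level 7: compresses more than the default, but is slower
   TFile f("out.root", "RECREATE", "", ROOT::CompressionSettings(ROOT::kLZMA, 7));
   TTree t("t", "example tree");
   float x = 0.f;
   t.Branch("x", &x);
   for (int i = 0; i < 10000; ++i) {
      x = 0.1f * i;
      t.Fill();
   }
   t.Write();
}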

I think there are two questions behind your question:

  1. how large would you expect the output to be, ignoring compression (which increases runtimes anyway)? E.g. if you are saving 1 million events and each event contains 100 floats, that requires 1,000,000 × 100 × 4 bytes = 400 MB (again, before compression)
  2. what is the bottleneck when reading the data back? 20 GB of data should not require many hours to process with a simple task on multiple cores (unless you are using a single core and a slow compression algorithm). Maybe you are reading the data inefficiently over the network, e.g. from /eos directly instead of going through xrootd (root://eosuser.cern.ch/...)? Maybe you can limit the processing to just 100k events for quick tests using TTreeReader::SetEntriesRange? Or you can use RDataFrame to easily run on multiple cores (see the sketch below).
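
A minimal sketch of the RDataFrame route (the tree name, file path, and branch name are placeholders):

#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void process_mt() {
   ROOT::EnableImplicitMT(); // run the event loop on all available cores
   // "someVar" is a placeholder branch name
   ROOT::RDataFrame df("VarNTuple", "root://eosuser.cern.ch//eos/home-a/abunka/.../file.root");
   auto h = df.Histo1D("someVar");
   h->Draw(); // triggers the (parallel) event loop
}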

I hope this helps,
Enrico

Thanks for the reply,
Looking at the number of floats and the file size in a test run, I get a file of 10 MB from 81 floats and 10,000 events, which is about 3 times what I would expect for just the floats. Is it reasonable that the “supporting structure” of the file takes up that much space?
The EOS point is spot on, I didn’t even realize there was a different way to read the data apart from simply giving the EOS location.

Uhm, are these floats or arrays of floats? :grinning_face_with_smiling_eyes: 10k × 81 floats should take ~3 MB uncompressed, and even less after compression. Otherwise no, that file size is not normal; there is something to be understood.

Attach the output of: yourTree->Print();

Good catch, they are vectors of floats. The output of Print() is attached.
myTreePrint.txt (25.7 KB)
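
For context: a branch holding std::vector<float> stores a variable number of floats per entry, so each extra element multiplies the per-event payload. A minimal sketch with made-up names:

#include <TFile.h>
#include <TTree.h>
#include <vector>

void vector_branch_demo() {
   TFile f("vec.root", "RECREATE");
   TTree t("t", "vector branch demo");
   std::vector<float> pt;
   t.Branch("pt", &pt); // each entry stores every element of the vector
   for (int i = 0; i < 10000; ++i) {
      pt.assign(10, 1.f); // 10 floats per event: ~10x the payload of a scalar branch
      t.Fill();
   }
   t.Write();
}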

Ok, is the file size reasonable now? :grinning_face_with_smiling_eyes:

Yes, it makes much more sense… As far as reading the data goes, I had not heard about xrootd before. Currently, I do:

TFile *mainInput = TFile::Open("/eos/home-a/abunka/.../file.root");
TTreeReader myReader("VarNTuple", mainInput);

How would I change this to access my data more efficiently?

It should be root://eosuser.cern.ch//eos/home-a/abunka/....
But if you need to run quick tests, the largest win would come from limiting the processing to just some of the events, e.g. with myReader.SetEntriesRange(0, 10000).
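
Putting both changes together, your snippet would become something like this (the "..." in the path is elided, as in your original; the loop body is a placeholder):

#include <TFile.h>
#include <TTreeReader.h>

void read_remote() {
   TFile *mainInput = TFile::Open("root://eosuser.cern.ch//eos/home-a/abunka/.../file.root");
   TTreeReader myReader("VarNTuple", mainInput);
   myReader.SetEntriesRange(0, 10000); // quick tests: only the first 10k entries
   while (myReader.Next()) {
      // ... process one event ...
   }
}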

Cheers,
Enrico

