How to write faster with blocks of data?


ROOT Version: 6.28
Platform: CentOS7
Compiler: g++ 9.3.0


I’m trying to do an analysis with CUDA, so I moved the computation to the GPU and left the I/O on the CPU. My code looks like this:

TFile output_file("xx.root", "recreate");
TTree tree("tree", "xxx");
double data1, data2;
tree.Branch("data1", &data1, "data1/D");
tree.Branch("data2", &data2, "data2/D");
// compute
// heap storage instead of variable-length stack arrays (needs #include <vector>)
std::vector<double> data1_list(entries), data2_list(entries);
/* 
 * pseudo code
 * compute in GPU, and copy result of data1 list and data2 list from GPU
 */
ComputeAndCopyFromGPU();
// fill and write
for (size_t i = 0; i < entries; ++i) {
  data1 = data1_list[i];
  data2 = data2_list[i];
  tree.Fill();
}
tree.Write();

But I found that it takes about 20 seconds to write 750 MB. I have measured that our machine can write up to 1 GB/s under ideal conditions. So I wonder whether there is a more efficient way to fill and write the tree?

Thank you for any kind of help.

Hi,

Thanks for the post. In general, and this is not specific to ROOT but applies to all software for heterogeneous hardware, the CPUs should be kept busy while work is offloaded to the device; otherwise the utilisation of resources will always be suboptimal.
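
One way to picture "keeping the CPU busy" is double buffering: while the next batch is being computed, the CPU writes the batch that is already finished. Below is a minimal sketch using only the standard library. The names compute_batch, write_batch and run_pipeline are hypothetical; compute_batch stands in for the kernel launch plus the copy back to the host, and write_batch stands in for the TTree fill loop.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for "launch the kernel and copy the results to the host".
std::vector<double> compute_batch(std::size_t batch, std::size_t size) {
  std::vector<double> out(size);
  for (std::size_t i = 0; i < size; ++i)
    out[i] = static_cast<double>(batch * size + i);
  return out;
}

// Hypothetical stand-in for the fill loop; here it only appends to a sink.
void write_batch(const std::vector<double>& data, std::vector<double>& sink) {
  sink.insert(sink.end(), data.begin(), data.end());
}

// Compute batch b+1 asynchronously while the calling thread writes batch b.
std::vector<double> run_pipeline(std::size_t n_batches, std::size_t batch_size) {
  std::vector<double> sink;
  if (n_batches == 0) return sink;

  auto pending = std::async(std::launch::async, compute_batch,
                            std::size_t{0}, batch_size);
  for (std::size_t b = 0; b < n_batches; ++b) {
    std::vector<double> ready = pending.get();  // wait for batch b
    if (b + 1 < n_batches)
      pending = std::async(std::launch::async, compute_batch, b + 1, batch_size);
    write_batch(ready, sink);  // overlaps with the computation of batch b+1
  }
  return sink;
}
```

Because all writing stays on one thread and batches are consumed in sequence, the entries end up in the same order as in a purely serial loop.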

Then, to write data as fast as possible, I see two possible ways:

  1. Use a faster compression algorithm and a lower compression level (I am not sure what you are using now)
  2. Use parallel writing with TBufferMerger, e.g. see the tutorial tutorials/multicore/mt103_fillNtupleFromMultipleThreads.C

I hope this helps.

Best,
D

Thanks for your reply. But I think this is related to ROOT.
I’m sorry that I posted too much code and made all of this confusing. I start the stopwatch after copying the data from the GPU (the line with the comment // fill and write in my code) and stop it after writing the TTree (the last line, tree.Write()). My conclusion is that the program spends about 20 seconds running the following code:

for (size_t i = 0; i < entries; ++i) {
  data1 = data1_list[i];
  data2 = data2_list[i];
  tree.Fill();
}
tree.Write();

So it’s a ROOT-related problem.
Actually, I’m not sure whether it’s OK to fill the events in the loop one by one. It seems too simple to be right.

I use the default compression algorithm and the default level. I hadn’t thought about this before; I will check it.

As for parallel writing, I actually just asked about this. It seems that the order of events is random between threads. That’s not what I’m looking for.

Overall, thank you so much.

Hi,

Thanks for the follow up. For the sake of clarity: what I said is not related to ROOT is the fact that, during an offload to an accelerator device, the CPUs should be kept busy. ROOT can do little there; it’s a design question. Clearly, as you correctly point out, it’s ROOT’s job to be fast, and that’s what we are trying to optimise in this discussion.

As for the event ordering, there is little to be done: the design favours performance over perfect reproducibility of the event sequence, as is very often the case.

Cheers,
Danilo