Shuffling TTrees (together)

Hi all,

I’m working on creating an exercise for students where I’d like to simulate looking at binned MC background samples and comparing them against data. I’d like to create a fake “data” sample by building an Asimov dataset from the binned MC background samples.

The challenge I’m facing is that once I make an Asimov dataset by selecting an appropriate number of events from each of my binned MC samples (they are binned in the variable HT), I’d like to mix up the ordering of all the events across these partial Asimov datasets, so that looping over a small subset of the “data” sample is representative of the overall sample. (If I did something naive like hadding or TChaining them together, the first chunk of events would all have the HT of the first TTree that was added, instead of HTs representative of all my HT bins combined.)

So I’m wondering if there is a way to randomize the ordering of events inside a TChain or TTree, analogous to std::random_shuffle. I’ve tried looping over a TChain or TTree and calling GetEntry on a shuffled set of entry numbers, but this is extremely slow. Is there a more effective way to do this, e.g. some way of randomly swapping TTree entries that isn’t prohibitively slow?

Thanks,
John

If you need to do the “shuffling” only once then … create your sample as you do now and then copy your TTree to another ROOT file “shuffling” entries (i.e. use “GetEntry on a shuffled set of entry numbers” as you do now and save each read entry to the new file).
Afterwards, you will have an ordinary ROOT file with a “shuffled sample”.
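A minimal sketch of that copy loop, assuming an input tree named `inTree` (a hypothetical name) — the index shuffling is plain C++, and the ROOT I/O part is indicated in comments:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build a random permutation of entry numbers 0..nEntries-1.
std::vector<int64_t> shuffledEntries(int64_t nEntries, unsigned seed) {
    std::vector<int64_t> order(nEntries);
    std::iota(order.begin(), order.end(), 0);                // 0, 1, 2, ...
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    return order;
}

// ROOT part (sketch, not compiled here):
//   auto outFile = TFile::Open("shuffled.root", "recreate");
//   auto outTree = inTree->CloneTree(0);           // empty clone, attached to outFile
//   for (auto i : shuffledEntries(inTree->GetEntries(), 42)) {
//       inTree->GetEntry(i);                       // read in shuffled order
//       outTree->Fill();                           // write in the new order
//   }
//   outFile->Write();
```

Afterwards the output file can be read strictly sequentially, which is the fast path.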

If your tree is not too large, you could try with tree->LoadBaskets or branch->LoadBaskets. This might speed up random access massively.

Hi Wile_E_Coyote, behrenhoff

Wile_E, that’s precisely what I was doing, but it seemed unacceptably slow, even just doing it once. Even with my modest tree (~2 GB, about 400k events), it looked as though it would take many hours.

behrenhoff, I tried LoadBaskets and it seems to be running roughly 5 times as fast. It’s still orders of magnitude slower than if I try to read the TTree in order, though.

I imagine the slowdown arises from an expensive TFile::Seek() call each time an event is read out of order. Perhaps a more clever way to shuffle the TTree would be to loop in order through the tree, randomly selecting some events along the way, moving them to the end of the TTree, and then repeating this process many times? In any case, I think I can live with the performance now that I’ve used LoadBaskets, though it is a bit inconvenient.

Thanks,
John

Note that you can also use LoadBaskets when copying your TTree … so the creation of a “shuffled sample” can be faster, too.

Don’t understand how that could work.

Another idea is to split the tree, for example into 20 different files, something like this (untested):

#include <random>
#include <string>
#include <vector>

auto tree = getOriginalTree();
std::vector<TFile*> clonedFiles;
std::vector<TTree*> clonedTrees;
for (size_t i = 0; i < 20; ++i) {
    // TFile::Open makes the new file the current directory,
    // so the empty clone below is attached to it
    clonedFiles.push_back(TFile::Open(("/tmp/split_" + std::to_string(i) + ".root").c_str(), "recreate"));
    clonedTrees.push_back(tree->CloneTree(0));
}
auto entries = tree->GetEntries();
std::mt19937 mt(std::random_device{}());
std::uniform_int_distribution<size_t> dist(0, clonedTrees.size() - 1);
for (Long64_t i = 0; i < entries; ++i) {
    tree->GetEntry(i);
    clonedTrees[dist(mt)]->Fill();  // send each entry to a random output file
}
for (auto f : clonedFiles) {
    f->Write();
    delete f;  // also deletes the TTree owned by the file
}

Then shuffle the 20 files individually (still use LoadBaskets, but now they might be small enough to fit in the cache) and then hadd them together.

No idea how much faster/slower this might be :wink:

Hi,

The main sources of slow-down are decompression and seek.

So here are even more options of making things faster:

  • don’t compress the baskets (set the compression level to 0)
  • or set the compression window to one TTree entry, i.e. flush the baskets after each entry.
  • or fit the whole tree into memory - if you have plenty of it, because it needs to fit uncompressed.
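For reference, the first two options correspond to output-file / output-tree settings along these lines (a sketch, assuming an existing `inTree`; check the exact behavior for your ROOT version):

```cpp
// Option 1: no compression at all for the output file
auto outFile = TFile::Open("shuffled.root", "recreate");
outFile->SetCompressionLevel(0);      // baskets are written uncompressed

// Option 2: flush baskets after every entry, so each entry is its own
// compression unit and a random read only decompresses one entry
auto outTree = inTree->CloneTree(0);  // empty clone attached to outFile
outTree->SetAutoFlush(1);
```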

But even selecting a subset of entries and then sorting them should make things faster. I.e. you’d not randomize the entries while writing a new TTree; instead you’d draw random entry numbers from the original tree and sort those, to get a more performant access pattern. If I understood your problem correctly, this should not bias your result. But it might bias the educational effect - that’s your call :slight_smile:
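In plain C++, drawing a random subset of entry numbers and sorting them might look like this (a sketch; function and parameter names are my own):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Draw k distinct random entry numbers from [0, nEntries) and sort them,
// so the tree is still read mostly front-to-back.
std::vector<int64_t> sortedRandomSubset(int64_t nEntries, int64_t k, unsigned seed) {
    std::vector<int64_t> all(nEntries);
    std::iota(all.begin(), all.end(), 0);
    std::shuffle(all.begin(), all.end(), std::mt19937(seed));
    all.resize(k);                      // first k of a permutation = random subset
    std::sort(all.begin(), all.end());  // restore a near-sequential access pattern
    return all;
}

// Reading then stays close to sequential (sketch):
//   for (auto i : sortedRandomSubset(tree->GetEntries(), 50000, 42))
//       tree->GetEntry(i);
```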

Cheers, Axel.

Hi all,

Just to follow up in case anybody else is interested: I didn’t think a truly random shuffle would be achieved by shuffling the files pairwise without iterating many times, and I also tried changing the compression level, which didn’t seem to help at all, though I’m not sure I did it correctly (I called TFile::SetCompressionLevel(0) when opening the file). I wasn’t able to figure out how to implement Axel’s second suggestion about flushing the baskets after each entry. I also wanted to keep all my entries instead of picking only a subset, so I didn’t try that option.

However, I did manage to speed things up by somewhere between 100 and 1000 times after finding this old thread: https://root.cern.ch/root/roottalk/roottalk02/2342.html

Instead of TChaining or hadding my TTrees together, I loaded all the TFiles and TTrees separately, stuck them into a std::vector, and called LoadBaskets on all of them. Then, to randomize the ordering, I made a std::vector<std::pair<unsigned short, unsigned long long>> eventList of pairs of the form (file number, entry number) and called std::random_shuffle on it. So instead of calling TTree::GetEntry or TChain::GetEntry on randomized entry numbers, I use a call like this:
inTrees.at(eventList.at(iEvent).first)->GetEntry(eventList.at(iEvent).second);
which made things immensely faster.
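For anyone wanting to reproduce the bookkeeping, here is a sketch of building and shuffling the (file, entry) list; the list-building is plain C++, the event loop over `inTrees` is in comments. (Note std::random_shuffle was removed in C++17; std::shuffle is the replacement.)

```cpp
#include <algorithm>
#include <random>
#include <utility>
#include <vector>

using Event = std::pair<unsigned short, unsigned long long>;  // (file number, entry number)

// Build the (file, entry) list covering every entry of every tree, then shuffle once.
std::vector<Event> makeEventList(const std::vector<unsigned long long>& entriesPerTree,
                                 unsigned seed) {
    std::vector<Event> eventList;
    for (unsigned short f = 0; f < entriesPerTree.size(); ++f)
        for (unsigned long long e = 0; e < entriesPerTree[f]; ++e)
            eventList.emplace_back(f, e);
    std::shuffle(eventList.begin(), eventList.end(), std::mt19937(seed));
    return eventList;
}

// Event loop (sketch): inTrees is the std::vector of separately loaded TTrees,
// each with LoadBaskets already called.
//   for (const auto& ev : eventList)
//       inTrees.at(ev.first)->GetEntry(ev.second);
```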

Best,
John

