TTree random access

I need to shuffle entries in a (quite large, even ~10GB) TChain, so I wrote the following piece of code:

TChain ch; 
// adding files...
auto nentries = ch.GetEntries();
std::list<int> l(nentries);
std::iota(l.begin(), l.end(), 0);
std::vector<std::list<int>::iterator> v(l.size());
std::iota(v.begin(), v.end(), l.begin());
std::shuffle(v.begin(), v.end(), std::mt19937{std::random_device{}()});

auto shuffled_ch = ch.CloneTree(0);
for (const auto& i : v) {
    ch.GetEntry(*i);
    shuffled_ch->Fill();
}

This obviously has a really poor performance, so I’d like to try something to speed it up. I read in this old post about TTree::LoadBaskets and TTree::SetMaxVirtualSize (I work on a machine with a quite big RAM disk). Unfortunately TChain::LoadBaskets does not exist yet (and I don’t want to call TChain::Merge) and I’m not sure what TChain::SetMaxVirtualSize, which is inherited as it is from TTree, does. Can someone give me some help?

In “Shuffling TTrees (together)” I suggested doing it in 2 steps:
a) split randomly into many “small-enough” trees (20 was my suggestion, but you could probably use 40 - that would give you 40 trees with 250 MB each)
b) shuffle these trees individually, using LoadBaskets

At the end, join the trees or add them to your TChain.

I think that should work 🙂

By the way: why are you using a std::list here?! Just put the indices in the std::vector. Also, GetEntries returns a Long64_t, so you might want a std::vector<Long64_t> v and ranges::iota(v, 0ll) instead of int.
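A minimal sketch of that suggestion in plain standard C++, without ROOT. `Index` is a stand-in alias assumed here for `Long64_t`, and `shuffled_indices` is a hypothetical helper name, not something from ROOT. In the real loop each returned value would be passed to `ch.GetEntry(...)`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Stand-in for ROOT's Long64_t (assumption: a 64-bit signed integer).
using Index = std::int64_t;

// Hypothetical helper: returns the indices 0..nentries-1 in random order.
std::vector<Index> shuffled_indices(Index nentries, std::mt19937_64& rng) {
    std::vector<Index> v(static_cast<std::size_t>(nentries));
    std::iota(v.begin(), v.end(), Index{0});  // fill with 0, 1, 2, ...
    std::shuffle(v.begin(), v.end(), rng);    // permute in place
    return v;
}
```

Storing the indices directly avoids the extra `std::list` and the indirection through its iterators.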

Unfortunately this approach does not work for my use case. At the start I have N trees (stored in different files), and each tree holds a “different type” of event (all the trees share the same structure; by “different type” I mean intrinsic features, e.g. how the event was generated). I need to shuffle between these “types”, so I have to put all the entries in the same box at the same time and then pick a random one. Shuffling the events within each single file is therefore useless here. Also, each tree may have a different number of entries, so I can’t simply read sequentially from each of them and randomly fill another set of trees… I need to think about something smarter.

Absolutely, that was just a stupid copy-paste from a cppreference.com example 🙃

Sorry, I don’t understand why it wouldn’t work - you should have a completely random order afterwards. Have you looked at the code in my link? You split randomly into n files f_1 to f_n. Every event has a 1/n chance of ending up in file f_i. You can do this efficiently, because you iterate over your source tree just once and in order. You just need to have n output files open at the same time. Shouldn’t be a problem these days…

After this step, every file contains approx 1/n of the total events.

Only then you shuffle the events in these files (which you can preload in RAM) individually (that’s not in the linked code).
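The two steps can be sketched in plain standard C++, without ROOT, as below. Everything here is an illustration under assumptions: `Event` and `two_step_shuffle` are hypothetical names, the in-memory buckets stand in for the n output files, and `n_buckets` must be at least 1. Sources with different numbers of entries are handled the same way as a single source.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for one event; in ROOT this would be a tree entry.
struct Event {
    int source;       // which input "tree" the event came from
    long long entry;  // entry number inside that input
};

// Two-step shuffle: scatter every event into one of n_buckets buckets chosen
// uniformly at random, then shuffle each bucket and concatenate the buckets.
std::vector<Event> two_step_shuffle(const std::vector<long long>& entries_per_source,
                                    std::size_t n_buckets, std::mt19937_64& rng) {
    std::vector<std::vector<Event>> buckets(n_buckets);
    std::uniform_int_distribution<std::size_t> pick(0, n_buckets - 1);

    // Step a: one sequential pass over each source, in order.
    for (std::size_t s = 0; s < entries_per_source.size(); ++s)
        for (long long e = 0; e < entries_per_source[s]; ++e)
            buckets[pick(rng)].push_back({static_cast<int>(s), e});

    // Step b: each bucket is shuffled on its own, then appended to the output.
    std::vector<Event> out;
    for (auto& b : buckets) {
        std::shuffle(b.begin(), b.end(), rng);
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

In the ROOT version the buckets would be the n intermediate trees, and step b is where each of them can be preloaded in RAM before the random reads.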

Sorry, you are right, my brain wasn’t working properly yesterday. Well, in my case I would be happy just with adding together the n cloned trees, I guess the events in it should be already randomised as I need… So no need to save them to file too. I’ll implement that this afternoon, thanks a lot for your help and patience!

Why not create a vector of indices [0, number of events in the tree),
then randomly shuffle the vector,
then call GetEntry() using the indices from the vector?

Again, I was wrong in my last post, I need to shuffle each tree individually as you suggested. Here’s my first (not working) implementation of the whole thing:

std::vector<std::string> filelist = ...
TChain ch("fTree");
for (const auto& f : filelist) ch.Add(f.c_str());
std::vector<TTree*> clonedTrees;
for (const auto& f : filelist) clonedTrees.push_back(dynamic_cast<TTree*>(ch.Clone(0)));

auto nentries = ch.GetEntries();
std::mt19937 mt(std::random_device{}());
std::uniform_int_distribution<size_t> dist(0, clonedTrees.size()-1);
for (Long64_t i = 0; i < nentries; ++i) {
   ch.GetEntry(i);
   clonedTrees[dist(mt)]->Fill();
}

std::vector<TTree*> shTrees;
for (const auto& t : clonedTrees) {
    auto localEntries = t->GetEntries();
    shTrees.push_back(dynamic_cast<TTree*>(t->Clone(0)));
    t->LoadBaskets();
    std::vector<Long64_t> ev;
    std::iota(ev.begin(), ev.end(), localEntries);
    std::shuffle(ev.begin(), ev.end(), std::mt19937{std::random_device{}()});
    for (Long64_t i = 0; i < localEntries; ++i) {
        t->GetEntry(i);
        shTrees.back()->Fill();
    }
}

TList list;
for (const auto& t : shTrees) list.Add(t);
auto shTree = TTree::MergeTrees(&list);
shTree->SetName("fTree");

But the problem here is that clonedTrees[dist(mt)]->Fill() does nothing because TChain::Fill isn’t implemented, and dynamic_cast<TTree*>(ch.Clone(0)) is still a TChain*. The only difference between my case and your example is that I don’t have a big TTree at the beginning but rather n files, and I still need to “link” each tree in clonedTrees to a global TTree when calling clonedTrees[dist(mt)]->Fill()… Any suggestion?

What a dumbass, I was using Clone instead of CloneTree

Here’s my (now working) implementation:

    auto filelist = ...
    std::vector<TTree*> ttreelist;
    std::vector<TFile*> tfilelist;
    std::vector<TFile*> outtfilelist;
    std::vector<TTree*> clonedTrees;
    // One RNG for the whole scatter step instead of reseeding for every input tree.
    std::mt19937 mt(std::random_device{}());

    // Open every input tree; for each one, open a temporary output file and
    // create an empty intermediate tree with the same structure.
    for (int i = 0; i < (int)filelist.size(); ++i) {
        tfilelist.push_back(TFile::Open(filelist[i].c_str()));
        ttreelist.push_back(dynamic_cast<TTree*>(tfilelist.back()->Get("fTree")));
        outtfilelist.push_back(TFile::Open(("/tmp/t4ztmp_" + std::to_string(i) + ".root").c_str(), "RECREATE"));
        // CloneTree(0) copies the structure only and already returns a TTree*.
        clonedTrees.push_back(ttreelist[0]->CloneTree(0));
    }

    // Step a: read each input sequentially and send every entry to a
    // randomly chosen intermediate tree.
    std::uniform_int_distribution<size_t> dist(0, clonedTrees.size()-1);
    for (int i = 0; i < (int)ttreelist.size(); ++i) {
        // Point the branches of every intermediate tree at this input's buffers.
        for (auto&& t : clonedTrees) ttreelist[i]->CopyAddresses(t);
        auto localEntries = ttreelist[i]->GetEntries();
        for (Long64_t j = 0; j < localEntries; ++j) {
            ttreelist[i]->GetEntry(j);
            clonedTrees[dist(mt)]->Fill();
        }
        // Undo the address sharing before the input file is closed.
        for (auto&& t : clonedTrees) ttreelist[i]->CopyAddresses(t, true);
        tfilelist[i]->Close();
    }

    // Step b: shuffle each intermediate tree individually into the final tree.
    TTree* shTree = clonedTrees[0]->CloneTree(0);
    for (int i = 0; i < (int)clonedTrees.size(); ++i) {
        auto localEntries = clonedTrees[i]->GetEntries();

        // Random permutation of this tree's entry numbers.
        std::vector<Long64_t> ev(localEntries);
        std::iota(ev.begin(), ev.end(), 0);
        std::shuffle(ev.begin(), ev.end(), std::mt19937{std::random_device{}()});

        clonedTrees[i]->CopyAddresses(shTree);
        // Preload the baskets so that the random reads below do not hit the disk.
        clonedTrees[i]->LoadBaskets();
        for (Long64_t j = 0; j < localEntries; ++j) {
            clonedTrees[i]->GetEntry(ev[j]);
            shTree->Fill();
        }
        clonedTrees[i]->CopyAddresses(shTree, true);
        clonedTrees[i]->Delete();
    }

The only caveat is that the original trees must be smaller than fMaxVirtualSize.
