ROOT Version: v6-22-00-patches@v6-22-02-155-g2c44aae445
Platform: Linux (Ubuntu20 and CentOS7)
Compiler: GCC for Linux 9.3.0
Does anyone have any good ideas for how I can split a ROOT file into train, validation, and test files?
The current file I’m working with has 1 tree with multiple branches and I would like to split it into 3 separate files.
What I’m currently doing is loading each branch with a reader, pushing the value from the reader back into a vector, then at the end saving them all to 3 ROOT files. This feels inefficient. I would ideally like a way to split them without having to read and push back each entry individually. For example, can I take the “first entry” of a tree and save that instead of the first entry of each branch? Or even just call some kind of “split” function?
It would also help if I could generalise this to a root file with multiple directories and trees but just for a single tree is all I need for now…
Wow okay these look pretty powerful. The only issue I could see myself having: How to get every 3rd event in rooteventselector? Would I have to call it many times with the same destination but like so:
I could of course get the number of events in the file and then take the first third, second third, and final third, but I feel it’s better to alternate in a TVT split.
Hi @lgolino , rooteventselector & co. are command line tools so you can use them e.g. from bash scripts.
If you need something more than they offer, you will have to write a C++ or Python program (for instance, I don’t think rooteventselector lets you pick one entry out of every 3).
For those more complicated tasks the most high-level but still flexible interface is RDataFrame, e.g. for the split you want to do you might write:
import ROOT
# for simplicity do not use lazy Snapshots and run 3 event loops
df = ROOT.RDataFrame("treename", "source.root")
df.Filter("rdfentry_ % 3 == 0").Snapshot("outputtree", "train.root")
df.Filter("rdfentry_ % 3 == 1").Snapshot("outputtree", "valid.root")
df.Filter("rdfentry_ % 3 == 2").Snapshot("outputtree", "test.root")
That snippet performs 3 separate loops over the data, one per Snapshot call, for simplicity – but you can also pass options to Snapshot to make it lazy and perform the 3 writes during the same read.
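For reference, here is a rough sketch of what the lazy variant might look like. It assumes `ROOT.RDF.RSnapshotOptions` with its `fLazy` flag, and that implicit multi-threading is not enabled (with multi-threading, `rdfentry_` does not correspond to the global entry number in the tree). The helper names `split_filters` and `split_tree_lazy` are made up for illustration, and the tree and file names are placeholders as in the snippet above.

```python
def split_filters(n_parts=3):
    """Return one filter expression per output file (hypothetical helper)."""
    return [f"rdfentry_ % {n_parts} == {i}" for i in range(n_parts)]


def split_tree_lazy(source, treename, outputs, outtree="outputtree"):
    """Write every len(outputs)-th entry to each output file in one event loop."""
    import ROOT  # imported here so split_filters can be used without ROOT

    df = ROOT.RDataFrame(treename, source)

    opts = ROOT.RDF.RSnapshotOptions()
    opts.fLazy = True  # book the write, do not run the event loop yet

    # Third argument is the column-selection regex; "" keeps all columns.
    handles = [
        df.Filter(expr).Snapshot(outtree, out, "", opts)
        for expr, out in zip(split_filters(len(outputs)), outputs)
    ]

    # Accessing one result triggers a single event loop, which also fills
    # the other snapshots booked on the same RDataFrame.
    handles[0].GetValue()
    return handles
```

You would then call something like `split_tree_lazy("source.root", "treename", ["train.root", "valid.root", "test.root"])`. Generating the filter strings in one place also avoids copy-paste slips in the remainder values.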