Train, test, validation splitting of a root file


_ROOT Version: v6-22-00-patches@v6-22-02-155-g2c44aae445
_Platform: Linux (Ubuntu20 and CentOS7)
_Compiler: GCC for Linux 9.3.0


Does anyone have any good ideas for how I can split a ROOT file into train, validation, and test files?

The current file I’m working with has 1 tree with multiple branches and I would like to split it into 3 separate files.
What I’m currently doing is loading each branch with a reader, pushing back a vector with the value from the reader, then at the end saving them all to 3 ROOT files. This feels inefficient. I would ideally like a way to split them without having to read and push back each entry individually. For example, can I take the “first entry” of a tree and save that instead of the first entry of each branch? Or even just call some kind of “split” function?

It would also help if I could generalise this to a root file with multiple directories and trees but just for a single tree is all I need for now…

Any help, advice etc greatly appreciated.

Cheers.

rooteventselector -h
rootslimtree -h
rootcp -h

What are these tools? Do they have a website? I’ve never seen these before

They come with ROOT. Just try to execute them.

Wow okay these look pretty powerful. The only issue I could see myself having: How to get every 3rd event in rooteventselector? Would I have to call it many times with the same destination but like so:

rooteventselector -f 0 -l 1 source.root train.root
rooteventselector -f 1 -l 2 source.root valid.root
rooteventselector -f 2 -l 3 source.root test.root

rooteventselector -f 3 -l 4 source.root train.root
rooteventselector -f 4 -l 5 source.root valid.root
rooteventselector -f 5 -l 6 source.root test.root

... train
... valid
... test

etc

I could of course get the number of events in the file, then take the first third, second third, and final third, but I feel it’s better to alternate in a TVT split.

Also, is there any way to call these in ROOT so I can make them part of a macro, or will I have to make a bash script to run what I need?

Hi @lgolino ,
rooteventselector & co. are command line tools so you can use them e.g. from bash scripts.
If you need something more than they offer, you will have to write a C++ or Python program (for instance I don’t think rooteventselector lets you pick one entry every 3).

For those more complicated tasks the most high-level but still flexible interface is RDataFrame, e.g. for the split you want to do you might write:

import ROOT

# for simplicity do not use lazy Snapshots and run 3 event loops
df = ROOT.RDataFrame("treename", "source.root")
df.Filter("rdfentry_ % 3 == 0").Snapshot("outputtree", "train.root")
df.Filter("rdfentry_ % 3 == 1").Snapshot("outputtree", "valid.root")
df.Filter("rdfentry_ % 3 == 2").Snapshot("outputtree", "test.root")

That snippet performs 3 separate loops over the data, one per Snapshot call, for simplicity – but you can also pass options to Snapshot to make it lazy and perform the 3 writes during the same read.
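For reference, a minimal sketch of the lazy variant could look like the following. The helper names `split_filters` and `snapshot_splits` are made up for illustration; the ROOT calls are `ROOT.RDF.RSnapshotOptions` with `fLazy` set to true, passed as the fourth argument of `Snapshot`. It assumes implicit multi-threading is not enabled, since as far as I know `rdfentry_` is not guaranteed to match the tree entry number in multi-thread runs.

```python
def split_filters(outputs, column="rdfentry_"):
    """Map each output file name to a filter that keeps every len(outputs)-th entry.

    Hypothetical helper: with 3 outputs, entry i goes to outputs[i % 3].
    """
    n = len(outputs)
    return {name: f"{column} % {n} == {i}" for i, name in enumerate(outputs)}


def snapshot_splits(treename, source, outputs, out_tree="outputtree"):
    """Book one lazy Snapshot per output file, then run a single event loop."""
    import ROOT  # imported here so split_filters can be used without ROOT

    opts = ROOT.RDF.RSnapshotOptions()
    opts.fLazy = True  # book the write only; do not start the event loop yet

    df = ROOT.RDataFrame(treename, source)
    # Keep the returned handles alive until the event loop has run.
    results = [
        df.Filter(expr).Snapshot(out_tree, fname, "", opts)
        for fname, expr in split_filters(outputs).items()
    ]
    # Accessing one result triggers the event loop; all booked Snapshots
    # are filled during that same pass over the input.
    results[0].GetValue()
    return results
```

Calling `snapshot_splits("treename", "source.root", ["train.root", "valid.root", "test.root"])` should then write the three files while reading the input only once.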

Cheers,
Enrico
