RDataFrame: Force writing data frame with zero entries to file

Dear Experts,

I need to post-process several ntuples, and I am using RDataFrame for that. In the post-processing I want to create new branches, remove others, and merge several files. This is applied to hundreds of ntuples, some of which are empty.

When the input tree is empty, the Snapshot method prints a warning but does not write the structure of the dataframe (with zero entries) to the file. This is inconvenient when processing many of these outputs afterwards, because I want to chain all the files.

Is there a way to force the creation of the TTree in the file when calling the Snapshot method?

The same happens when creating an RDataFrame, applying a filter that leaves zero events, and then trying to save it. A short script showing this is below.

import ROOT

# Create a simple RDF with 100 entries
n = 100
df = ROOT.RDataFrame(n)

# Define some new columns
df = df.Define("x", "rdfentry_")  # just entry index
df = df.Define("y", "x * x")      # square of entry index

# Filter that rejects every entry (x only runs from 0 to 99)
df3 = df.Filter("x>100")

# Save the dataframe to a ROOT file
df.Snapshot("output", "myFile.root", ["x", "y"])
df3.Snapshot("output", "myFile3.root", ["x", "y"])

Is there some trick to save the empty RDataFrame?

Best,
Francisco


ROOT Version: >6.32


I don’t know whether empty dataframes can be saved, but if they can’t, I would suggest a workaround: check whether the dataframe is empty and, if so, add a single entry with values that are clearly impossible in the ‘real’ dataset (always the same value, at least for a given variable), so that you can easily filter these events out later. For example, if x and y are always positive, fill the entry with -9. Alternatively, you could add another column as a flag, a boolean for instance, signalling that the event should be ignored; you would then have to add this column to all other entries in all dataframes as well, to mark them as ‘usable’.
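
A minimal sketch of how this workaround could look in PyROOT, in case it helps. The helper name `snapshot_or_placeholder`, the `make_placeholder` callback, the `is_placeholder` column and the value -9 are all made up for illustration; the only RDataFrame calls assumed are `Count`, `Define`, `Filter` and `Snapshot`. Note that counting the entries first triggers one extra event loop, and that the placeholder must define the same columns with the same types as the real data.

```python
def snapshot_or_placeholder(df, make_placeholder, tree_name, file_name, columns):
    """Snapshot `df`, or a one-entry placeholder if `df` has no entries.

    `df` and the object returned by `make_placeholder()` are expected to be
    RDataFrame nodes that provide the same `columns` with the same types.
    Returns True when the placeholder was written instead of `df`.
    """
    # Count() is lazy; GetValue() runs the event loop once to get the number.
    is_empty = df.Count().GetValue() == 0
    source = make_placeholder() if is_empty else df
    source.Snapshot(tree_name, file_name, columns)
    return is_empty


def demo():
    """Hypothetical usage; needs a ROOT installation, so it is not called here."""
    import ROOT

    base = (ROOT.RDataFrame(100)
            .Define("x", "(double)rdfentry_")
            .Define("y", "x * x")
            .Define("is_placeholder", "false"))  # real entries are usable
    empty = base.Filter("x > 100")               # rejects every entry

    def make_placeholder():
        # One entry with values that cannot occur in the real data (x, y >= 0).
        return (ROOT.RDataFrame(1)
                .Define("x", "-9.0")
                .Define("y", "-9.0")
                .Define("is_placeholder", "true"))

    columns = ["x", "y", "is_placeholder"]
    snapshot_or_placeholder(empty, make_placeholder, "output", "myFile3.root", columns)
```

When the outputs are chained later, the placeholder entries could then be dropped again with something like `Filter("!is_placeholder")`.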

Hi @fsili,

I have tried your reproducer with the current master branch of ROOT. Both files are saved, and I can open them again and see the columns you saved. In myFile3.root, as expected, the RDF is empty, but the columns “x” and “y” are there.

>>> import ROOT
>>> newdf = ROOT.RDataFrame("output","myFile3.root")
>>> newdf.Display().Print()
+-----+---+---+
| Row | x | y | 
+-----+---+---+
>>> newdf1 = ROOT.RDataFrame("output","myFile.root")
>>> newdf1.Display().Print()
+-----+---+----+
| Row | x | y  | 
+-----+---+----+
| 0   | 0 | 0  | 
+-----+---+----+
| 1   | 1 | 1  | 
+-----+---+----+
| 2   | 2 | 4  | 
+-----+---+----+
| 3   | 3 | 9  | 
+-----+---+----+
| 4   | 4 | 16 | 
+-----+---+----+

I suggest that you first update your ROOT version to 6.36.02 (latest stable) and check if the issue is solved. Otherwise, maybe I don’t fully understand your problem.
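
If the post-processing runs on several machines, a small guard at the top of the script could flag installations older than the version that worked here. This is only a sketch: the helper names are hypothetical, and 6.36.02 is simply the release known to work in this thread, not necessarily the first release with this behaviour.

```python
import re

# Release reported to write empty trees in this thread (an assumption, not
# necessarily the first release in which the behaviour changed).
KNOWN_GOOD = (6, 36, 2)


def parse_root_version(version):
    """Turn a version string such as '6.36.02' or '6.30/04' into a tuple of ints."""
    return tuple(int(part) for part in re.split(r"[./]", version.strip()))


def is_known_good(version):
    """True if `version` is at least the release that worked in this thread."""
    return parse_root_version(version) >= KNOWN_GOOD


# In a real script the string would come from the installation, for example:
#   import ROOT
#   installed = str(ROOT.gROOT.GetVersion())
for installed in ("6.34.00", "6.36.02"):
    status = "ok" if is_known_good(installed) else "consider updating"
    print(f"ROOT {installed}: {status}")
```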

Cheers,
Marta


Hi @mczurylo,

Thank you very much for your response. Yes, indeed, with 6.36.02 it works now. I was previously using ROOT version 6.34.

I will update the version then and everything should be working.

Thanks a lot!!!

Best,
Fran
