Dear ROOT experts,
I have a question regarding RDF and handling missing input files. In my work, I sometimes use data files that are stored remotely and accessed via xrootd. I create an RDataFrame from a TChain, and when adding files to the TChain, I check their availability.
Occasionally, files are available when they are added to the TChain but become unavailable at the time of analysis (e.g., when I run hist = df.Histo1D(("", "", 10, -.5, 9.5), "dummy") and hist.Draw()). I suspect this issue might be related to xrootd instabilities.
Here is a toy example to reproduce the problem.
import ROOT
from array import array
import os

# create dummy input files
for i in range(10):
    file = ROOT.TFile(f"output_{i}.root", "RECREATE")
    tree = ROOT.TTree("tree", "A simple TTree")
    dummy_value = array('i', [0])
    tree.Branch("dummy", dummy_value, "dummy/I")
    for _ in range(1000):
        dummy_value[0] = i
        tree.Fill()
    tree.Write()
    file.Close()

# check if files are available when they are added to the TChain
chain = ROOT.TChain("tree")
for i in range(10):
    if not chain.Add(f"output_{i}.root", 0):
        raise OSError(f"file output_{i}.root is missing")
df = ROOT.RDataFrame(chain)

# delete one input file
# it simulates the xrootd instability
os.remove("output_8.root")

hist = df.Histo1D(("", "", 10, -.5, 9.5), "dummy")
canvas = ROOT.TCanvas("canvas", "Histogram Canvas", 800, 600)
hist.Draw()
ROOT prints the message Error in <TFile::TFile>: file output_8.root does not exist, but the histogram is still created from the other available files.
As far as I understand, ROOT’s internal mechanisms print warnings in such cases without raising a Python exception by default.
When I run the RDF analysis on the cluster, it appears that all jobs finish properly, but I might still be missing some input files.
I might be missing something basic, but is there a way to force a Python exception when this happens? Or some workaround?
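One workaround that comes to mind, as a rough sketch only: record the total number of entries the TChain reports right after the files are added, book a df.Count() together with the histogram, and compare the two numbers once the event loop has run. The helper name check_processed_entries is hypothetical, and the approach assumes that a file which disappears simply contributes no entries to the count, which matches what the toy example suggests but is not something I have verified for remote xrootd files.

```python
def check_processed_entries(expected: int, processed: int) -> None:
    """Raise if the event loop saw a different number of entries than expected.

    Hypothetical helper: `expected` is the entry count recorded when the
    chain was built, `processed` is the count seen by the event loop.
    """
    if processed != expected:
        raise RuntimeError(
            f"processed {processed} entries but expected {expected}; "
            "some input files were probably unreadable"
        )


# Intended use with the toy example (left as comments so the helper can be
# read on its own; assumes `chain` and `df` as defined in the example):
#
#   expected = chain.GetEntries()   # right after the chain.Add(...) loop
#   n_processed = df.Count()        # booked before the event loop is triggered
#   hist = df.Histo1D(("", "", 10, -.5, 9.5), "dummy")
#   hist.Draw()                     # triggers the event loop once
#   check_processed_entries(expected, n_processed.GetValue())
```

Booking the Count() before the first result is accessed should keep everything in a single event loop, so the check would cost no extra pass over the data. It only detects the problem after the fact, though, so an exception raised directly by ROOT would still be preferable.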
Thanks a lot for your advice.