Handling "Too many open files" in RDataFrame

Dear ROOT experts,

In order to prepare my data for the fitting procedure I want to create a lot of histograms for the systematic variations. Each systematic variation of each process is stored in a separate file, so I need to create one histogram per file for a large number of files. I want to take advantage of the RDataFrame multithreading capabilities, so I’ve written a program that boils down to this:

import ROOT

ROOT.EnableImplicitMT(10)

handle_list = []
file_count = 0
for file in file_list:
    file_count += 1
    print(file_count)
    df = ROOT.RDataFrame('tree', file)
    handle_list.append(df.Histo1D(("h", "h", 10, 0, 100), "observable", "weight"))

ROOT.RDF.RunGraphs(handle_list)

file = ROOT.TFile('output.root', 'recreate')
for hist_handle in handle_list:
    hist = hist_handle.GetValue()
    hist.Write()
file.Close()

This approach fails in the following way

2123
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
2125
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/ZZ_QCDJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/ZZ_EWKJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/ZZ_QCDJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/ZZ_QCDJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
Traceback (most recent call last):
  File "create_fit_histograms.py", line 373, in <module>
  File "create_fit_histograms.py", line 345, in main
  File "create_fit_histograms.py", line 282, in get_hist_hande_list
cppyy.gbl.std.runtime_error: Template method resolution failed:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(experimental::basic_string_view<char,char_traits<char> > expression, experimental::basic_string_view<char,char_traits<char> > name = "") =>
    runtime_error: GetBranchNames: error in opening the tree tree_3lCR_PFLOW
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(experimental::basic_string_view<char,char_traits<char> > expression, experimental::basic_string_view<char,char_traits<char> > name = "") =>
    runtime_error: GetBranchNames: error in opening the tree tree_3lCR_PFLOW
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(experimental::basic_string_view<char,char_traits<char> > expression, experimental::basic_string_view<char,char_traits<char> > name = "") =>
    runtime_error: GetBranchNames: error in opening the tree tree_3lCR_PFLOW

I wasn’t expecting the files to stay open between the histogram request and the graph execution, but apparently they do, and even so, 2123 open files seems like too many.
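For reference, the ceiling here should come from the operating system's per-process RLIMIT_NOFILE limit. A quick sketch (POSIX only, not part of my actual script) to inspect it, and raise the soft limit up to the hard one, using Python's standard resource module:

```python
import resource

# Per-process limit on open file descriptors (POSIX only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# A process may raise its own soft limit up to the hard limit
# without any special privileges.
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```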

I’ve tried to modify the code to make it run on chunks of the initial file list

import ROOT

ROOT.EnableImplicitMT(10)

for file_list_chunk in chunk_list:
    handle_list = []
    file_count = 0
    for file in file_list_chunk:
        file_count += 1
        print(file_count)
        df = ROOT.RDataFrame('tree', file)
        handle_list.append(df.Histo1D(("h", "h", 10, 0, 100), "observable", "weight"))

    ROOT.RDF.RunGraphs(handle_list)

    file = ROOT.TFile('output.root', 'recreate')
    for hist_handle in handle_list:
        hist = hist_handle.GetValue()
        hist.Write()
    file.Close()

It runs fine for the first chunk of 1246 files, but fails on the 508th file of the second chunk:

508
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
512
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/WtJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/SingleTopJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/ttbarJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/ttVJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/WtJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/convertDatasets/../SlimmedCorrected_Nov2022_nodouble/../SlimmedCorrected_Nov2022_nodouble_syst/JET/WtJET_EffectiveNP_Mixed1__1down.root can not be opened for reading Too many open files
Traceback (most recent call last):
  File "create_fit_histograms.py", line 365, in <module>
  File "create_fit_histograms.py", line 339, in main
  File "create_fit_histograms.py", line 282, in get_hist_hande_list
cppyy.gbl.std.runtime_error: Template method resolution failed:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(experimental::basic_string_view<char,char_traits<char> > expression, experimental::basic_string_view<char,char_traits<char> > name = "") =>
    runtime_error: GetBranchNames: error in opening the tree tree_PFLOW
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(experimental::basic_string_view<char,char_traits<char> > expression, experimental::basic_string_view<char,char_traits<char> > name = "") =>
    runtime_error: GetBranchNames: error in opening the tree tree_PFLOW
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Filter(experimental::basic_string_view<char,char_traits<char> > expression, experimental::basic_string_view<char,char_traits<char> > name = "") =>
    runtime_error: GetBranchNames: error in opening the tree tree_PFLOW

This is really puzzling to me:

  1. Why do the files remain open after the first iteration of the for loop has ended?
  2. Why is the number of allowed open files ~400 lower than in the first case?

How should I go about creating such code?

Here is also the full version of the code used to get the error above: create_fit_histograms_simple.py (12.9 KB). The input files themselves total about 100 GB, and I can share them privately if needed.

Best regards,
Aleksandr


ROOT Version: 6.26/04

Maybe @vpadulan can help you with this.

Dear Aleksandr,

Thanks for posting.
One of the principles behind RDataFrame is laziness. This design lets you attach as many actions as required and then execute all of them in a single loop over the data. In your first snippet you initialize the RDataFrame instances, but the loop over the events contained in each file only happens when you extract the result, in this case the histogram, from the RResultPtr instances you store in handle_list.
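As a plain-Python sketch of the principle (a toy stand-in, not the real RDataFrame machinery): a lazy handle records the computation and only runs it, once, when the value is actually requested:

```python
class LazyResult:
    """Toy stand-in for RDataFrame's RResultPtr: the computation stays
    pending (and its data source stays attached) until GetValue()."""

    def __init__(self, source, action):
        self.source = source   # e.g. an open input file
        self.action = action   # the deferred computation
        self._done = False
        self._value = None

    def GetValue(self):
        if not self._done:     # the "event loop" runs only once
            self._value = self.action(self.source)
            self._done = True
        return self._value

result = LazyResult([1, 2, 3], sum)  # nothing computed yet
assert result._done is False
assert result.GetValue() == 6        # the loop is triggered here
assert result.GetValue() == 6        # cached: no second loop
```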

My proposal to solve your issue would be to loop over each file and write each histogram directly to a file, for example by remixing your original demonstrator a bit:


import ROOT

with ROOT.TFile("file1.root", "recreate") as outputfile:
    for idx, file in enumerate(file_list):
        df = ROOT.RDataFrame('tree', file)
        h_name = "h_%s" % idx
        h_res = df.Histo1D((h_name, "h", 10, 0, 100), "observable", "weight")
        h = h_res.GetValue()  # here the event loop runs
        outputfile.WriteObject(h, h_name)

I hope this helps.

Cheers,
D

Dear D,

Thank you for your reply. The laziness is exactly the feature I want to take advantage of. I don’t need to run many loops over one tree (the main purpose of this feature, as I understand it), but I know from previous experience with Snapshot, and have seen here with Histo1D, that collecting the handles of the lazy single-loop actions for multiple files and then running them all at once with RunGraphs makes use of the RDataFrame multithreading capabilities and significantly speeds up the computations.
I wonder if your approach allows for this, given the multiple consecutive calls to GetValue. Wouldn’t a lot of time be spent compiling the same RDataFrame graph over and over again? If so, wouldn’t TTree::Draw() be a better solution? But once again, I would like to solve this with RDataFrame, since it has proven to be faster for such tasks when set up correctly.

Best regards,
Aleksandr

Dear Aleksandr,

I agree with you: RDataFrame is the most efficient way to deal (read and write) with columnar datasets!
I prepared an example for you that hopefully mimics your case a bit better and reuses the same graph (jitted once) over and over. Once the fake input is created, the execution is blazing fast:

import ROOT

# ----- all this is to create fake input
from multiprocessing import Pool
def createFakeInput(filename):
    import ROOT
    df = ROOT.RDataFrame(128)
    df.Define("observable", "gRandom->Gaus(5)")\
    .Define("weight", "gRandom->Uniform(1)")\
    .Snapshot("tree",filename)
    return 0

def createFakeInputs(filenames):
    with Pool(8) as p:
        print(p.map(createFakeInput, filenames))

if __name__ == '__main__':

    filenames = ["file_%s.root" %x for x in range(0,128)]

    createFakeInputs(filenames)

    for filename in filenames:
        print("Processing", filename)
        df = ROOT.RDataFrame('tree', filename)
        h_res = df.Histo1D['double', 'double'](("h", "h", 10, 0, 100), "observable", "weight")
        h = h_res.GetValue() # here the loop is started

I hope this helps!

Cheers,
Danilo

Dear Danilo,

Thank you for your replies! I’ve done the following test to check our approaches.
First, I’ve created some input files based on your code:

from multiprocessing import Pool
def createFakeInput(filename):
    import ROOT
    df = ROOT.RDataFrame(128)
    df.Define("observable", "gRandom->Gaus(5)")\
        .Define("weight", "gRandom->Uniform(1)")\
        .Snapshot("tree",filename)
    return 0

def createFakeInputs(filenames):
    with Pool(8) as p:
        print(p.map(createFakeInput, filenames))

if __name__ == '__main__':
    filenames = ["input/file_%s.root" %x for x in range(0,3000)]
    createFakeInputs(filenames)

Second, I’ve tried two approaches:

  1. The one suggested by you (in the table below I’ll call it “No RunGraphs()”)
import glob
import ROOT

ROOT.EnableImplicitMT(10)

max_files = 1000
filenames = glob.glob('input/*.root')[:max_files]
outputfile = ROOT.TFile("output/file1.root", "recreate")
for idx, file in enumerate(filenames):
    df = ROOT.RDataFrame('tree', file)
    h_name = "h_%s" %idx
    h_res = df.Histo1D((h_name, "h", 10, 0, 100), "observable", "weight")
    h = h_res.GetValue()
    outputfile.WriteObject(h, h_name)
outputfile.Close()
  2. The one described in my first post (I’ll call it “RunGraphs()”)
import glob
import ROOT

ROOT.EnableImplicitMT(10)

max_files = 1500
filenames = glob.glob('input/*.root')[:max_files]
handle_list = []
for idx, file in enumerate(filenames):
    df = ROOT.RDataFrame('tree', file)
    h_name = "h_%s" %idx
    h_res = df.Histo1D((h_name, "h", 10, 0, 100), "observable", "weight")
    handle_list.append(h_res)

ROOT.RDF.RunGraphs(handle_list)

outputfile = ROOT.TFile("output/file1.root", "recreate")
for hist_handle in handle_list:
    hist = hist_handle.GetValue()
    hist.Write()
outputfile.Close()

I’ve run them 3 times for

  1. max_files equal to 128 and 1000
  2. With and without ROOT.EnableImplicitMT(10)

Here are the results:

# files                    128            1000
No RunGraphs(), no MT      20.5 ± 0.4 s   117.1 ± 0.9 s
No RunGraphs(), MT         22.5 ± 0.4 s   134.4 ± 1.1 s
RunGraphs(), no MT         4.9 ± 0.2 s    8.9 ± 0.1 s
RunGraphs(), MT            6.3 ± 1.1 s    25.4 ± 0.1 s

So, even though multithreading seems detrimental here, using RunGraphs() significantly speeds up the calculations, so I would prefer to keep it.

However, I wasn’t able to replicate the problem I faced in the original post with this fake input. The following code runs perfectly fine:

import glob
import ROOT

def chunkify(lst,n):
    return [lst[i::n] for i in range(n)]

max_files = 3000
filenames = glob.glob('input/*.root')[:max_files]
chunk_list = chunkify(filenames, 4)
for chunk in chunk_list:
    handle_list = []
    for idx, file in enumerate(chunk):
        df = ROOT.RDataFrame('tree', file)
        h_name = "h_%s" %idx
        h_res = df.Histo1D((h_name, "h", 10, 0, 100), "observable", "weight")
        handle_list.append(h_res)

    ROOT.RDF.RunGraphs(handle_list)

    outputfile = ROOT.TFile("output/file1.root", "recreate")
    for hist_handle in handle_list:
        hist = hist_handle.GetValue()
        hist.Write()
    outputfile.Close()

So what could’ve caused the problem with the files still being open? And what could be a way around it?

I think I was able to find the issue. This is the code that is closer to the actual code that I’m using. The example is a bit contrived, but it illustrates the problem. The inputs are from the first code snippet of my previous post.

import glob
import ROOT

def chunkify(lst,n):
    return [lst[i::n] for i in range(n)]

max_files = 3000
filenames = glob.glob('input/*.root')[:max_files]
chunk_list = chunkify(filenames, 4)
for chunk in chunk_list:
    handle_dict = {}
    for idx, file in enumerate(chunk):
        handle_dict[file] = []
        df = ROOT.RDataFrame('tree', file)
        h_name = "h_%s" %idx
        h_res = df.Histo1D((h_name, "h", 10, 0, 100), "observable", "weight")
        handle_dict[file].append(h_res)

    handle_list_to_run = []
    for handles in handle_dict.values():
        handle_list_to_run += handles
    ROOT.RDF.RunGraphs(handle_list_to_run)

    outputfile = ROOT.TFile("output/file1.root", "recreate")
    for handle_list in handle_dict.values():
        for hist_handle in handle_list:
            hist = hist_handle.GetValue()
            hist.Write()
    outputfile.Close()
print("Ended")

In the actual code I want to create as many histograms as I can and later sort them by process and selection. So the RResultPtr<TH1D> objects are stored in the dictionary handle_dict. To run the computations I need to collect all of them into one list, handle_list_to_run in my case. Later, I use the dictionary once again to store the histograms in the output files.
The problem seems to arise from variable scoping in Python. When the next iteration of the for chunk in chunk_list: loop starts, handle_list_to_run from the previous iteration still persists! And it contains all of the RResultPtr<TH1D> objects, keeping all of the files open. So this time I get the following error:

Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libHist.so for shared_ptr<TH1D>
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
Error in <TInterpreter::TCling::AutoLoad>: failure loading library libROOTDataFrame.so for ROOT::Internal::RDF::ActionTags::Histo1D
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/throwaway_scripts/rdf_many_hists/input/file_1975.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/throwaway_scripts/rdf_many_hists/input/file_1975.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/throwaway_scripts/rdf_many_hists/input/file_1975.root can not be opened for reading Too many open files
SysError in <TFile::TFile>: file /mnt/c/Users/Alex/cernbox/IncZZ/throwaway_scripts/rdf_many_hists/input/file_1975.root can not be opened for reading Too many open files
Traceback (most recent call last):
  File "write_file_chunks_error.py", line 16, in <module>
  File "/home/alex/root/lib/ROOT/_pythonization/_rdataframe.py", line 219, in _histo_profile
TypeError: Template method resolution failed:
  none of the 4 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(experimental::basic_string_view<char,char_traits<char> > vName) =>
    TypeError: takes at most 1 arguments (3 given)
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(const ROOT::RDF::TH1DModel& model, experimental::basic_string_view<char,char_traits<char> > vName, experimental::basic_string_view<char,char_traits<char> > wName) =>
    runtime_error: GetBranchNames: error in opening the tree tree
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(experimental::basic_string_view<char,char_traits<char> > vName, experimental::basic_string_view<char,char_traits<char> > wName) =>
    TypeError: takes at most 2 arguments (3 given)
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(const ROOT::RDF::TH1DModel& model = {"", "", 128U, 0., 0.}, experimental::basic_string_view<char,char_traits<char> > vName = "") =>
    TypeError: takes at most 2 arguments (3 given)
  none of the 4 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(experimental::basic_string_view<char,char_traits<char> > vName) =>
    TypeError: takes at most 1 arguments (3 given)
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(const ROOT::RDF::TH1DModel& model, experimental::basic_string_view<char,char_traits<char> > vName, experimental::basic_string_view<char,char_traits<char> > wName) =>
    runtime_error: GetBranchNames: error in opening the tree tree
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(experimental::basic_string_view<char,char_traits<char> > vName, experimental::basic_string_view<char,char_traits<char> > wName) =>
    TypeError: takes at most 2 arguments (3 given)
  ROOT::RDF::RResultPtr<TH1D> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Histo1D(const ROOT::RDF::TH1DModel& model = {"", "", 128U, 0., 0.}, experimental::basic_string_view<char,char_traits<char> > vName = "") =>
    TypeError: takes at most 2 arguments (3 given)
  Failed to instantiate "Histo1D(ROOT::RDF::TH1DModel*,std::string,std::string)"
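The scoping behavior itself can be demonstrated without ROOT at all: a Python for loop does not create a new scope, so a name bound inside the body is still alive during later iterations and after the loop ends. A minimal sketch, with plain objects standing in for the handles:

```python
survivors = None  # stands in for handle_list_to_run

for chunk in range(2):
    objs = [object() for _ in range(3)]  # stands in for the RResultPtr handles
    if chunk == 1:
        # second iteration: the list bound in the first iteration is
        # still reachable, so its objects (open files!) are still alive
        assert survivors is not None and len(survivors) == 3
    survivors = objs

# even after the loop has ended, the last list is still referenced
assert len(survivors) == 3
```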

This seems to be fixed if I delete the list at the end of the loop:

import glob
import ROOT

def chunkify(lst,n):
    return [lst[i::n] for i in range(n)]

max_files = 3000
filenames = glob.glob('input/*.root')[:max_files]
chunk_list = chunkify(filenames, 4)
for chunk in chunk_list:
    handle_dict = {}
    for idx, file in enumerate(chunk):
        handle_dict[file] = []
        df = ROOT.RDataFrame('tree', file)
        h_name = "h_%s" %idx
        h_res = df.Histo1D((h_name, "h", 10, 0, 100), "observable", "weight")
        handle_dict[file].append(h_res)

    handle_list_to_run = []
    for handles in handle_dict.values():
        handle_list_to_run += handles
    ROOT.RDF.RunGraphs(handle_list_to_run)

    outputfile = ROOT.TFile("output/file1.root", "recreate")
    for handle_list in handle_dict.values():
        for hist_handle in handle_list:
            hist = hist_handle.GetValue()
            hist.Write()
    outputfile.Close()
    del handle_list_to_run # <--- this line
print("Ended")
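An alternative to the explicit del, sketched here in plain Python rather than with ROOT, is to move each chunk’s processing into a function: every handle then becomes a local variable that is released automatically when the function returns (in CPython, immediately via reference counting):

```python
import weakref

class Handle:
    """Toy stand-in for an RResultPtr that keeps a file open while alive."""
    pass

def process_chunk(chunk):
    # `handles` is local: it (and everything it references) is released
    # when the function returns, so no explicit `del` is needed
    handles = [Handle() for _ in chunk]
    return weakref.ref(handles[0])

probe = process_chunk(range(3))
assert probe() is None  # in CPython the handles were freed on return
```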

I still have some questions:

  1. Is this the correct way to deal with the lazy execution in my case?
  2. I don’t really understand where the limit on the number of open files comes from. In this simple reproducer it’s about ~1200 files, on the real files with 1 histogram per file it’s ~100 files, and with multiple histograms from 1 file it goes down to ~400 files. So how is this limit determined?
  3. I can’t recreate it in this reproducer, but the behavior on the real inputs is strange:
  • With the same number of files/histograms, every new chunk is processed more slowly than the previous one.
  • I’ve got a print statement at the very end of the script, and it takes ~1 minute for the program to go from this statement back to the CLI prompt.
    And this is happening without any obvious CPU throttling or memory leaking. What might be the cause of that?

Best regards,
Aleksandr

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.