RDataFrame list of file names

Hi,

Maybe it is a trivial question, but I can’t find how to retrieve
the list of file names that a wildcard expression matched in a directory, for an expression like

 ROOT::RDataFrame df("mytree", "/my/path/*.root");

Is there functionality, perhaps via a lower-level interface, to get the list of files processed? The idea is to be able to monitor what is happening when processing a large number of files.

Also, is there a way to get a more “verbose” output from RDataFrame when it is actually chaining/processing the files?

Thanks,
Balint


ROOT Version: 6.16.00
Platform: Mac OS 10.14.3
Compiler: Not Provided


Hi @radbalint ,
the functionality does not exist because RDataFrame does not implement glob parsing: the glob "/my/path/*.root" is just passed along to TChain with no modification. The documentation could be clearer about that, I’ll see what we can do.

To get the list of files that the glob will expand to, you can go through a TChain:

TChain c("mytree");
c.Add("/my/path/*.root");
// each element is a TChainElement: GetName() is the tree name, GetTitle() is the file name
for (auto file : *c.GetListOfFiles())
   std::cout << file->GetTitle() << std::endl;

If you think this workaround is clunky and RDataFrame should, in fact, implement a GetInputFiles() method, please open a feature request ticket on JIRA.

Regarding RDF verbose mode: it’s currently missing, and adding it is in ROOT’s plan of work for 2019.
Most probably, it will be worked on during the summer.
In the meantime, the quick and dirty solution is to hack a Filter to print something:

df.Filter([](ULong64_t e) { std::cout << "processing entry " << e << '\n'; return true; }, {"rdfentry_"});

while the proper, but clunkier, way to do it is to use RResultPtr::OnPartialResult callbacks, as shown in the tutorial $ROOTSYS/tutorials/dataframe/df013_InspectAnalysis.C:

auto eventCount = df.Count();
eventCount.OnPartialResult(/*every=*/100,
                           [](ULong64_t e) { std::cout << "processing entry " << e << '\n'; });

This registers a lambda that prints progress and is executed every 100 entries processed by one of the threads. You can check or copy-paste the actual code in the tutorial I linked to see how to implement a proper thread-safe progress bar (which of course comes with a small performance penalty).

Cheers,
Enrico

Thank you very much, Enrico!
