Maybe it is a trivial question, but I can’t easily find how to retrieve
the list of filenames that the pattern matched in a directory for an expression like
ROOT::RDataFrame df("mytree", "/my/path/*.root");
Is there functionality to get the list of files processed, perhaps via a lower-level interface? The idea is to be able to monitor what is happening when a large number of files is processed.
Also, is there a way to get a more “verbose” output from RDataFrame when it is actually chaining/processing the files?
Thanks,
Balint
ROOT Version: 6.16.00
Platform: Mac OS 10.14.3
Compiler: Not Provided
Hi @radbalint ,
the functionality does not exist because RDataFrame does not implement glob parsing: the glob "/my/path/*.root" is just passed along to TChain with no modification. The documentation could be clearer about that, I’ll see what we can do.
To get the list of files that the glob will expand to, you can go through a TChain:
If you think this workaround is clunky and RDataFrame should, in fact, implement a GetInputFiles() method, please open a feature request ticket on jira.
Regarding RDF verbose mode: it’s currently missing, and adding it is in ROOT’s plan of work for 2019.
Most probably, it will be worked on during the summer.
In the meantime, the quick and dirty solution is to hack a Filter to print something. A less hacky option is to register a callback on a result with OnPartialResult:
auto eventCount = df.Count();
eventCount.OnPartialResult(/*every=*/100,
[](ULong64_t e) { std::cout << "processing entry " << e << '\n'; });
This registers the lambda that prints progress to be executed every 100 entries processed by one of the threads. You can check/copy-paste the actual code in the tutorial I linked to implement a proper thread-safe progress bar (which of course comes with a little performance penalty).
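As a rough illustration of what "thread-safe" means for such a callback, here is a minimal sketch of a mutex-protected progress printer. It is not the code from the tutorial; the class name `ProgressPrinter` and its members are hypothetical, and the RDataFrame hook shown in the trailing comment assumes the `eventCount` result from the snippet above:

```cpp
// Sketch only: ProgressPrinter is a hypothetical helper, not part of ROOT
// and not the implementation used in the ROOT tutorial.
#include <cassert>
#include <iostream>
#include <mutex>

class ProgressPrinter {
public:
   // `every` is the number of entries between two consecutive callback calls.
   explicit ProgressPrinter(unsigned long long every) : fEvery(every)
   {
      assert(every > 0);
   }

   // Call once per batch of `every` entries; safe to call from several threads.
   void operator()()
   {
      std::lock_guard<std::mutex> lock(fMutex); // serialises counter update and printing
      fProcessed += fEvery;
      std::cout << "processed about " << fProcessed << " entries\n";
   }

   // Total number of entries reported so far, summed over all threads.
   unsigned long long Processed() const
   {
      std::lock_guard<std::mutex> lock(fMutex);
      return fProcessed;
   }

private:
   const unsigned long long fEvery;
   unsigned long long fProcessed = 0;
   mutable std::mutex fMutex;
};

// Possible hook into RDataFrame (requires ROOT, not compiled here):
//   ProgressPrinter progress(100);
//   eventCount.OnPartialResultSlot(100,
//      [&progress](unsigned int /*slot*/, ULong64_t & /*partial*/) { progress(); });
```

The lock is what costs a little performance: every thread has to wait its turn to update the shared counter and print, so a larger `every` keeps the overhead down.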