I am wondering whether it would be possible to implement a specific exception that can gracefully interrupt an RDataFrame loop. By gracefully I mean, for example, properly closing the output file produced by Snapshot() with the events created so far.
I find the option to create an empty dataframe and obtain a new tree via the Define method very useful, since it can simplify the creation of new trees. The limitation is that I need to know the total number of events in advance.
I have not investigated the RDataSource facility in depth, but at first glance it seems to have the same limitation.
This is why I think that an interruption mechanism based on exceptions (I cannot think of any other) that can be thrown from user code could be beneficial.
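To make the idea concrete, here is a minimal sketch in plain C++ of the behaviour I have in mind. It does not use RDataFrame at all: `StopEventLoop`, `OutputSink`, `run_until_stopped`, and `example_producer` are hypothetical names invented for illustration, and `OutputSink` is only a stand-in for the file written by Snapshot().

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <vector>

// Hypothetical exception type: user code throws it to request a clean stop.
struct StopEventLoop : std::exception {
    const char* what() const noexcept override { return "stop requested by user code"; }
};

// Stand-in for an output file: collects rows and records whether it was closed.
class OutputSink {
public:
    void write(double value) { rows_.push_back(value); }
    void close() { closed_ = true; }
    bool closed() const { return closed_; }
    const std::vector<double>& rows() const { return rows_; }

private:
    std::vector<double> rows_;
    bool closed_ = false;
};

// Calls `produce` for entry = 0, 1, 2, ... until it throws StopEventLoop.
// Entries written before the stop are kept, and the sink is closed afterwards.
// Any other exception propagates unchanged. The loop has no upper bound, so
// `produce` is expected to throw StopEventLoop eventually.
// Returns the number of entries written.
std::size_t run_until_stopped(const std::function<double(std::size_t)>& produce,
                              OutputSink& sink) {
    assert(produce && "produce callback must be set");
    std::size_t written = 0;
    try {
        for (std::size_t entry = 0;; ++entry) {
            // produce() is evaluated first, so a throw leaves the sink untouched.
            sink.write(produce(entry));
            ++written;
        }
    } catch (const StopEventLoop&) {
        // Graceful stop: fall through and close the sink normally.
    }
    sink.close();
    return written;
}

// Example producer: emits 0.5 * entry and asks to stop after five entries.
double example_producer(std::size_t entry) {
    if (entry >= 5) throw StopEventLoop{};
    return 0.5 * static_cast<double>(entry);
}
```

Calling `run_until_stopped(example_producer, sink)` leaves five rows in the sink and marks it closed, without the number of entries being known before the loop starts. Whether something like this can be mapped onto the real event loop is exactly my question.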
Thanks for the suggestions, which seem to do the job, and it is nice to learn that you are already planning a solution for future versions.
My idea of using exceptions came from the fact that sometimes the condition to gracefully interrupt the main loop could come from some special situation occurring in one of the end leaves of the evaluation graph… but I admit that I should come up with a more concrete case to demonstrate that the solution from Danilo is not viable.
About MT, at the moment I do not find it really beneficial for my lab-size projects. Maybe with the structure of the LHC data and the computing infrastructure available for the LHC you can see a big boost, but so far in all my cases, even with moderately intensive calculations, the bottleneck is just reading the data from disk, and I stopped trying it.
Interesting… RDF multi-threading parallelizes ROOT I/O too (e.g. different clusters of entries are decompressed in parallel), so it should be beneficial. You should see no speed-up only if a single thread already saturates the disk I/O bandwidth, which is usually not the case.
Is this with data on SSD, spinning disk or read via network?
Standard PC spinning disk; I assumed that the disk bandwidth was saturated.
But please do not infer from my statement that I ran any special benchmarks beyond timing the command that launches the script in the two cases (no MT and MT with 6 threads). I can probably collect more accurate information on some other occasion (I have always wanted to write a congratulation/feedback thread on the RDataFrame feature, but as usual “spare time” is the real issue); otherwise we go off-topic and waste real time on hypotheses.