Data processing with RDF, Snapshot creation, event loop possibly executed more than once

andrea.celentano · July 20, 2024, 10:08am

Dear colleagues,
I observed a behavior connected to RDataFrame that I was not able to explain. I am using the code in the attached file, compiled into an executable. My code processes a large number of files to produce a few output histograms and a Snapshot of the events passing all the selection criteria.

ana.cpp (2.2 KB)

When I execute the code via “./ana”, I see the event loop progressing as expected, as shown by the progress bar.

|====================================>              |   [Elapsed time: 0:23m  processing file: 15 / 312  processed evts: 850000 / 1161041  9.41e+04 evt/s 0:03m  remaining time (per file being processed)]   
|======================================>            |   [Elapsed time: 0:29m  processing file: 18 / 312  processed evts: 1082000 / 1393317  8.27e+04 evt/s 0:03m  remaining time (per file being processed)]  
|=======================================>           |   [Elapsed time: 0:33m  processing file: 20 / 312  processed evts: 1237000 / 1547666  7.80e+04 evt/s 0:03m  remaining time (per file being processed)]  
|========================================>          |   [Elapsed time: 0:35m  processing file: 21 / 312  processed evts: 1315000 / 1625060  7.37e+04 evt/s 0:04m  remaining time (per file being processed)]  
...

After ~5 minutes, all the files are processed. Up to this point, the output file size is still close to zero (242 B). After this, the program reports no further output; the output file size starts to increase very slowly, and after another ~5 minutes the code ends.

I do not understand this behavior: what is the code doing after the first event loop?
Maybe I am doing something wrong and the event loop is executed more than once in my code?

Thanks,
Bests,
Andrea Celentano

ROOT Version: 6.32.02
Platform: Centos9
Compiler: 13.1.0

Danilo · July 20, 2024, 1:53pm

Hi Andrea,

Nice example (perhaps I would suggest passing the node and the string to the doHisto function by reference).

The speed at which the memory buffers used for writing are filled, and therefore flushed to disk, depends on several factors, for example your data model, the number of threads, or the efficiency of the filters. Therefore I would not be too worried about how the size of the file on disk grows as a function of time (moreover, the OS sits in the middle of all this).
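As a rough illustration of the filter-efficiency point, here is a toy model in plain C++. It is not ROOT code: the class `ToyBufferedWriter`, the helper `BytesOnDiskBeforeClose`, the buffer size and the per-event payload are all made up for the example, and real output buffering also involves compression and more than one buffer. It only shows that a writer which flushes when its buffer is full can leave the file on disk nearly empty for a long time when few events pass the selection.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a buffered writer (hypothetical, not a ROOT class).
// Bytes reach the "disk" only when the in-memory buffer is full.
class ToyBufferedWriter {
public:
    explicit ToyBufferedWriter(std::size_t bufferBytes) : capacity_(bufferBytes) {
        assert(capacity_ > 0);
    }

    // Append one payload to the buffer and flush if the buffer is full.
    void Write(std::size_t bytes) {
        buffered_ += bytes;
        if (buffered_ >= capacity_) Flush();
    }

    // Move everything currently buffered to the "disk".
    void Flush() {
        onDisk_ += buffered_;
        buffered_ = 0;
    }

    std::size_t BytesOnDisk() const { return onDisk_; }
    std::size_t BytesBuffered() const { return buffered_; }

private:
    std::size_t capacity_;
    std::size_t buffered_ = 0;
    std::size_t onDisk_ = 0;
};

// Process nEvents events, keeping one event out of every keepEvery, and
// report how many bytes are on disk before the final flush at close time.
std::size_t BytesOnDiskBeforeClose(std::size_t nEvents, std::size_t keepEvery,
                                   std::size_t bytesPerEvent, std::size_t bufferBytes) {
    if (keepEvery == 0) return 0;  // nothing is selected in this toy
    ToyBufferedWriter writer(bufferBytes);
    for (std::size_t event = 0; event < nEvents; ++event) {
        const bool passesSelection = (event % keepEvery == 0);
        if (passesSelection) writer.Write(bytesPerEvent);
    }
    return writer.BytesOnDisk();
}
```

With these made-up numbers, `BytesOnDiskBeforeClose(1000, 100, 100, 32000)` returns 0 because the 10 selected events never fill the buffer, while `BytesOnDiskBeforeClose(1000, 1, 100, 32000)` returns 96000 for the same input with no selection.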

To check the number of loops, have you tried adding something like

Filter("if(rdfentry_ == 0) cout << \"loop\\n\"; return true ")

to print something at the first event?
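To make the idea behind that check concrete, here is a minimal sketch in plain C++. It is a toy stand-in for a lazily evaluated pipeline and not the RDataFrame API: `ToyPipeline`, `Book`, `Run` and the two helper functions are hypothetical names. It assumes that booked actions are only executed when results are requested, and that a marker in the filter fires once per pass over the data. Whether your ana.cpp behaves like the first or the second helper depends on the order of the calls in that file.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a lazily evaluated pipeline. This is NOT the RDataFrame API.
class ToyPipeline {
public:
    ToyPipeline(std::size_t nEntries, std::function<bool(std::size_t)> filter)
        : nEntries_(nEntries), filter_(std::move(filter)) {
        assert(filter_);  // the filter must be callable
    }

    // Book a per-entry action; nothing is executed yet.
    void Book(std::function<void(std::size_t)> action) {
        pending_.push_back(std::move(action));
    }

    // Request results: one pass over all entries serves every pending action.
    void Run() {
        if (pending_.empty()) return;
        for (std::size_t entry = 0; entry < nEntries_; ++entry) {
            if (!filter_(entry)) continue;
            for (auto &action : pending_) action(entry);
        }
        pending_.clear();
    }

private:
    std::size_t nEntries_;
    std::function<bool(std::size_t)> filter_;
    std::vector<std::function<void(std::size_t)>> pending_;
};

// Histogram and snapshot are both booked before any result is requested.
int LoopsWhenBookedTogether(std::size_t nEntries) {
    int loopMarker = 0;
    std::size_t histEntries = 0, snapshotRows = 0;
    ToyPipeline p(nEntries, [&loopMarker](std::size_t entry) {
        if (entry == 0) ++loopMarker;  // same idea as the rdfentry_ == 0 check
        return true;
    });
    p.Book([&histEntries](std::size_t) { ++histEntries; });
    p.Book([&snapshotRows](std::size_t) { ++snapshotRows; });
    p.Run();
    return loopMarker;
}

// The histogram result is requested before the snapshot is booked.
int LoopsWhenTriggeredSeparately(std::size_t nEntries) {
    int loopMarker = 0;
    std::size_t histEntries = 0, snapshotRows = 0;
    ToyPipeline p(nEntries, [&loopMarker](std::size_t entry) {
        if (entry == 0) ++loopMarker;
        return true;
    });
    p.Book([&histEntries](std::size_t) { ++histEntries; });
    p.Run();  // first pass: histogram only
    p.Book([&snapshotRows](std::size_t) { ++snapshotRows; });
    p.Run();  // second pass: snapshot only
    return loopMarker;
}
```

In this toy, the marker fires once when everything is booked up front and twice when the histogram is read before the snapshot is booked, which is the kind of difference the print statement is meant to expose.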

Let us know how it goes.

Cheers,
D
