Trying to handle RDataFrames

eguiraud · April 15, 2021, 3:58pm

Cool, I could compile it against ROOT master branch with these tweaks:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 33b508b..e2e416d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -44,7 +44,7 @@ endif()

 INCLUDE_DIRECTORIES( ${ROOT_INCLUDE_DIR} )
 LINK_DIRECTORIES( ${ROOT_LIBRARY_DIR} )
-SET(EXTRA_EXTERNAL_LIBRARIES ${EXTRA_EXTERNAL_LIBRARIES} Core MathCore Hist Gui Gpad Graf Tree Rint RIO Proof TreePlayer Physics RootDataFrame RootVecOps)
+SET(EXTRA_EXTERNAL_LIBRARIES ${EXTRA_EXTERNAL_LIBRARIES} Core MathCore Hist Gui Gpad Graf Tree Rint RIO Proof TreePlayer Physics ROOTDataFrame ROOTVecOps)

 set( gray_Imager_LIB_DIR ${CMAKE_INSTALL_PREFIX}/lib )
 set( gray_Imager_BIN_DIR ${CMAKE_INSTALL_PREFIX}/bin )
diff --git a/src/gi_reader.cpp b/src/gi_reader.cpp
index 1854c59..919dfd7 100644
--- a/src/gi_reader.cpp
+++ b/src/gi_reader.cpp
@@ -19,7 +19,7 @@ void gi_reader::process_analysis()
 {
     INFO_MESS << "Analysis running..." << ENDL;

-    const auto poolSize = ROOT::GetImplicitMTPoolSize();
+    const auto poolSize = ROOT::GetThreadPoolSize();
     const auto nSlots = 0 == poolSize ? 1 : poolSize;

     const auto everyN = _fnevent/nSlots/100;

Now how do I run it the same way you run it? RDF is not 3 times slower than TSelector, we’ll make it go fast

dudouet · April 15, 2021, 4:11pm

Great,

Should I use also these modification with my current root install (6.20/00) ?

To run the program, you just have to do:

gray-imager --file MyFile_-5cm_0000.root -r --gate 661/5

or if you want to execute it without multi thread:

gray-imager --file MyFile_-5cm_0000.root -r --gate 661/5 -n

eguiraud · April 15, 2021, 4:18pm

GetImplicitMTPoolSize was deprecated and renamed to GetThreadPoolSize (I think in 6.22) – so nothing important.

I am not sure how RootDataFrame compiles for you, I need ROOTDataFrame which is the exact case-sensitive library name.

Will let you know what I find.
Cheers,
Enrico

dudouet · April 15, 2021, 4:22pm

I am working on macos, perhaps that’s why he find the way to compile anyway… I have corrected it.

I just pushed the modification with the possibility to set the number of entries to treat. You don"t need it actually but if you want you can pull it.

eguiraud · April 15, 2021, 4:23pm

ah yes, I think by default the file system is case insensitive on macos. terrible imho

eguiraud · April 16, 2021, 3:46pm

Hi @dudouet ,
I could manage to get a 3.7x speed-up on the evts/s throughput reported by your super cool progress bar on my laptop for the single-thread case.

However none of the speed-up came from RDataFrame-related fixes, and the largest part came from switching from TVector3 to ROOT::Math::XYZVector, so I cannot say why the RDF version was slower than the TSelector version: the RDF version was spending all of its time in the analysis logic anyway, I just optimized that, but I assume the analysis logic is the same in the two cases.

Before (flamegraph)

Total 1000000 evts |==>.................| 11.00% [ 1.31e+03 evts/s, time left:   0h11min08s ]

After (flamegraph)

Total 1000000 evts |============>.......| 64.00% [ 4.84e+03 evts/s, time left:   0h01min23s ]

Here are some resources on flamegraphs: basically you have call stacks on the y axis and the width of each stack on the x axis tells you what % of runtime it took. They give you a quick overview of where CPUs spend their time.

Before my changes, 17% of the time was spent in TH3::Fill (good) and 68% of the time was spent in Define lambdas, but mostly because of TVector3 constructions, destructions and copies.
With my changes, 52% of the time is spent in TH3::Fill (much better) and 36% in Define lambdas, now divided between mathematical operations between cartesian vectors, constructing cartesian vectors and calling push_back on a RVec<float> (I think because of get_int_t1t2, get_int_xpos, get_int_ypos and get_int_zpos).

Here are my patches: patches.tgz (3.7 KB) .

Now some questions:

are we fast enough now?
how important it is that this usecase scales well to more than 4 cores? there might be some room for improvement there (I can’t really measure it because the input file I have has only 6 TTree clusters, i.e. the maximum amount of parallel tasks RDF splits it into is 6)
since the bottlenecks where mostly in a single lambda function, get_proj, and runtime was pretty much all spent either there or in TH3::Fill: what is the TSelector-based version of the analysis doing differently to run so much faster?
can you please upgrade our tutorial here to use your awesome progress bar? The one I had implemented is much worse

Cheers,
Enrico

EDIT:
the line _fnevent(*fdata_frame.Count()) can be very costly, it performs a full event loop to do the count (because in general you might have a Filter before the Count)!! It’s much better to use TChain::GetEntries or TTree::GetEntries instead.

dudouet · April 19, 2021, 7:22am

Hi,

Thanks a lot for your help. I will go step by step to answer to your questions. First, I am not able to compile my code with your patches. First, I had this error:
error: 'auto' not allowed in lambda parameter
I guess this is because I need to compile the code using c++14 right ? I have tried to add it in my cmake, but I have plenty of other erros, is it because I need to compile ROOT also using c++14 ? And if yes, how can I do this ? Or do I need a root version more recent than 6.20 ?

[0/1] Re-running CMake...
ROOT found.
        --> ROOTSYS=/Users/dudouet/Softs/ROOT/install/v6-20-00
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/build
[3/7] Building CXX object CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o
FAILED: CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o 
/Library/Developer/CommandLineTools/usr/bin/c++   -I/Users/dudouet/Softs/ROOT/install/v6-20-00/include -I/Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src -fpic -std=c++14 -pthread -Wno-unused-parameter -Wno-sign-compare -Wno-invalid-offsetof -Wno-unused-result  -std=c++11 -m64 -pipe -fsigned-char -fno-common -Qunused-arguments -pthread -stdlib=libc++  -std=c++11 -m64 -pipe -fsigned-char -fno-common -Qunused-arguments -pthread -stdlib=libc++ -g -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk   -std=gnu++14 -MD -MT CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o -MF CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o.d -o CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o -c /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:96:24: error: no template named 'integer_sequence'; did you mean '__integer_sequence'?
using index_sequence = integer_sequence<size_t, _Ip...>;
                       ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tuple:93:8: note: '__integer_sequence' declared here
struct __integer_sequence {
       ^
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:100:65: error: reference to 'integer_sequence' is ambiguous
   typename ROOT::Detail::__make<_Np>::type::template __convert<integer_sequence, _Tp>;
                                                                ^
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:89:8: note: candidate found by name lookup is 'std::integer_sequence'
struct integer_sequence {
       ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/utility:868:29: note: candidate found by name lookup is 'std::__1::integer_sequence'
struct _LIBCPP_TEMPLATE_VIS integer_sequence
                            ^
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:100:65: error: unknown type name 'integer_sequence'
   typename ROOT::Detail::__make<_Np>::type::template __convert<integer_sequence, _Tp>;

eguiraud · April 19, 2021, 7:25am

You can either get a C++14 ROOT (which you have to rebuild from scratch, you can’t change the C++ standard of an existing build) or maybe more simply you can just put the actual type of the variable instead of the auto in the lambda

dudouet · April 19, 2021, 7:32am

in this case, I got this error:

/Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.cpp:170:62: error: no matching function for call to 'Map'
    auto get_etot       = [] (const RVecTrack &vec) { return Map(vec, [] (track_entry &i) { return i.E; }); };
                                                             ^~~
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RVec.hxx:909:6: note: candidate template ignored: substitution failure [with Args = <const ROOT::VecOps::RVec<track_entry> &, (lambda at /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.cpp:170:71)>]: no matching function for call to 'MapFromTuple'
auto Map(Args &&... args)

eguiraud · April 19, 2021, 7:35am

I think you need const track_entry &i (i.e. with the extra const) for it to compile (terrible template compiler errors…)

dudouet · April 19, 2021, 7:44am

ok great, it seems to work. I will try the code like this and then I will go to your different questions.

dudouet · April 19, 2021, 8:32am

I am a bit confused… I have applied your patches, and I don’t see from my side real improvement. I am surprised, because in mono thread, I have rates around 350 evts/s. You say that you have 4.8e3 evts/s in single threads… what laptop do you have !

I have created a new branch on my git, named CodeOptimization. Can you checked that we have no differences between our codes ?

eguiraud · April 19, 2021, 8:40am

Are both ROOT and your code compiled with optimizations?

I have a nice laptop but 350 evts/s is fairly terrible.

is the CPU running at 100% all the time? you can check e.g. via htop or by running the program under /usr/bin/time
if not, it’s likely an I/O bottleneck (but 350 evts/s suggests super slow network rather than slow disk). can you read that file faster than that in any other way?
if yes, can you run perf record --call-graph dwarf -F99 ./gray-imager/build/gray-imager --file data/MyFile_-5cm_0000.root -r --gate 661/5 -n, then press ctrl-C after 15 seconds or so, then run perf script > for_enrico.txt and share that file with me?

Cheers,
Enrico

EDIT:
I tried your branch, compiled against ROOT’s master branch (only had to tell cmake to compile everything with C++14 because my ROOT build is C++14). On my laptop I get 5e3 evts/s with the input file you shared.

dudouet · April 19, 2021, 8:53am

Hum, I am not sure that perf is available on macos.

that’s really strange because my macbook pro is 1 years old, and I have a ssd disk. for the root and my code installation, I did not specified any specific compilation optimization. I just compile it with the command:

cmake -G Ninja ../sources

For the root installation, I also used the standard compilation, without any specific flags

dudouet · April 19, 2021, 8:58am

ok… it was compiled in Debug option …

eguiraud · April 19, 2021, 9:03am

Argh, then it’s actually hard to say whether TSelector was faster in the first place.

dudouet · April 19, 2021, 9:17am

Yes… totally agree. Really sorry for that. Anyway this allowed you to optimize my code

I will make some more reasonable comparisons now and let you informed. For the other points:

Regarding the definition of entries to treat:
When I create my gi_reader class, if I give as input a TChain containing a TEntryList build using something like this: mychain->Draw(">>elist", "", "entrylist",N).
The rdataframe will automatically take only these entries or do I have something else to do then ? Like this is should allows to process a range of entries using or not the multi-thread right ? And like this I can easilly remove the bad *Count() in my constructor using _achain->GetEntryList()->GetN(). Correct ?
Regarding the TVector3:
I was not aware of these ROOT::Math::XYZVector. Why are they so more efficient ? Is there any reason to use TVector3 instead of these ?
Regarding the multi-thread:
what do you mean by “has only 6 TTrees clusters”. How this number is defined ? This is a property of the TTree ? Because in my real analysis, I will have to read something like 50GB of data, so using much more than 6 threads will be very important for me.
Finally, for the progress bar, no problem of course, but this is a home made progress bar class, this can probably be improved… should the class be directly included in the maco file ?

eguiraud · April 19, 2021, 9:48am

yes, if that’s not the case there is something wrong somewhere
there are a bunch of reasons why the types in GenVector are faster (see the docs) – in general they are designed as a performant substitute to the old TVector3, TLorentzVector, … classes. In your particular case, it’s because they don’t have to call the expensive TObject ctor/dtor (which I think allocates/deallocates on the heap)
a TTree cluster is a bunch of TTree entries that are zipped together. The TTree schema and the TTree settings influence how large a TTree cluster is. RDataFrame parallelizes over TTree clusters (it is the smallest number of entries that it makes sense to process in a single task, because if you split a TTree cluster between two tasks each task will have to decompress the whole cluster to read its data). E.g. rootls -t <file.root> returns the number of TTree clusters in recent ROOT versions. A 50GB file will have more clusters than that!
I think it would be nice to have a helper in RDFHelpers.hxx that works like this auto count_with_progress_bar = ROOT::RDF::AddProgressBar(df), i.e. takes a dataframe, adds a Count with the progress bar added via OnPartialResultSlot, and returns the Count result. It should be easy to add – if not, also adding your progress bar to the tutorial macro works, then I’ll move it in RDFHelpers myself.

EDIT:
note that the big difference in runtime between TVector3 and XYZVector is because you are constructing and destructing more than 50k of those per event!

system · May 3, 2021, 9:48am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.