Hi @dudouet ,
I could manage to get a 3.7x speed-up on the evts/s throughput reported by your super cool progress bar on my laptop for the single-thread case.
However none of the speed-up came from RDataFrame-related fixes, and the largest part came from switching from TVector3 to ROOT::Math::XYZVector, so I cannot say why the RDF version was slower than the TSelector version: the RDF version was spending all of its time in the analysis logic anyway, I just optimized that, but I assume the analysis logic is the same in the two cases.
Total 1000000 evts |============>.......| 64.00% [ 4.84e+03 evts/s, time left: 0h01min23s ]
Here are some resources on flamegraphs: basically you have call stacks on the y axis and the width of each stack on the x axis tells you what % of runtime it took. They give you a quick overview of where CPUs spend their time.
Before my changes, 17% of the time was spent in TH3::Fill (good) and 68% of the time was spent in Define lambdas, but mostly because of TVector3 constructions, destructions and copies.
With my changes, 52% of the time is spent in TH3::Fill (much better) and 36% in Define lambdas, now divided between mathematical operations between cartesian vectors, constructing cartesian vectors and calling push_back on a RVec<float> (I think because of get_int_t1t2, get_int_xpos, get_int_ypos and get_int_zpos).
how important it is that this usecase scales well to more than 4 cores? there might be some room for improvement there (I canāt really measure it because the input file I have has only 6 TTree clusters, i.e. the maximum amount of parallel tasks RDF splits it into is 6)
since the bottlenecks where mostly in a single lambda function, get_proj, and runtime was pretty much all spent either there or in TH3::Fill: what is the TSelector-based version of the analysis doing differently to run so much faster?
can you please upgrade our tutorial here to use your awesome progress bar? The one I had implemented is much worse
Cheers,
Enrico
EDIT:
the line _fnevent(*fdata_frame.Count()) can be very costly, it performs a full event loop to do the count (because in general you might have a Filter before the Count)!! Itās much better to use TChain::GetEntries or TTree::GetEntries instead.
Thanks a lot for your help. I will go step by step to answer to your questions. First, I am not able to compile my code with your patches. First, I had this error: error: 'auto' not allowed in lambda parameter
I guess this is because I need to compile the code using c++14 right ? I have tried to add it in my cmake, but I have plenty of other erros, is it because I need to compile ROOT also using c++14 ? And if yes, how can I do this ? Or do I need a root version more recent than 6.20 ?
[0/1] Re-running CMake...
ROOT found.
--> ROOTSYS=/Users/dudouet/Softs/ROOT/install/v6-20-00
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/build
[3/7] Building CXX object CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o
FAILED: CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o
/Library/Developer/CommandLineTools/usr/bin/c++ -I/Users/dudouet/Softs/ROOT/install/v6-20-00/include -I/Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src -fpic -std=c++14 -pthread -Wno-unused-parameter -Wno-sign-compare -Wno-invalid-offsetof -Wno-unused-result -std=c++11 -m64 -pipe -fsigned-char -fno-common -Qunused-arguments -pthread -stdlib=libc++ -std=c++11 -m64 -pipe -fsigned-char -fno-common -Qunused-arguments -pthread -stdlib=libc++ -g -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -std=gnu++14 -MD -MT CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o -MF CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o.d -o CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o -c /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:96:24: error: no template named 'integer_sequence'; did you mean '__integer_sequence'?
using index_sequence = integer_sequence<size_t, _Ip...>;
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tuple:93:8: note: '__integer_sequence' declared here
struct __integer_sequence {
^
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:100:65: error: reference to 'integer_sequence' is ambiguous
typename ROOT::Detail::__make<_Np>::type::template __convert<integer_sequence, _Tp>;
^
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:89:8: note: candidate found by name lookup is 'std::integer_sequence'
struct integer_sequence {
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/utility:868:29: note: candidate found by name lookup is 'std::__1::integer_sequence'
struct _LIBCPP_TEMPLATE_VIS integer_sequence
^
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:100:65: error: unknown type name 'integer_sequence'
typename ROOT::Detail::__make<_Np>::type::template __convert<integer_sequence, _Tp>;
You can either get a C++14 ROOT (which you have to rebuild from scratch, you canāt change the C++ standard of an existing build) or maybe more simply you can just put the actual type of the variable instead of the auto in the lambda
/Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.cpp:170:62: error: no matching function for call to 'Map'
auto get_etot = [] (const RVecTrack &vec) { return Map(vec, [] (track_entry &i) { return i.E; }); };
^~~
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RVec.hxx:909:6: note: candidate template ignored: substitution failure [with Args = <const ROOT::VecOps::RVec<track_entry> &, (lambda at /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.cpp:170:71)>]: no matching function for call to 'MapFromTuple'
auto Map(Args &&... args)
I am a bit confusedā¦ I have applied your patches, and I donāt see from my side real improvement. I am surprised, because in mono thread, I have rates around 350 evts/s. You say that you have 4.8e3 evts/s in single threadsā¦ what laptop do you have !
I have created a new branch on my git, named CodeOptimization. Can you checked that we have no differences between our codes ?
Are both ROOT and your code compiled with optimizations?
I have a nice laptop but 350 evts/s is fairly terrible.
is the CPU running at 100% all the time? you can check e.g. via htop or by running the program under /usr/bin/time
if not, itās likely an I/O bottleneck (but 350 evts/s suggests super slow network rather than slow disk). can you read that file faster than that in any other way?
if yes, can you run perf record --call-graph dwarf -F99 ./gray-imager/build/gray-imager --file data/MyFile_-5cm_0000.root -r --gate 661/5 -n, then press ctrl-C after 15 seconds or so, then run perf script > for_enrico.txt and share that file with me?
Cheers,
Enrico
EDIT:
I tried your branch, compiled against ROOTās master branch (only had to tell cmake to compile everything with C++14 because my ROOT build is C++14). On my laptop I get 5e3 evts/s with the input file you shared.
Hum, I am not sure that perf is available on macos.
thatās really strange because my macbook pro is 1 years old, and I have a ssd disk. for the root and my code installation, I did not specified any specific compilation optimization. I just compile it with the command:
cmake -G Ninja ../sources
For the root installation, I also used the standard compilation, without any specific flags
Yesā¦ totally agree. Really sorry for that. Anyway this allowed you to optimize my code
I will make some more reasonable comparisons now and let you informed. For the other points:
Regarding the definition of entries to treat:
When I create my gi_reader class, if I give as input a TChain containing a TEntryList build using something like this: mychain->Draw(">>elist", "", "entrylist",N).
The rdataframe will automatically take only these entries or do I have something else to do then ? Like this is should allows to process a range of entries using or not the multi-thread right ? And like this I can easilly remove the bad *Count() in my constructor using _achain->GetEntryList()->GetN(). Correct ?
Regarding the TVector3:
I was not aware of these ROOT::Math::XYZVector. Why are they so more efficient ? Is there any reason to use TVector3 instead of these ?
Regarding the multi-thread:
what do you mean by āhas only 6 TTrees clustersā. How this number is defined ? This is a property of the TTree ? Because in my real analysis, I will have to read something like 50GB of data, so using much more than 6 threads will be very important for me.
Finally, for the progress bar, no problem of course, but this is a home made progress bar class, this can probably be improvedā¦ should the class be directly included in the maco file ?
yes, if thatās not the case there is something wrong somewhere
there are a bunch of reasons why the types in GenVector are faster (see the docs) ā in general they are designed as a performant substitute to the old TVector3, TLorentzVector, ā¦ classes. In your particular case, itās because they donāt have to call the expensive TObject ctor/dtor (which I think allocates/deallocates on the heap)
a TTree cluster is a bunch of TTree entries that are zipped together. The TTree schema and the TTree settings influence how large a TTree cluster is. RDataFrame parallelizes over TTree clusters (it is the smallest number of entries that it makes sense to process in a single task, because if you split a TTree cluster between two tasks each task will have to decompress the whole cluster to read its data). E.g. rootls -t <file.root> returns the number of TTree clusters in recent ROOT versions. A 50GB file will have more clusters than that!
I think it would be nice to have a helper in RDFHelpers.hxx that works like this auto count_with_progress_bar = ROOT::RDF::AddProgressBar(df), i.e. takes a dataframe, adds a Count with the progress bar added via OnPartialResultSlot, and returns the Count result. It should be easy to add ā if not, also adding your progress bar to the tutorial macro works, then Iāll move it in RDFHelpers myself.
EDIT:
note that the big difference in runtime between TVector3 and XYZVector is because you are constructing and destructing more than 50k of those per event!