Trying to handle RDataFrames

Cool, I could compile it against the ROOT master branch with these tweaks:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 33b508b..e2e416d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -44,7 +44,7 @@ endif()

 INCLUDE_DIRECTORIES( ${ROOT_INCLUDE_DIR} )
 LINK_DIRECTORIES( ${ROOT_LIBRARY_DIR} )
-SET(EXTRA_EXTERNAL_LIBRARIES ${EXTRA_EXTERNAL_LIBRARIES} Core MathCore Hist Gui Gpad Graf Tree Rint RIO Proof TreePlayer Physics RootDataFrame RootVecOps)
+SET(EXTRA_EXTERNAL_LIBRARIES ${EXTRA_EXTERNAL_LIBRARIES} Core MathCore Hist Gui Gpad Graf Tree Rint RIO Proof TreePlayer Physics ROOTDataFrame ROOTVecOps)

 set( gray_Imager_LIB_DIR ${CMAKE_INSTALL_PREFIX}/lib )
 set( gray_Imager_BIN_DIR ${CMAKE_INSTALL_PREFIX}/bin )
diff --git a/src/gi_reader.cpp b/src/gi_reader.cpp
index 1854c59..919dfd7 100644
--- a/src/gi_reader.cpp
+++ b/src/gi_reader.cpp
@@ -19,7 +19,7 @@ void gi_reader::process_analysis()
 {
     INFO_MESS << "Analysis running..." << ENDL;

-    const auto poolSize = ROOT::GetImplicitMTPoolSize();
+    const auto poolSize = ROOT::GetThreadPoolSize();
     const auto nSlots = 0 == poolSize ? 1 : poolSize;

     const auto everyN = _fnevent/nSlots/100;

Now, how do I run it the same way you do? RDF should not be 3 times slower than TSelector; we'll make it go fast :slight_smile:

Great,

Should I also use these modifications with my current ROOT install (6.20/00)?

To run the program, you just have to do:

gray-imager --file MyFile_-5cm_0000.root -r --gate 661/5

or, if you want to execute it without multi-threading:

gray-imager --file MyFile_-5cm_0000.root -r --gate 661/5 -n

GetImplicitMTPoolSize was deprecated and renamed to GetThreadPoolSize (I think in 6.22), so nothing important.

I am not sure how RootDataFrame compiles for you; I need ROOTDataFrame, which is the exact case-sensitive library name.

Will let you know what I find.
Cheers,
Enrico

I am working on macOS, perhaps that's why it found a way to compile anyway... I have corrected it.

I just pushed a modification that adds the possibility to set the number of entries to process. You don't actually need it, but you can pull it if you want.

ah yes, I think the file system is case-insensitive by default on macOS. Terrible imho :sweat_smile:

Hi @dudouet ,
I managed to get a 3.7x speed-up in the evts/s throughput reported by your super cool progress bar on my laptop, for the single-thread case.

However, none of the speed-up came from RDataFrame-related fixes, and the largest part came from switching from TVector3 to ROOT::Math::XYZVector. So I cannot say why the RDF version was slower than the TSelector version: the RDF version was spending all of its time in the analysis logic anyway, and I just optimized that, but I assume the analysis logic is the same in the two cases.

Before (flamegraph)

Total 1000000 evts |==>.................| 11.00% [ 1.31e+03 evts/s, time left:   0h11min08s ]

After (flamegraph)

Total 1000000 evts |============>.......| 64.00% [ 4.84e+03 evts/s, time left:   0h01min23s ]

Here are some resources on flamegraphs: basically you have call stacks on the y axis and the width of each stack on the x axis tells you what % of runtime it took. They give you a quick overview of where CPUs spend their time.

Before my changes, 17% of the time was spent in TH3::Fill (good) and 68% of the time was spent in Define lambdas, but mostly because of TVector3 constructions, destructions and copies.
With my changes, 52% of the time is spent in TH3::Fill (much better) and 36% in Define lambdas, now divided between mathematical operations on Cartesian vectors, constructing Cartesian vectors, and calling push_back on an RVec<float> (I think because of get_int_t1t2, get_int_xpos, get_int_ypos and get_int_zpos).

Here are my patches: patches.tgz (3.7 KB).

Now some questions:

  • are we fast enough now?
  • how important is it that this usecase scales well to more than 4 cores? There might be some room for improvement there (I can't really measure it because the input file I have has only 6 TTree clusters, i.e. the maximum number of parallel tasks RDF splits it into is 6)
  • since the bottlenecks were mostly in a single lambda function, get_proj, and runtime was pretty much all spent either there or in TH3::Fill: what is the TSelector-based version of the analysis doing differently to run so much faster?
  • can you please upgrade our tutorial here to use your awesome progress bar? :smiley: The one I had implemented is much worse

Cheers,
Enrico

EDIT:
the line _fnevent(*fdata_frame.Count()) can be very costly: it performs a full event loop just to do the count (because, in general, you might have a Filter before the Count)! It's much better to use TChain::GetEntries or TTree::GetEntries instead.
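A sketch of the two alternatives, assuming ROOT headers are available (the tree and file names here are placeholders, not taken from the actual code):

```cpp
#include <ROOT/RDataFrame.hxx>
#include <TChain.h>

void count_entries_sketch() {
    TChain chain("TreeMaster");          // tree name is an assumption
    chain.Add("MyFile_-5cm_0000.root");

    ROOT::RDataFrame df(chain);

    // Costly: dereferencing the Count() result triggers a full event loop
    // over all entries just to produce the number.
    auto n_slow = *df.Count();

    // Cheap: the TChain/TTree already knows its entry count from metadata,
    // no event loop is needed.
    auto n_fast = chain.GetEntries();
    (void)n_slow; (void)n_fast;
}
```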

Hi,

Thanks a lot for your help. I will go step by step to answer your questions. First, I am not able to compile my code with your patches; I had this error:
error: 'auto' not allowed in lambda parameter
I guess this is because I need to compile the code using C++14, right? I have tried to add it in my CMake, but then I get plenty of other errors. Is it because I need to compile ROOT using C++14 as well? And if yes, how can I do this? Or do I need a ROOT version more recent than 6.20?

[0/1] Re-running CMake...
ROOT found.
        --> ROOTSYS=/Users/dudouet/Softs/ROOT/install/v6-20-00
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/build
[3/7] Building CXX object CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o
FAILED: CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o 
/Library/Developer/CommandLineTools/usr/bin/c++   -I/Users/dudouet/Softs/ROOT/install/v6-20-00/include -I/Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src -fpic -std=c++14 -pthread -Wno-unused-parameter -Wno-sign-compare -Wno-invalid-offsetof -Wno-unused-result  -std=c++11 -m64 -pipe -fsigned-char -fno-common -Qunused-arguments -pthread -stdlib=libc++  -std=c++11 -m64 -pipe -fsigned-char -fno-common -Qunused-arguments -pthread -stdlib=libc++ -g -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk   -std=gnu++14 -MD -MT CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o -MF CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o.d -o CMakeFiles/gray-imager.dir/src/gray-imager.cpp.o -c /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:96:24: error: no template named 'integer_sequence'; did you mean '__integer_sequence'?
using index_sequence = integer_sequence<size_t, _Ip...>;
                       ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tuple:93:8: note: '__integer_sequence' declared here
struct __integer_sequence {
       ^
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:100:65: error: reference to 'integer_sequence' is ambiguous
   typename ROOT::Detail::__make<_Np>::type::template __convert<integer_sequence, _Tp>;
                                                                ^
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:89:8: note: candidate found by name lookup is 'std::integer_sequence'
struct integer_sequence {
       ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/utility:868:29: note: candidate found by name lookup is 'std::__1::integer_sequence'
struct _LIBCPP_TEMPLATE_VIS integer_sequence
                            ^
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gray-imager.cpp:8:
In file included from /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.h:5:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDataFrame.hxx:20:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/RInterface.hxx:15:
In file included from /Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RDF/ActionHelpers.hxx:15:
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RIntegerSequence.hxx:100:65: error: unknown type name 'integer_sequence'
   typename ROOT::Detail::__make<_Np>::type::template __convert<integer_sequence, _Tp>;
                   

You can either get a C++14 ROOT (which you have to rebuild from scratch; you can't change the C++ standard of an existing build), or maybe more simply you can just put the actual type of the variable instead of auto in the lambda.

in this case, I got this error:

/Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.cpp:170:62: error: no matching function for call to 'Map'
    auto get_etot       = [] (const RVecTrack &vec) { return Map(vec, [] (track_entry &i) { return i.E; }); };
                                                             ^~~
/Users/dudouet/Softs/ROOT/install/v6-20-00/include/ROOT/RVec.hxx:909:6: note: candidate template ignored: substitution failure [with Args = <const ROOT::VecOps::RVec<track_entry> &, (lambda at /Users/dudouet/Analysis/AGATA/Imaging/Analyse/gray-imager/source/src/gi_reader.cpp:170:71)>]: no matching function for call to 'MapFromTuple'
auto Map(Args &&... args)

I think you need const track_entry &i (i.e. with the extra const) for it to compile (terrible template compiler errors...)

ok great, it seems to work. I will try the code like this and then I will get back to your different questions.

I am a bit confused... I have applied your patches, and I don't see any real improvement on my side. I am surprised because, in single-thread mode, I get rates around 350 evts/s. You say that you get 4.8e3 evts/s in a single thread... what laptop do you have! :astonished:

I have created a new branch on my git, named CodeOptimization. Can you check that there are no differences between our codes?

Are both ROOT and your code compiled with optimizations?

I have a nice laptop :grimacing: but 350 evts/s is fairly terrible.

  • is the CPU running at 100% all the time? you can check e.g. via htop or by running the program under /usr/bin/time
  • if not, it's likely an I/O bottleneck (but 350 evts/s suggests a super slow network rather than a slow disk). Can you read that file faster than that in any other way?
  • if yes, can you run perf record --call-graph dwarf -F99 ./gray-imager/build/gray-imager --file data/MyFile_-5cm_0000.root -r --gate 661/5 -n, then press ctrl-C after 15 seconds or so, then run perf script > for_enrico.txt and share that file with me?

Cheers,
Enrico

EDIT:
I tried your branch, compiled against ROOT's master branch (I only had to tell CMake to compile everything with C++14, because my ROOT build is C++14). On my laptop I get 5e3 evts/s with the input file you shared.

Hmm, I am not sure that perf is available on macOS.

that's really strange, because my MacBook Pro is one year old and I have an SSD disk. For ROOT and my code, I did not specify any particular compilation optimization; I just compile with the command:

cmake -G Ninja ../sources

For the ROOT installation, I also used the standard compilation, without any specific flags.

ok... it was compiled in Debug mode... :frowning:

Argh, then it's actually hard to say whether TSelector was faster in the first place.

Yes... I totally agree. Really sorry about that. Anyway, this allowed you to optimize my code :smiley:

I will make some more reasonable comparisons now and keep you informed. For the other points:

  1. Regarding the definition of the entries to process:
    When I create my gi_reader class, if I give as input a TChain containing a TEntryList built using something like mychain->Draw(">>elist", "", "entrylist", N), will the RDataFrame automatically take only these entries, or do I have something else to do? This should allow me to process a range of entries, with or without multi-threading, right? And this way I can easily remove the bad *Count() in my constructor, using _achain->GetEntryList()->GetN() instead. Correct?

  2. Regarding TVector3:
    I was not aware of ROOT::Math::XYZVector. Why is it so much more efficient? Is there any reason to use TVector3 instead?

  3. Regarding multi-threading:
    What do you mean by "has only 6 TTree clusters"? How is this number defined? Is it a property of the TTree? In my real analysis I will have to read something like 50 GB of data, so using many more than 6 threads will be very important for me.

  4. Finally, for the progress bar, no problem of course, but it is a home-made progress bar class that can probably be improved... should the class be directly included in the macro file?

  1. yes; if that's not the case, there is something wrong somewhere
  2. there are a bunch of reasons why the types in GenVector are faster (see the docs); in general they are designed as a performant substitute for the old TVector3, TLorentzVector, ... classes. In your particular case, it's because they don't have to call the expensive TObject constructor/destructor (which I think allocates/deallocates on the heap)
  3. a TTree cluster is a bunch of TTree entries that are zipped together. The TTree schema and the TTree settings influence how large a TTree cluster is. RDataFrame parallelizes over TTree clusters (a cluster is the smallest number of entries that it makes sense to process in a single task, because if you split a TTree cluster between two tasks, each task has to decompress the whole cluster to read its data). E.g. rootls -t <file.root> returns the number of TTree clusters in recent ROOT versions. A 50 GB file will have many more clusters than that!
  4. I think it would be nice to have a helper in RDFHelpers.hxx that works like auto count_with_progress_bar = ROOT::RDF::AddProgressBar(df), i.e. takes a dataframe, adds a Count with the progress bar attached via OnPartialResultSlot, and returns the Count result. It should be easy to add; if not, adding your progress bar to the tutorial macro also works, and then I'll move it into RDFHelpers myself.
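A sketch of the helper described in point 4. OnPartialResultSlot is a real RResultPtr method, but AddProgressBar itself is hypothetical here (it does not exist in ROOT at this point), and the printing logic is a placeholder for the thread's progress bar class:

```cpp
#include <ROOT/RDataFrame.hxx>
#include <cstdio>

// Hypothetical helper: books a Count on the dataframe, attaches a callback
// that fires every `everyN` events per slot, and returns the Count result.
template <typename RDF>
auto AddProgressBar(RDF df, ULong64_t everyN = 100000) {
    auto count = df.Count();
    count.OnPartialResultSlot(everyN, [](unsigned int slot, ULong64_t &c) {
        // Runs concurrently from worker threads; c is that slot's partial
        // count, so a real progress bar should aggregate across slots.
        std::printf("slot %u: ~%llu events so far\n", slot,
                    (unsigned long long)c);
    });
    return count;
}
```

Usage would look like auto c = AddProgressBar(df); followed by *c, which triggers the event loop and the progress printouts.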

EDIT:
note that the big difference in runtime between TVector3 and XYZVector is because you are constructing and destructing more than 50k of those per event!
