Cling JIT-ed code speed & interaction with shared libraries

Dear experts,

I have been running some tests comparing the execution speed of the same code compiled by cling/clang/gcc.

In the real project the speed differences have been one to two orders of magnitude, but the example I pushed to https://gitlab.cern.ch/olupton/clingtests has been tuned to be a bit more extreme (~4 orders of magnitude).

The code in that repository should just run (sh run.sh) on any modern-ish system with /cvmfs/sft.cern.ch mounted. It uses ROOT 6.18/04 from LCG_96b.

The output from an illustrative run of that script is shown here: https://gitlab.cern.ch/olupton/clingtests/blob/master/example_run.txt

The expression that’s evaluated is, in all cases, the function call operator of some deeply-nested template type that traverses the nested members. It should all be inlinable.
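
For concreteness, the structure is roughly of this shape (an illustrative sketch only; Leaf, Wrap and Nested are made-up names, and the real types in the repository are deeper and produced by a make() factory):

// Illustrative only -- not the actual types from the repository.
// A composed functor whose operator() just forwards through its members;
// a decent optimiser should be able to inline the whole chain.
struct Leaf {
  float value;
  float operator()() const { return value; }
};

template <typename Inner>
struct Wrap {
  Inner inner;
  float operator()() const { return inner() + 1.f; }
};

// Nested several levels deep, schematically what the factory builds:
using Nested = Wrap<Wrap<Wrap<Leaf>>>;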

The test setup contains a shared library (compiled with gcc) that provides a factory for an instance of this type, and I also create instances of the type using cling.

I then repeatedly evaluate them to measure the execution speed.
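
The measurement loop itself is essentially just this (a sketch rather than the code from the repository; the functor type and the tick reporting are stand-ins):

#include <chrono>
#include <cstdio>

// Sketch only: call the functor many times and report the elapsed clock ticks.
template <typename Callable>
void benchmark( Callable const& f, long n = 1000000 ) {
  float sum = 0.f;
  auto const start = std::chrono::steady_clock::now();
  for ( long i = 0; i < n; ++i ) sum += f();  // the call being timed
  auto const stop = std::chrono::steady_clock::now();
  std::printf( "%lld ticks (checksum %f)\n",
               static_cast<long long>( ( stop - start ).count() ), sum );
}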

I have enabled cling’s code optimisation (#pragma cling optimize(3)), which helps a bit (perhaps a factor of 2; it can be a bit hard to see in the test above, and in any case it is tiny compared to the cling/gcc difference).

My first question is simple: why is cling orders of magnitude slower, and is there anything I can do to improve things?

I was also interested to see that how and when I load the shared library has a significant impact on the execution speed of the code generated by cling, provided that the types match exactly.

Cling apparently finds and re-uses symbols from the shared library and manages to use the fast gcc-compiled code. This seems to make sense, but it is a little fragile:

  • If I run the same test natively on macOS (minus sourcing the LCG view environment), then loading the library has no impact on the execution speed of the code cling generates
  • It seems to depend on what has previously happened in the cling session (test4 in the repository)

So my second question is: is there anything I can do to make this behaviour less fragile?

On the face of it, this is a really nice feature that could help maximise the use of precompiled code linked/loaded into the application, but it risks being rather confusing to use if the execution speed varies by orders of magnitude depending on the order in which library loading and cling invocations occur.
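
To make the orderings concrete, the “fast” case looks schematically like this (a sketch rather than the actual code in run.sh; libTest.so is a placeholder library name):

#include <dlfcn.h>
#include "TInterpreter.h"

// Sketch of the "fast" ordering: load the gcc-compiled library first, then
// let cling declare code whose symbols the library already provides, so the
// JIT can bind to the compiled versions instead of generating its own.
void fast_ordering() {
  void* handle = dlopen( "libTest.so", RTLD_NOW | RTLD_GLOBAL );  // placeholder name
  gInterpreter->Declare( "#include \"test.h\"\n"
                         "auto fast_instance = make( 1.f );" );
  (void)handle;  // keep the library loaded for the lifetime of the test
}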

Thanks in advance for your help,

Olli

Hi Olli,

Thanks for your excellent report, and for investing the time!

This one is easy: cling assumes that

  • any symbol accessible from a shared library has the same functionality as one it could compile itself;
  • the library’s code might be more optimized than JITed code;
  • grabbing an existing symbol is cheaper than JITing.

That’s why, if the symbol can be found in a library, cling will use it, even though it could compile the code itself. I would not call this part fragile.

That does not address the slowness of the JIT-compiled code. I will have a look at it (likely next week). If you find the time before then, can you check the speed when built with clang 5.0 (you could take /cvmfs/sft.cern.ch/lcg/contrib/llvm/5.0/x86_64-centos7/) and with the ROOT master (/cvmfs/sft-nightlies.cern.ch/lcg/nightlies/dev3/Tue/ROOT/HEAD/)?

Cheers, Axel.

Hi Axel,

Thanks for the quick reply!

Firstly, I checked:

  • Building the shared library with clang 5.0 (I still build the test application with gcc/ROOT from LCG_96b)
  • Building everything with the dev3/Tue nightly

Neither of these changed the overall picture: the code loaded from the shared library is still much faster. I pushed changes to my repository to enable this (not super clean; some lines in run.sh need to be [un]commented to reproduce).

Regarding my ‘fragile’ comment, the behaviour that led me to say this is the following: if cling has already JIT-ed a symbol, it seems to prefer re-using that JIT-ed symbol, even if a library providing the symbol has been loaded since. This is what test4 in my setup probes.

I am worried by this because it seems to mean that it is not enough to write a bit of code that makes sure relevant libraries are loaded before calling cling: the speed of the code cling returns will still depend on global state (i.e. whether someone else asked cling to JIT anything relevant earlier in the session). If I want to use gInterpreter (I don’t know a simple alternative to it), then this is out of my control.
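
Schematically, the ordering probed by test4 is something like this (again a sketch with a placeholder library name, not the actual test code):

#include <dlfcn.h>
#include "TInterpreter.h"

// Sketch of the test4-like ordering: cling JITs the relevant symbols first ...
void test4_like_ordering() {
  gInterpreter->Declare( "#include \"test.h\"\n"
                         "auto early_instance = make( 1.f );" );
  // ... and only then is the library providing the same symbols loaded.
  void* handle = dlopen( "libTest.so", RTLD_NOW | RTLD_GLOBAL );  // placeholder name
  // Subsequent Declare/Calc calls appear to keep using the already-JITed
  // (slow) symbols rather than the library's compiled ones.
  gInterpreter->Declare( "auto late_instance = make( 1.f );" );
  (void)handle;
}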

Cheers, Olli

Hi Olli,

My goal is to fix the (lack of) speed of cling here; then the order/source of symbol resolution shouldn’t be as relevant anymore. I do understand that the “global picture” influencing this is surprising, but I don’t see a good way out.

Thanks for the updated “numbers”; I’ll keep you posted (next week at the earliest).

Axel.

I’m still patching cling for the optimization level, as I don’t think it works as-is (and as you use it above; it’s an ordering problem), as well as for proper use of inlining. Then, the NullDerefProtectionTransformer pass can be an absolute killer (50000x in esoteric cases), so I remove that, too. See here:

https://bitbucket.org/wlav/cppyy-backend/src/master/cling/patches/optlevel2_forced.diff

I have a few more fixes for templates as well, but I don’t think that’s relevant here.

Yes, that’s something you can try before I look into this: add an __attribute__((annotate("__cling__ptrcheck(off)"))) if you don’t use ROOT, or R__CLING_PTRCHECK(off) if you do. You can do that on the class or on the functions where you “know what you are doing” and where users cannot throw random pointers in. IIRC you already tweaked the optimization level; did you also do that when the code is actually run (which is when the JITing really happens)?
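
For example, something of this shape (a sketch; Functor and compute() are placeholder names):

// Sketch: put the annotation on the declarations you trust. R__CLING_PTRCHECK
// is ROOT's macro wrapper around the annotate attribute shown above.
struct Functor {
  R__CLING_PTRCHECK(off) float operator()() const { return compute(); }
  float compute() const { return 1.f; }
};
// Without ROOT, the raw attribute spells the same thing:
// __attribute__((annotate("__cling__ptrcheck(off)"))) float operator()() const;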

Hi,

Thanks for the suggestion. I tried adding R__CLING_PTRCHECK(off) “everywhere” (to every class template, member function and function template) but didn’t see any significant change.

Because I didn’t know how to apply R__CLING_PTRCHECK(off) to standard library functions, I also tried replacing calls like std::invoke( m_f ) with m_f(); interestingly, that reduced the cling/gcc difference by a factor of ~3. I’m not quite sure what to make of that.
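
The shape of the change was roughly this (a sketch; FunctorLike and m_f are stand-ins for the real class and its stored callable):

#include <functional>  // std::invoke

// Sketch of the substitution described above.
template <typename F>
struct FunctorLike {
  F m_f;
  // before: float operator()() const { return std::invoke( m_f ); }
  float operator()() const { return m_f(); }  // after: direct call
};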

Regarding optimisation, my test basically does:

#pragma cling optimize(3)
#include "test.h"
R__CLING_PTRCHECK(off) std::unique_ptr<AnyFunctor> func() {
  // make( 1.f ) returns a nested template type instance
  // Functor<float()> is some std::function<float()>-like type
  // AnyFunctor is a base class of Functor<R(T)>
  return std::make_unique<Functor<float()>>( make( 1.f ) );
}
R__CLING_PTRCHECK(off) auto const func_addr = func;

inside one call to gInterpreter->Declare and then calls gInterpreter->Calc( "func_addr" ). The difference between the test0 and test1 test cases is the optimisation level requested (0 or 3).
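
The driving C++ side described above is then, schematically (a sketch, not the exact test application; build_via_cling is a made-up helper name):

#include <memory>
#include "TInterpreter.h"
#include "test.h"  // for AnyFunctor, as in the repository

// Sketch of the driver: `code` holds the snippet above as a string.
std::unique_ptr<AnyFunctor> build_via_cling( const char* code ) {
  gInterpreter->Declare( code );  // JITs func() and func_addr
  auto factory = reinterpret_cast<std::unique_ptr<AnyFunctor> ( * )()>(
      gInterpreter->Calc( "func_addr" ) );
  return factory();  // calls the JITed (or library-provided) code
}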

Cheers, Olli

Hi Olli,

That’s done automatically.

Thanks for trying all this; I will have a look soon.

Axel.

I tried ROOT master + my cppyy-cling patch and with that the results equalized, see below. What did surprise me was that setting EXTRA_CLING_ARGS=-O2 during the build actually made things much worse for the first cling call in each test (see at the bottom). I went back to v6.18.02 and that version does not have this particular problem, so it seems to be a new modules “feature.”

6.19/01
test0 -- only cling, no optimisation, all similar speed
cling_flt_flt 247210580 ticks
cling_flt_dbl 252913002 ticks
cling_dbl_dbl 254950912 ticks
test1 -- only cling, all similar speed
cling_flt_flt 22766 ticks
cling_flt_dbl 24838 ticks
cling_dbl_dbl 25626 ticks
test2 -- loading library, cling is fast when the signature matches exactly (cling_flt_flt)
dlopenlibrary 22738 ticks
cling_flt_1.f 22604 ticks
cling_flt_2.f 23760 ticks
cling_flt_dbl 26322 ticks
cling_dbl_dbl 25984 ticks
test3 -- load the library but unload it afterwards, cling is always slow
dlopenlibrary 22768 ticks
cling_flt_1.f 22780 ticks
cling_flt_2.f 22774 ticks
cling_flt_dbl 24858 ticks
cling_dbl_dbl 24738 ticks
test4 -- same as test2 but invoking cling once before loading the library with a signature matching the library
cling_flt_flt 22718 ticks
dlopenlibrary 22716 ticks
cling_flt_1.f 22720 ticks
cling_flt_2.f 22718 ticks
cling_flt_dbl 24804 ticks
cling_dbl_dbl 24914 ticks
test5 -- same as test4 but the initial cling invocation uses a signature different to that in the library
cling_dbl_flt 22718 ticks
dlopenlibrary 22818 ticks
cling_flt_1.f 22718 ticks
cling_flt_2.f 22710 ticks
cling_flt_dbl 24634 ticks
cling_dbl_dbl 24900 ticks

and with -O2 in EXTRA_CLING_ARGS during build:

6.19/01
test0 -- only cling, no optimisation, all similar speed
cling_flt_flt 241953310 ticks
cling_flt_dbl 241842444 ticks
cling_dbl_dbl 241922820 ticks
test1 -- only cling, all similar speed
cling_flt_flt 45302 ticks
cling_flt_dbl 24636 ticks
cling_dbl_dbl 24796 ticks
test2 -- loading library, cling is fast when the signature matches exactly (cling_flt_flt)
dlopenlibrary 22754 ticks
cling_flt_1.f 45520 ticks
cling_flt_2.f 22714 ticks
cling_flt_dbl 24512 ticks
cling_dbl_dbl 24942 ticks
test3 -- load the library but unload it afterwards, cling is always slow
dlopenlibrary 22778 ticks
cling_flt_1.f 23696 ticks
cling_flt_2.f 22714 ticks
cling_flt_dbl 24638 ticks
cling_dbl_dbl 24942 ticks
test4 -- same as test2 but invoking cling once before loading the library with a signature matching the library
cling_flt_flt 22742 ticks
dlopenlibrary 22824 ticks
cling_flt_1.f 22852 ticks
cling_flt_2.f 22844 ticks
cling_flt_dbl 24648 ticks
cling_dbl_dbl 24804 ticks
test5 -- same as test4 but the initial cling invocation uses a signature different to that in the library
cling_dbl_flt 45204 ticks
dlopenlibrary 22758 ticks
cling_flt_1.f 22898 ticks
cling_flt_2.f 22892 ticks
cling_flt_dbl 24702 ticks
cling_dbl_dbl 24824 ticks

Not surprised; we build cling and llvm with -O3, for good reason. I’ll look at the rest later this week.

Axel

The optimization level of the build of the compiler does not affect the optimization level of the code that it subsequently compiles.

The EXTRA_CLING_ARGS during the build only target the precompiled header/modules of the ROOT libs. I do not know the new scheme yet, but in the old one, the build of the precompiled header only sees what’s in allCppflags.txt as picked up by rootcling. This doesn’t contain any optimization arguments, so only Cling’s default optimization level applies.

Anyway, I redid the build with EXTRA_CLING_ARGS=-O3 and no, it doesn’t make any difference. Actually, the outliers are unstable: re-running a bunch of times shows different ones to be slower, and re-trying shows them also in 6.18, so, contrary to what I thought before, it’s not a 6.19 (modules) thing.

Thanks for your investigations, and it’s obviously promising that your patched cling manages to avoid the slowdown entirely!

I should have mentioned that the benchmark can be a little unstable (to within factors of 2 or so); this is partly why I tuned it to make the gross effect bigger.

One follow-up question, triggered by this discussion of EXTRA_CLING_ARGS: what is the outlook for being able to pass other ‘compiler’ flags to cling/gInterpreter in a ROOT/LCG[/Gaudi/LHCb] context? Specifically I am thinking about -m flags to enable FMA/AVX2/AVX512. Is this [going to be] possible without building the full ROOT/LCG stack with the same -m flags?

In the case of Clang, AVX involves headers rather than builtins. It needs to be enabled as part of building the precompiled header, or you’re out of luck. Although vanilla ROOT can point to a different PCH (the ROOT_PCH envar), rebuilding it after installation is not easy (but scriptable, so solvable), and rootcling does not respect EXTRA_CLING_ARGS (I patched that, too).

Having the PCH as part of the build (and, worse, packaging it with the binary distribution) causes similar problems for OpenMP support, the ability to switch the language standard, optimizations, and, most of all, portability.

It looks like in 6.19 the PCH is gone, in favor of a thing called “onepcm”. I’ve not looked into that in detail yet, but fundamentally the issues haven’t changed, unless some post-install features have now been added.

Hi Axel, Just wondering if you managed to take a look at this in the end? Cheers, Olli
