JIT performance issue

Dear ROOT-Team,

I want to document a JIT performance issue.
(Solution at the end of the post.)
I compared the performance of JITed and compiled functors that are called from multiple threads.

Here is the code snippet for the g++ compiled functor:

struct CompiledFunctor : public Functor {
  double operator()(double* dp) { return 1.1 * (*dp); }
  double operator()(double d) { return 1.1 * d; }
};

Please find the complete reproducer at [1].
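For readers without access to [1], a measurement of this kind can be sketched roughly as follows. This is not the reproducer's code: the `Functor` base class, the `Timing` struct and the `TimePointerCalls` helper are hypothetical names made up for this illustration, and the actual benchmark may be structured differently.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for the base class used in the post.
struct Functor {
  virtual ~Functor() = default;
  virtual double operator()(double* dp) = 0;
  virtual double operator()(double d) = 0;
};

struct CompiledFunctor : public Functor {
  double operator()(double* dp) override { return 1.1 * (*dp); }
  double operator()(double d) override { return 1.1 * d; }
};

// Result of one run: wall-clock time and a checksum that keeps the
// compiler from discarding the calls.
struct Timing {
  double elapsed_ms;
  double checksum;
};

// Calls the pointer overload `calls_per_thread` times on each of
// `num_threads` threads and measures the total wall-clock time.
Timing TimePointerCalls(Functor& f, unsigned num_threads, long calls_per_thread) {
  assert(num_threads > 0);
  std::vector<double> sums(num_threads, 0.0);
  std::vector<std::thread> workers;
  auto start = std::chrono::steady_clock::now();
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&f, &sums, t, calls_per_thread] {
      double value = 1.0;
      double local = 0.0;  // accumulate locally, write the shared slot once
      for (long i = 0; i < calls_per_thread; ++i) {
        local += f(&value);  // selects operator()(double*)
      }
      sums[t] = local;
    });
  }
  for (auto& w : workers) w.join();
  auto stop = std::chrono::steady_clock::now();
  double checksum = 0.0;
  for (double s : sums) checksum += s;
  double ms = std::chrono::duration<double, std::milli>(stop - start).count();
  return {ms, checksum};
}
```

Passing a JITed object through the same `Functor&` interface would give the "jit (pointer)" number; an analogous helper calling `f(value)` would cover the "double" rows.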

Below are the performance results on my Ubuntu 16.04 laptop using g++ 5.5 (4 threads).
Results are in ms.

ROOT v6.18/04

jit (pointer)      38899
jit (double)       1417
compiled (pointer) 1011
compiled (double)  938

ROOT v6.20/02 from [2]

jit (pointer)      17435
jit (double)       1454
compiled (pointer) 953
compiled (double)  977

With a pointer argument, the JITed functor is roughly 38x (v6.18/04) and 18x (v6.20/02) slower than the compiled one, i.e. more than an order of magnitude.

Adding R__CLING_PTRCHECK(off) or #pragma cling optimize(3) didn’t change the results.

Solution
I came across [3] and also tested the patch from @wlav (see file cling-performance.patch in [1]).
This solved the issue!

ROOT v6.20/02 + cling-performance.patch

jit (pointer)      976
jit (double)       1117
compiled (pointer) 982
compiled (double)  923

  • Do you plan to add this patch to master?
  • Is there a way to set the cling optimization options at compile time / runtime?

Many thanks for your help!

Lukas

[1] https://github.com/breitwieserCern/jit-performance-issue
[2] https://root.cern/download/root_v6.20.02.Linux-ubuntu16-x86_64-gcc5.4.tar.gz
[3] Cling JIT-ed code speed & interaction with shared libraries

Thanks! It looks interesting! @Axel or @vvassilev might be interested in taking a closer look and commenting.

Thanks, Lukas! Can you change this https://github.com/breitwieserCern/jit-performance-issue/blob/aadd8e5f0093bbf72a8a13cf10dd103c87069f62/main.cc#L27 to

auto* jit = (Functor*)gInterpreter->Calc("#pragma cling optimize(3)\n new JitFunctor();");

Does that help?

Hi Axel,

Thanks for your quick reply!
Unfortunately, the suggested change has no effect on the results with v6.20/02.

I also added the pragma before I define the functor and enabled R__CLING_PTRCHECK(off). [1]
I profiled [1] with VTune and found that the majority of the time is spent in the function cling::utils::platform::IsMemoryValid.
Shouldn’t R__CLING_PTRCHECK(off) disable these checks?

The following screenshot shows the VTune hotspot analysis filtered for the “jit (pointer)” benchmark.

[1] https://github.com/breitwieserCern/jit-performance-issue/commit/cfd1a3c56b3dec30a89e3d6b41ccf455060e3356

You need #pragma cling optimize at the point where code generation happens, and that is when the function symbol is needed, i.e. at Calc().

Disabling the pointer check is done on scope level; try this:

struct R__CLING_PTRCHECK(off) JitFunctor : public Functor {
  double operator()(double* dp) { return 1.1 * (*dp); }
  double operator()(double d) { return 1.1 * d; }
};

Thanks for the clarification!
Unfortunately, your suggestion does not impact the runtime.

Adding this patch to cling is a little tricky. For a while we used optimization level -O2 by default. It turned out that this made most code about 50% slower on average: most JITed code is not heavily used, so the time the optimizer spent on it was never recovered. That is why #pragma cling optimize is so useful for annotating heavily used code.

If we can conditionally enable/disable the inline pass, that would likely solve the issue with #pragma cling optimize.

Well, but Lukas has reported here that #pragma cling optimize doesn’t work, and we must fix that. I’ll get to it either today or in about 10 days (I’m off next week…)

From vtune’s output it seems that R__CLING_PTRCHECK(off) should have a large impact though. Could it be that’s what does not work?

Yes, it would be nice if the advertised workarounds actually worked :wink:

I can reproduce the bug.

What about this:

$ ./jit 
jit (pointer)      423
jit (double)       425
compiled (pointer) 410
compiled (double)  427

main.cc (1.1 KB)

I.e. as I said, #pragma cling optimize must appear where code generation happens, i.e. at the point of use of a symbol. R__CLING_PTRCHECK(off) is an annotation for the scope containing the function that should not see pointer checks. Declare actually disables pointer checks altogether.

The issue is that the code is passed through ProcessLine (or one of its siblings), so cling treats it as an expression, and that code path doesn’t bother to look for R__CLING_PTRCHECK because that is a declaration-level annotation.

Your options:

  • use Declare (strongly recommended)
  • wait for me to fix the interplay between the pointer check and cling’s extraction of declarations from cling’s evaluation wrapper.

Cheers, Axel.

Hi Axel and vvassilev,

Many thanks for looking into this!

@Axel:

I can confirm that runtimes are almost equal with your modifications.

There is one drawback to using Declare:
all required headers must be included.

While this is no problem in the reproducer, my real world use case is a bit more complex.
The body of the functor depends on user input and might contain user-defined types.
The definitions of these types are available in dictionaries that are loaded into cling.
gInterpreter->ProcessLineSynch seems to take this information into account and compiles the functor without any include statements.

Is there a JIRA ticket to follow the progress of option 2? :slight_smile:

@vvassilev
I am not sure whether I understood the 50% slowdown correctly.
Does the cost of running the optimizer outweigh the performance gains of the generated code?

Cheers,
Lukas

Yes, we found that the cost of running the optimizer outweighed the gains from the better code. This is a classic JIT trade-off.
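The trade-off can be made concrete with a small break-even calculation. The sketch below is only an illustration: `TotalMs` and `BreakEvenCalls` are hypothetical helpers, and the numbers in the usage note are invented, not measurements from cling.

```cpp
#include <cassert>
#include <cmath>

// Total time for a JITed function: one-time compilation cost plus
// the accumulated cost of all calls. All inputs are in milliseconds.
double TotalMs(double compile_ms, double per_call_ms, double calls) {
  return compile_ms + per_call_ms * calls;
}

// Number of calls after which the extra time spent in the optimizer
// has been paid back by the faster generated code.
double BreakEvenCalls(double extra_compile_ms, double saved_per_call_ms) {
  assert(saved_per_call_ms > 0.0);
  return std::ceil(extra_compile_ms / saved_per_call_ms);
}
```

For example, if optimizing costs an extra 50 ms and saves 0.25 ms per call, `BreakEvenCalls(50.0, 0.25)` gives 200 calls. Code that runs fewer times than that is slower overall with the optimizer enabled, which is why a per-function annotation for hot code is attractive.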

Jira issue is https://sft.its.cern.ch/jira/browse/ROOT-10707