Dear experts,
I have been running some tests comparing the execution speed of the same code compiled by cling/clang/gcc.
In the real project, the speed differences have been 1-2 orders of magnitude, but the example I pushed to https://gitlab.cern.ch/olupton/clingtests has been tuned to be a bit more extreme (~4 orders of magnitude).
The code in that repository should just run (sh run.sh) on any modern-ish system with /cvmfs/sft.cern.ch mounted. It uses ROOT 6.18/04 from LCG_96b.
The output from an illustrative run of that script is shown here: https://gitlab.cern.ch/olupton/clingtests/blob/master/example_run.txt
The expression that’s evaluated is, in all cases, the function call operator of some deeply-nested template type that traverses the nested members. It should all be inlinable.
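For illustration only, here is a minimal sketch of the kind of type described above. It is not the type from the repository: the names Leaf, AddOne and Nested, and the depth of 16, are hypothetical and chosen just to show a nested template whose call operator walks down through its members.

```cpp
#include <cstddef>

// Hypothetical innermost functor: returns its argument unchanged.
struct Leaf {
  double operator()(double x) const { return x; }
};

// Hypothetical wrapper: forwards to the wrapped member and adds 1.
template <typename Inner>
struct AddOne {
  Inner inner;
  double operator()(double x) const { return inner(x) + 1.0; }
};

// Builds a type nested N levels deep: AddOne<AddOne<...<Leaf>...>>.
template <std::size_t N>
struct Nested {
  using type = AddOne<typename Nested<N - 1>::type>;
};

// Recursion ends at the leaf.
template <>
struct Nested<0> {
  using type = Leaf;
};

// Assumed depth, picked arbitrarily for this sketch.
using Deep = Nested<16>::type;
```

Every call in the chain is a small, non-virtual member function visible in the same translation unit, so an optimising compiler has the opportunity to inline the whole traversal into a single expression.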
The test setup contains a shared library (compiled with gcc) that provides a factory for an instance of this type, and I also create instances of the type using cling.
I then repeatedly evaluate them to measure the execution speed.
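A timing loop of this kind could look roughly like the sketch below. This is an assumption about the general approach, not the code in the repository: the helper name nanoseconds_per_call and its parameters are hypothetical.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls f repeatedly and returns the mean wall-clock
// time per call in nanoseconds. The accumulated result is written to `sink`
// so that the calls have an observable effect and are not dead code.
template <typename F>
double nanoseconds_per_call(F const& f, double x, std::size_t iterations,
                            double& sink) {
  auto const start = std::chrono::steady_clock::now();
  double sum = 0.0;
  for (std::size_t i = 0; i < iterations; ++i) {
    // Vary the argument so each iteration does distinct work.
    sum += f(x + static_cast<double>(i));
  }
  auto const stop = std::chrono::steady_clock::now();
  sink = sum;
  std::chrono::duration<double, std::nano> const elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

The same helper can be applied to an instance obtained from the shared-library factory and to one created in cling, so that both are measured by identical code.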
I have enabled cling’s code optimisation (#pragma cling optimize(3)), which helps a bit: perhaps a factor of 2, which can be hard to see in the test above and is in any case tiny compared to the cling/gcc difference.
My first question is simple: why is cling orders of magnitude slower, and is there anything I can do to improve things?
I was also interested to see that how and when I load the shared library has a significant impact on the execution speed of the code generated by cling, provided that the types match exactly.
Cling apparently finds and re-uses symbols from the shared library and manages to use the fast gcc-compiled code. This seems to make sense, but it is a little fragile:
- If I run the same test natively on macOS (minus sourcing the LCG view environment) then loading the library has no impact on the execution speed of the cling-compiled code
- It seems dependent on what has previously happened in the cling session (test4 in the repository)
So my second question is: is there anything I can do to make this behaviour less fragile?
On the face of it, it’s a really nice feature that could help maximise the usage of precompiled code linked/loaded into the application, but it risks being rather confusing to use if the execution speed varies by orders of magnitude depending on the order in which library loading and cling invocation occur.
Thanks in advance for your help,
Olli