Hello,
In an old piece of code we use in ATLAS, there is an expression parser that uses TMethodCall::Execute (e.g. "double ret; void *obj = ...; m_method->Execute(obj, ret);") to invoke methods of an object, e.g. Muon::pt() const. This code was recently ported to a multi-threaded environment and it performs horribly, likely because TMethodCall::Execute is protected internally by a TVirtualRWMutex::Lock; at least that is what shows up in profiles. There must be a better way, since RDataFrame does not suffer from this problem.
Indeed, in the presence of critical sections, MT code does not really scale.
And you are right: with a good design, all the power of the hardware can be squeezed out!
To solve this issue well, more context is probably needed.
Could you perhaps share more details about your setup and what you want to achieve? There could be several options, depending on the situation.
Hi Danilo,
For filtering we have a tool which filters events or objects based on expressions like muon.pt > 20 && muon.eta > -2.4 && muon.eta < 2.4. Here muon is a container name and pt is a method of a container object, for which there is a TClass which knows about the methods, e.g. pt and eta. The expression is parsed during initialization and executed at runtime; the data is provided by the event store we use in ATLAS through handles.

In principle TClass knows that the method exists and that it has the correct signature, i.e. something like double (const T *obj). TClass does not know whether it would be safe to call the method in an MT context, I suppose; it only knows that the method is declared const. Anyway, Execute is mutex protected.

I did not profile this myself, so I do not know exactly what was done. It is also possible that the VTune instrumentation makes these calls appear more expensive than they are without instrumentation. In any case the mutex protection seems unnecessary here, and since RDataFrame makes it possible to define "arbitrary" methods which can be executed without such a penalty, there must be a better way.
Thanks !
Götz
P.S.:
To call e.g. pt we use, once, I think, something like TMethodCall::InitWithPrototype("....::Muon", "pt", "", false, kConversionMatch)
(btw. the false seems wrong ...). And then TMethodCall::Execute(obj, ret_dbl) for, e.g., every event.
When calling the interpreter via gROOT->ProcessLine, we release the ROOT lock during the execution of the user code, so that other work can proceed in parallel. From your description it looks like TMethodCall::Execute is not properly releasing the lock (i.e. as it should) upon calling the user code.
Thanks.
Do you have evidence, e.g. a scaling issue, that these calls are hindering parallelism? We have seen in the past a few instances of VTune not being fully reliable when used in combination with ROOT’s global mutex.
I did not do the profiling myself, but I can do some further investigations. Without any particular profiling it was observed that the code scales very poorly with the number of threads, i.e. already with 2 threads the CPU efficiency drops. So there is a problem somewhere. A VTune profile (collect threading) shows significant lock contention in ROOT::TVirtualRWMutex::Lock, called by TClingCallFunc::exec, which is called by TMethodCall::Execute.

My personal experience is that lock contention often appears amplified because, I think, profiling the contention needs additional instrumentation, which may add significant overhead if the lock-protected scope is very small but visited very often. In that case the actual problem (a lock-protected scope which is entered significantly less often but held longer) may be overshadowed.

Here TMethodCall is called per event and per object. Typically the objects in filter expressions are jets, muons, electrons and taus, so they are not super frequent; though there might be tracks, in which case the frequency could be high. And the call count is large. In that sense I would be a bit careful about saying that this is the problem, but it is what the profile currently points at.
I had a look at the code, and the scaling issue is not connected to the method call itself: as @pcanal mentioned above, the lock is indeed released just before the call.
The problem, however, is that TClingCallFunc::exec does a lot of setup before the call, and it has to redo it for every invocation.
So we clearly need a way to memorise the information retrieved from the interpreter, so that the call can proceed without taking the lock again. RDataFrame does this by JITting functions and saving their pointers.
I will later look at InitWithPrototype, because this sounds like it could go in the right direction.
@pcanal found a much faster way that needs the interpreter only for initialisation:
If you can guarantee that the types of the arguments don't change (or do the conversion yourself), one can get the function pointer of the method directly. The way to do it is shown here:
Once you have the function pointer (creator in this case), it can be called as
creator(object, <number of args>, <array of ptr to args>, <address of the result>);
The typedef you see at the very top is also publicly available if you include TClingCallFunc.h.
So the minimum example would look something like:
#include <TClass.h>
#include <TClingCallFunc.h>
tcling_callfunc_Wrapper_t getTrampolineFunction([...]) {
   R__LOCKGUARD(gROOTMutex);
   auto interfaceMethod = TClass::GetClass("TH1D")->GetMethod("MethodName", "Args")->InterfaceMethod();
   return (tcling_callfunc_Wrapper_t)interfaceMethod;
}
ResultType call([...]) {
   void *args[] = { &mode, &size };
   ResultType result;
   trampolineFunction(object, 2, args, &result); // <-- (thisPtr, number of args, ptr to array of ptrs to args, address of the result)
   return result;
}
That works very well, thank you very much! I still have a problem with the argument list. I would like to create the prototype from types, e.g. template <typename... T> std::string makeProto(), but for this I need something to convert each T into a normalized type name. I only found ROOT::Internal::GetDemangledTypeName, which does not sound like a robust solution. Is there a more official solution?