TF1 instantiation is extremely slow

We have a fairly complex simulation, which at one point uses CVODE to solve a system of coupled differential equations. To initialize the system, we make a one-time use of TF1 and BrentRootFinder to find the starting (zero) state.

It appears that the TF1 construtor is extraordinarily, and variably, slow. On both a CentOS7 compute node (64 GB, 32 cores) and my MacBook Pro (dual core), the one line

TF1 f1("Function", Tdot, 0., 6., 12);

takes anywhere from 14 to 96 seconds. Since Tdot is a compiled global function, this doesn’t make a lot of sense to me. Is there something we could/should do to speed it up?


Please read tips for efficient and successful posting and posting code

ROOT Version: 6.12, 6.18
Platform: MacOS, CentOS7
Compiler: LLVM, GCC


Something is severely broken here. Lets start with the obvious: what’s Tdot? Do you have a reproducer that shows this?

Are things still slow if you call gSystem->Load("libHist") before?

Tdot is a C++ function defined in the same .cc file:

namespace {
    G4double T5diff(G4double a, G4double b) {     // a^5-b^5 avoid cancellations
    G4double b2=b*b, b3=b2*b, b4=b3*b;
    return (a-b) * (a*(a*(a*(a+b) + b2) + b3) + b4);
  }
}

G4double G4DMCPulsesTES_ODE::Tdot(G4double* x, G4double* par) { // x = IoverIc
  const G4double& IoverIc = x[0];

  const G4double& A = par[0];
  const G4double& Ksubst = par[1];
  const G4double& Tsubst = par[2];
  const G4double& Tw = par[3];
  const G4double& Tc = par[4];
  const G4double& Ic = par[5];
  const G4double& Rn = par[6];
  const G4double& heatcap = par[7];
  const G4double& nsc = par[8];
  const G4double& PP = par[9];
  const G4int verboseLevel = G4int(par[10]);

  if (heatcap <= 0.) return 0.;

  G4double Ti = M_LN2*Tw*A+Tc*pow(std::max(0.,1.-fabs(IoverIc)),nsc);
  G4double KdeltaT5 = Ksubst * T5diff(Tsubst,Ti);
  G4double I2R = IoverIc*Ic*IoverIc*Ic*Rn/(1+exp(-2.*A));

  if (verboseLevel>3) {
    G4cout << "G4DMCPulsesTES_ODE::Tdot IoverIc " << IoverIc
           << " Ti " << Ti/kelvin << " K "
           << "\n log(2)*Tw*A = " << M_LN2*Tw*A
           << "  fabs(IoverIc) = " << fabs(IoverIc)
           << "\n  std::max() = " << std::max(0.,1.-fabs(IoverIc))
           << "  Tc*pow(max,2/3) = " << Tc*pow(std::max(0.,1.-fabs(IoverIc)),nsc)
           << "\n KdeltaT5 " << KdeltaT5/(1e-9*watt) << " nW "
       << " I2R " << I2R/(1e-9*watt) << " nW "
       << " returning " << (KdeltaT5 + I2R)/heatcap << G4endl;
  }

  return (KdeltaT5 + I2R + PP)/(M_LN2*Tw*heatcap);
}

We use the same function in our (non-ROOT) ODE solver code, and it is not slow to evaluate. Does gSystem->Load("libHist") typically need to be done in a compiled executable?

This demo shows the problem. I just ran it and got 70 sec for the TF1 ctor:

Demonstrating slow TF1 constructor
Before ctor: Tue May 25 10:39:31 2021

After  ctor: Tue May 25 10:40:41 2021

tf1_slow.cc (1.3 KB)

Followup: Running repeatedly right away, the lag goes to zero. Waiting a while, switching tasks, the lag comes back.

Additional followup: I commented out all of the body of the Tdot function above, so it consists entirely of:

double Tdot(double* x, double* par) {
  return 0;
}

Recompiling and running again, it took 48 seconds to run the TF1 constructor. Clearly, the content of the function (as I expected) has no bearing on the slowness. As I’ve noted, we do not see this behaviour with any other ROOT stuff that we use (histograms, TTrees, etc.). It is exclusively the TF1 constructor that impacts our jobs.

That’s all compiled code? Including TF1 f1("Function", Tdot, 0., 6., 12);?

Yes. The original problem is part of our SuperCDMS simulation, which is based on Geant4 and uses ROOT (with mutexes) for I/O. The demo I attached above is also compiled with a very simple Make file:

CPPFLAGS := $(shell root-config --cflags)
LDFLAGS := $(shell root-config --libs)
all : tf1_slow

Thanks. Are there network filesystems in the $LD_LIBRARY_PATH?

That doesn’t seem to be relevant. On the CentOS7 HPC system, there are, but on my personal MacBook, everything is completely local.

Thanks. Then we need some sort of trace - callgrind, strace -t, or similar tools. Please also let me know what time says for the simple reproducer: I bet it’s not CPU time.

Just a disclaimer: the questions above were ruling out common causes. I totally believe that this is an issue with ROOT, it’s just that we have never seen this before, we have no other similar reports, and I cannot reproduce it with the example you sent. So we need to dissect your case.

Okay. I’m gone for a week starting today, but we do need to get this sorted out. It’s turned out to be a significant (albeit not the most significant) portion of initialization for one class of our simulation jobs.

Ugh. So MacOS doesn’t have strace, but it has dtrace with a dtruss wrapper to act like strace. But the Mac “system integrity protection” won’t forward DYLD_LIBRARY_PATH to any kind of child process, so dtruss can’t launch the demonstrator.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.