Segfault from root when using python threading of numpy code with input from AsNumpy()

emanca · February 11, 2020, 2:18pm

Hi people! I have a question for you. I am using code that makes almost no use of ROOT, apart from taking some numpy arrays produced with AsNumpy in RDF as input to a scipy minimizer. The problem arises when I parallelise some independent pieces of the calculation (executed in numpy) using Python threading.
Here I sometimes get a segfault from ROOT. This is the error I am getting:

The lines below might hint at the cause of the crash [....]

# that might help us fixing this issue.

#6 0x00007f186ed1a0ed in getenv () from /lib64/libc.so.6

#7 0x00007f186d17cd89 in mkl_serv_getenv () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/../../../../libmkl_rt.so

#8 0x00007f185af92b61 in mkl_vml_kernel_ReadEnvVarMode () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/../../../../libmkl_vml_def.so

#9 0x00007f185af928f3 in mkl_vml_kernel_GetMode () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/../../../../libmkl_vml_def.so

#10 0x00007f185af928c6 in mkl_vml_kernel_GetTTableIndex () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/../../../../libmkl_vml_def.so

#11 0x00007f185e58b2e0 in vsLinearFrac () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/../../../../libmkl_intel_lp64.so

#12 0x00007f1869e94868 in trivial_three_operand_loop () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/umath.so

#13 0x00007f1869e942ed in execute_legacy_ufunc_loop.A () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/umath.so

#14 0x00007f1869e8297d in PyUFunc_GenericFunction.A () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/umath.so

#15 0x00007f1869e7e75e in ufunc_generic_call.A () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/umath.so

#16 0x00007f186fa2d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0

#17 0x00007f186fa2e29c in PyObject_CallFunctionObjArgs () from /lib64/libpython2.7.so.1.0

#18 0x00007f186daae19e in array_add () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/multiarray.so

#19 0x00007f1869ec63c5 in double_add () from /home/users/emanca/.local/lib/python2.7/site-packages/numpy/core/umath.so

#20 0x00007f186fa2989c in binary_op1 () from /lib64/libpython2.7.so.1.0

#21 0x00007f186fa2b511 in PyNumber_Add () from /lib64/libpython2.7.so.1.0

#22 0x00007f186fac1ecc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

#23 0x00007f186fac657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

#24 0x00007f186fac8efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0

#25 0x00007f186fa5294d in function_call () from /lib64/libpython2.7.so.1.0

#26 0x00007f186fa2d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0

#27 0x00007f186fac15bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

#28 0x00007f186fac657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

#29 0x00007f186fac657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

#30 0x00007f186fac8efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0

#31 0x00007f186fa52858 in function_call () from /lib64/libpython2.7.so.1.0

#32 0x00007f186fa2d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0

#33 0x00007f186fa3c995 in instancemethod_call () from /lib64/libpython2.7.so.1.0

#34 0x00007f186fa2d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0

#35 0x00007f186fabf7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0

#36 0x00007f186faf76e2 in t_bootstrap () from /lib64/libpython2.7.so.1.0

#37 0x00007f186f7c9e25 in start_thread () from /lib64/libpthread.so.0

#38 0x00007f186edda34d in clone () from /lib64/libc.so.6

and sometimes also this

Error in <TClingCallFunc::IFacePtr(kind)>: Attempt to get interface while invalid

repeated for every thread.

Do you have any idea? I am puzzled since I am actually not using root in my code. This is the function that gets called by a scipy minimiser:


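import Queue      # Python 2 module name (it is "queue" in Python 3)
import threading

import numpy as np

def nllSimul(x, nEtaBins, datasetJ, datasetZ, datasetJGen, datasetZGen):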

    idx2D = 0
    threads = []
    que = Queue.Queue()
    for idx in datasetZ:

        if datasetZ[idx]["mass"].shape[0]<1:
            continue

        i = idx[0]
        j = idx[1]

        idx2D+=1

        t = threading.Thread(target=lambda q,  x,nEtaBins,i,j,datasetZ,datasetZGen: q.put(nllZ(x,nEtaBins,i,j,datasetZ,datasetZGen)), args=(que, x,nEtaBins,i,j,datasetZ,datasetZGen))
        t.start()
        threads.append(t)

    # Join all the threads
    for t in threads:
        t.join()

    # Check thread's return value
    nll = []
    while not que.empty():
        #print que.get()
        nll.append(que.get())

    return np.sum(np.array(nll))

The function that gets multithreaded contains only numpy code. To repeat: only the input “dataset*” comes from ROOT.
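For readers who want to reproduce the pattern without the lambda and the queue, here is a minimal sketch of the same per-bin fan-out written with `concurrent.futures`. It assumes Python 3 (the post uses Python 2.7), and `nll_bin`, `nll_simul` and the toy data are hypothetical stand-ins, not the code from the post. It only restates the structure and makes no claim to avoid the segfault. Setting `max_workers=1` gives a serial baseline to compare against.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def nll_bin(x, data):
    """Hypothetical stand-in for nllZ: a per-bin term that uses only NumPy."""
    return float(np.sum((data - x[0]) ** 2))


def nll_simul(x, datasets, max_workers=4):
    """Sum the per-bin terms; max_workers=1 runs them serially."""
    # Skip empty bins, as the loop in the post does.
    bins = [d for d in datasets.values() if d.shape[0] >= 1]
    if max_workers == 1:
        return sum(nll_bin(x, d) for d in bins)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, so no queue is needed.
        return sum(pool.map(lambda d: nll_bin(x, d), bins))


# Toy inputs: two filled bins and one empty bin.
datasets = {(0, 0): np.arange(5.0), (0, 1): np.array([]), (1, 1): np.ones(3)}
x = np.array([1.0])

serial = nll_simul(x, datasets, max_workers=1)
threaded = nll_simul(x, datasets, max_workers=4)
print(serial, threaded)
```

Comparing `serial` and `threaded` on the real inputs is one way to confirm that the threaded path returns the same value when it does not crash.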

Thanks in advance!

Cheers,

Elisabetta

swunsch · February 11, 2020, 3:18pm

Hi!

Just to verify: If you don’t use threading, everything is fine?

I don’t see where this could happen. @etejedor Do you have an idea?

Could you verify that the datasets are valid inside the nllSimul function? Just print parts of it or so.
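A minimal sketch of such a check, assuming the datasets are dictionaries of numeric numpy arrays keyed by bin index, as in the posted function. The helper names `describe_array` and `check_dataset` and the toy values are made up for illustration.

```python
import numpy as np


def describe_array(name, arr):
    """Return a one-line summary of a numeric array used as fit input."""
    arr = np.asarray(arr)
    return "{}: shape={} dtype={} contiguous={} owns_data={} finite={}".format(
        name,
        arr.shape,
        arr.dtype,
        arr.flags["C_CONTIGUOUS"],
        arr.flags["OWNDATA"],  # False if the array wraps memory it does not own
        bool(np.all(np.isfinite(arr))),
    )


def check_dataset(dataset, columns=("mass", "massErr")):
    """Summarise the requested columns for every bin in the dataset."""
    lines = []
    for key, cols in dataset.items():
        for col in columns:
            lines.append(describe_array("{}[{}]".format(key, col), cols[col]))
    return lines


# Toy dataset with one bin; the NaN is there to show what a bad entry looks like.
toy = {(0, 0): {"mass": np.array([90.5, 91.2]),
                "massErr": np.array([1.1, np.nan])}}
report = check_dataset(toy)
for line in report:
    print(line)
```

Calling `check_dataset` at the top of nllSimul, and again from inside a worker thread, would show whether the arrays look the same in both places.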

Best
Stefan

emanca · February 11, 2020, 3:28pm

Hi Stefan! Yes, if I don’t use threading I don’t have any problem.
Actually, the segfault is non-deterministic it seems: it just happens “sometimes”.
I can try and see if there is a connection between the segfault and the datasets not being valid.

Thanks!

Elisabetta

swunsch · February 11, 2020, 3:39pm

Thanks for checking!

I’ve also checked that the Queue is thread-safe.

Next, the nllZ function you put into the lambda could cause the issue. If we can trust the stacktrace, you do numpy operations which are not thread-safe and cause a crash there. Look for array additions, since we have the symbols double_add, PyNumber_Add and array_add in the stacktrace.

Is the stacktrace always the same if it appears?
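One hedged way to test the thread-safety hypothesis is to serialise the numpy work behind a single lock and see whether the crash still occurs. The sketch below assumes Python 3. `guarded_term` and `_numpy_lock` are hypothetical names, and the arithmetic is a placeholder for the body of nllZ. Holding the lock removes all parallelism, so this is a diagnostic, not a fix.

```python
import threading

import numpy as np

# Hypothetical: one lock shared by every worker thread.
_numpy_lock = threading.Lock()


def guarded_term(x, data, results, key):
    """Compute a placeholder per-bin term with the NumPy calls serialised."""
    with _numpy_lock:
        results[key] = float(np.sum(np.exp(-0.5 * (data - x) ** 2)))


# Toy inputs: four independent "bins".
data = {k: np.linspace(0.0, 1.0, 100) + k for k in range(4)}
results = {}

threads = [
    threading.Thread(target=guarded_term, args=(0.5, d, results, k))
    for k, d in data.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results.values())
print(total)
```

If the segfault disappears with the lock in place and returns without it, that points at concurrent execution of the numpy calls rather than at the inputs.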

emanca · February 11, 2020, 3:49pm

Yes, the stack trace is the same. Here is the code of the nllZ function:
It computes the negative log-likelihood of a complex PDF to fit the Z dimuon mass in a given bin of eta of the two muons. All the bins are independent, which is why in principle they can be parallelised.

The PDF is indeed built using numpy.sum(), but it shouldn’t interfere with the other bins. The only thing the bins share is that all the functions must have read access to the vector x simultaneously.

Cheers,
Elisabetta

def nllZ(x,nEtaBins,i,j,dataset,datasetGen):
        
    z = dataset[(i,j)]["mass"]
    
    ieta1, _ = roll1Dto2D(i,1)
    ieta2, _ = roll1Dto2D(j,1)
    
    #retrieve parameter value

    A1 = x[ieta1]
    A2 = x[ieta2]
    M1 = x[nEtaBins+i]
    M2 = x[nEtaBins+j]

    #bin the genMass

    genMass = np.histogram(datasetGen[(i,j)]["genMass"], bins=1000, range=(75.,115.))[0]
    vals = np.linspace(75.,115.,1000)

    #print mean,A1,A2,M1,M2

    term1 = A1+M1/dataset[(i,j)]["c1"]
    term2 = A2-M2/dataset[(i,j)]["c2"]

    h=np.outer(np.sqrt(term1*term2),vals)
    #dim_h = (nev,1000)

    sig = dataset[(i,j)]["massErr"]

    z_ext = z[:,np.newaxis]
    genMass_ext = genMass[np.newaxis,:]
    sig_ext = sig[:,np.newaxis]

    
    l = np.sum(genMass_ext*np.exp(-np.power(z_ext - h.astype('float64'), 2.) / (2 * np.power(sig_ext, 2.))),axis=1)
    #print l.shape
    
    nll = np.sum(np.log(l),dtype='float64')
    
    return -nll
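The broadcasting in the function above can be checked on a toy example. This sketch uses small made-up arrays in place of the real dataset (`scale` stands in for sqrt(term1*term2) and `gen_counts` for the genMass histogram) and only illustrates the shapes involved.

```python
import numpy as np

nev, nbins = 4, 6                        # toy sizes; the post uses 1000 bins
rng = np.random.RandomState(0)

z = rng.uniform(85.0, 95.0, nev)         # measured masses, shape (nev,)
sig = rng.uniform(1.0, 2.0, nev)         # per-event mass errors, shape (nev,)
scale = rng.uniform(0.99, 1.01, nev)     # stand-in for sqrt(term1*term2)
gen_counts = rng.randint(1, 10, nbins)   # stand-in for the genMass histogram
vals = np.linspace(75.0, 115.0, nbins)   # mass grid, shape (nbins,)

h = np.outer(scale, vals)                # shape (nev, nbins)
z_ext = z[:, np.newaxis]                 # shape (nev, 1)
sig_ext = sig[:, np.newaxis]             # shape (nev, 1)
gen_ext = gen_counts[np.newaxis, :]      # shape (1, nbins)

# Gaussian weight of every event against every grid point, shape (nev, nbins).
weights = np.exp(-(z_ext - h) ** 2 / (2 * sig_ext ** 2))

# Sum over the grid axis gives one likelihood value per event, shape (nev,).
like = np.sum(gen_ext * weights, axis=1)
nll = -np.sum(np.log(like))

print(h.shape, weights.shape, like.shape, nll)
```

Every intermediate here is a freshly allocated array, apart from the `np.newaxis` views of the inputs.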

swunsch · February 11, 2020, 4:10pm

Actually this looks fine. I cannot spot any issues, since you always write to new objects. However, be aware that operations such as z_ext = z[:, np.newaxis] don’t copy! They create a view on the original array.
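A small illustration of the view-versus-copy point, using toy values; `np.shares_memory` reports whether two arrays overlap in memory.

```python
import numpy as np

z = np.array([90.0, 91.0, 92.0])

z_ext = z[:, np.newaxis]           # a view: same buffer, new shape (3, 1)
z_copy = z[:, np.newaxis].copy()   # an independent copy

shares_view = np.shares_memory(z, z_ext)
shares_copy = np.shares_memory(z, z_copy)

# Writing to the original is visible through the view, but not the copy.
z[0] = -1.0
print(shares_view, shares_copy, z_ext[0, 0], z_copy[0, 0])
```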

I don’t know how we can help here. But to debug, I would probably put numpy.copy(...) around whatever you access from multiple threads, so that you can localize the operation that breaks your code.
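A coarse first step in that direction is to copy every input array once before the threads start, and then narrow the copies down if the crash goes away. The sketch below assumes the nested-dictionary layout used in the posted code; `copy_dataset` is a hypothetical helper.

```python
import numpy as np


def copy_dataset(dataset):
    """Return a new dict whose arrays are independent copies of the inputs."""
    return {
        key: {name: np.copy(arr) for name, arr in cols.items()}
        for key, cols in dataset.items()
    }


# Toy dataset with one bin.
original = {(0, 0): {"mass": np.array([90.5, 91.2]),
                     "massErr": np.array([1.1, 1.3])}}
private = copy_dataset(original)

independent = not np.shares_memory(original[(0, 0)]["mass"],
                                   private[(0, 0)]["mass"])
print(independent)
```

If the segfault persists even when the workers only ever see `private`, shared input buffers are unlikely to be the cause.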

@etejedor Do you understand how we catch a segfault in numpy from ROOT? Does this hint that we are causing this problem, and not one of the functions above?

etejedor · February 11, 2020, 4:49pm

I don’t see where ROOT can play a role here. Am I correct that there is no ROOT (PyROOT) used from the application? The stack trace does not show anything related to ROOT, and I don’t know where that TCling error can come from…

swunsch · February 11, 2020, 6:39pm

It’s using AsNumpy to read the dataset from a ROOT file. That is where ROOT comes in.
And the stacktrace is actually provided by ROOT. So here is the question: why is ROOT able to get the stacktrace of a numpy segfault?

pcanal · February 11, 2020, 7:39pm

Why ROOT is able to get the stacktrace of a numpy segfault?

Once the ROOT signal handler is loaded (so once libCore is loaded), it will report any crashes …
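The general mechanism is that signal handlers are installed per process, not per library, so a handler fires no matter which code raised the signal. The sketch below is only an analogy using Python’s standard `signal` module on a Unix system, not ROOT’s implementation. SIGUSR1 stands in for SIGSEGV, and `report` is a made-up handler name.

```python
import os
import signal
import time

seen = []


def report(signum, frame):
    """Process-wide handler: records the signal whatever code triggered it."""
    seen.append(signum)


# Install the handler once, from the main thread.
previous = signal.signal(signal.SIGUSR1, report)

# Any code in the process (or outside it) can now trigger the handler.
os.kill(os.getpid(), signal.SIGUSR1)

# Give the interpreter a moment to run the Python-level handler.
deadline = time.time() + 1.0
while not seen and time.time() < deadline:
    time.sleep(0.01)

# Restore the earlier handler if it was one Python knows about.
if previous is not None:
    signal.signal(signal.SIGUSR1, previous)

print(seen)
```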

swunsch · February 12, 2020, 8:58am

Thanks for clarifying, I wasn’t aware that we catch everything.
