ROOT crashes in multi-threaded environment


ROOT Version: 6.18/00
Platform: Ubuntu 18.04 64bit
Compiler: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0


Hello!
I am trying to write a multithreaded data analysis tool using ROOT. I am at a point where everything works fine when running as a single thread, but things break down when there are two or more threads running in parallel.

Each thread reads from and writes to its own ROOT files, so in principle the threads should not interfere with one another. This does not seem to be the case, because the application crashes every time.

I was not able to single out the section of the code responsible for the crash, but I have written a minimal example that reproduces the issue (see post 3).

What I am asking for is some feedback on how I could troubleshoot the problem and on the best way to deal with it.

What I am trying to do:

  1. First of all, I enable multithreading in ROOT by calling ROOT::EnableThreadSafety().
  2. Then each thread opens a different ROOT file (TFile) containing many TH1I histograms. The names of the histograms are the same across all the files (this may be the root of the issue).
  3. Then I get the histogram pointers using the Get method of the TFile class.
  4. Then I fit the histograms using the Fit method of the TH1I class.
  5. To fit the histograms I use TF1 objects.
  6. Finally I close all the files.

I have already tried moving the histograms to gDirectory = 0 (globally AND/OR one by one) without any improvement.

Your use case looks somewhat similar to this:

but you are reading the histos from the file, instead of creating them and filling them with the contents of the tree.

Could you try to add the line that appears in the test above:

// Don't link histos to a particular TDirectory
TH1::AddDirectory(false);

However, I suspect the problem is related to exactly that: you get a crash when the histograms are destroyed in a multi-threaded environment, since they belong to the files.

Thank you very much for your prompt reply: I will try your fix.

In the meantime, I was able to reproduce the issue in a minimal example. Notice that I use ctypes and a Python script to spawn the threads. This is not accidental: I use the very same architecture in my application because of our particular software framework.

test.zip (2.0 KB)
You can run the example simply by:

  • make
  • python
  • >>> import test
  • >>> test()

The code structure mimics the one of the actual program. Of course it does not make sense for such a simple script … but it is just to reproduce my issue as accurately as I can.

I tried adding TH1::AddDirectory(false); just after ROOT::EnableThreadSafety(); but the application still crashes (with a different but still cryptic segfault message).

> you get a crash when the histograms are destroyed in a multi-threaded environment since they belong to the files.

I think you are on the right track, because most of the time I get a “free invalid pointer” kind of error.

OK, let’s try something to find out whether the histograms are the issue here: in your code, can you keep the opening of the file but comment out the retrieval of the histograms (and, consequently, the fit)? That way every thread opens a file and creates a TF1, but no histograms are involved.

Do you still see any error? Can you share here the stack trace you get?

Thank you for the hint. I commented out the getting and fitting of the histograms, and the program does not crash. If I reintroduce the getting of the histograms, the program still does not crash. But if I try to fit the histograms with

test_hist->Fit(gaussian);

I get a segmentation fault (without any stack trace):

 *** Break *** segmentation violation

 *** Break *** segmentation violation

Process Python exited abnormally with code 139

If I don’t free the TF1 object by commenting out the line

delete gaussian;

I get a crash with the following stack trace:
stack trace.txt (21.5 KB)

The problem seems to be the fit itself, but the reason for the crash is still beyond me …

So the issue is the fitting? It may be that the fitting is not thread-safe (@moneta can you comment on this?).

The error file you sent seems to point to a crash when creating the Minuit2 minimizer.

Also, if you get the TH1 from the file, you should not delete them yourself (they belong to the file, so you would cause a double delete).
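
To illustrate the ownership rule just mentioned, here is a minimal sketch that uses only the C++ standard library. `Histo` and `OwningFile` are hypothetical stand-ins invented for this illustration, not ROOT classes; they only mimic the idea that the file owns the objects it hands out, so the caller must not delete them.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins, not ROOT classes: `Histo` plays the role of a
// histogram and `OwningFile` the role of a file that owns what it hands out.
struct Histo {
    static int destroyed;  // counts destructor calls
    ~Histo() { ++destroyed; }
};
int Histo::destroyed = 0;

class OwningFile {
public:
    // Returns a non-owning pointer; the file keeps ownership of the object.
    Histo* Get(const std::string& name) {
        auto& slot = objects_[name];
        if (!slot) slot = std::make_unique<Histo>();
        return slot.get();
    }

private:
    std::map<std::string, std::unique_ptr<Histo>> objects_;
};

// Uses the pointer without deleting it and reports how many histograms
// were destroyed once the owning file went out of scope.
inline int destroyed_after_close() {
    Histo::destroyed = 0;
    {
        OwningFile file;
        Histo* h = file.Get("test_hist");
        (void)h;  // use h here; do NOT `delete h`, the owner will do it
    }
    return Histo::destroyed;
}
```

In this sketch the object is destroyed exactly once, by its owner. Adding a `delete h` inside the inner scope would destroy it a second time, which is the kind of double delete that typically shows up as a "free invalid pointer" error.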

> Also, if you get the TH1 from the file, you should not delete them yourself (they belong to the file, so you would cause a double delete).

Sorry, there was a typo. I commented out the line to free the TF1 object. I am not trying to free the histograms.

Hi,

I think TMinuit is not thread-safe. You should use Minuit2 instead. Build ROOT with Minuit2 support (-Dminuit2=On) and add this line to your program:
ROOT::Math::MinimizerOptions::SetDefaultMinimizer("Minuit2");

Then, in principle, the fitting should be protected by locks and should work with multiple threads.

See these past issues that have been solved:

https://sft.its.cern.ch/jira/browse/ROOT-9169
https://sft.its.cern.ch/jira/browse/ROOT-7173

Lorenzo

Dear Lorenzo,
thank you for the quick reply. I did as you suggested and compiled ROOT 6.18.00 with Minuit2 support. In a single-threaded application the fit correctly uses Minuit2, and this is the output of my test program:

>>> test()
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1

****************************************
Minimizer is Minuit2 / Migrad
Chi2                      =      68.4773
NDf                       =           97
Edm                       =   4.6045e-09
NCalls                    =           63
p0                        =      601.014   +/-   4.73414     
p1                        =    0.0001709   +/-   0.00637732  
p2                        =     0.992809   +/-   0.00475785  
Peak = 601.014
Mean = 0.0001709
Sigma = 0.992809

But if I try to create more than one thread, the Minuit2 minimizer is no longer used and the fitter falls back to Minuit:

>>> test()
Warning in <ROOT::Math::FitConfig::CreateMinimizer>: Could not create the Minuit2 minimizer. Try using the minimizer Minuit
Error in <ROOT::Math::FitConfig::CreateMinimizer>: Could not create the Minuit2 minimizer
Error in <ROOT::Math::Fitter::FitFCN>: Minimizer cannot be created
Warning in <Fit>: Abnormal termination of minimization.
fatal error: malformed or corrupted AST file: 'AST record has invalid code'
terminate called after throwing an instance of 'std::runtime_error'
  what():  >>> Interpreter compilation error:
Invalid abbrev number

Process Python terminated (core dumped)

I have tried to load libMinuit2 manually, as suggested in one of the issues that you linked:

gSystem->Load("libMinuit2");

but the error does not change.

Just in case, I attach the latest version of my minimal example:
test.zip (2.0 KB)

@etejedor @moneta
Do you think this misbehavior is a ROOT bug, or am I missing something?
If you think it is a ROOT bug, do you want me to write a smaller/simpler example and open a bug report on the ROOT bug tracker?

It may be that the 7173 bug is not completely fixed or not fixed for all the possible cases. By the way, the macro provided in the description of that bug report is crashing in my ROOT.

PS: for the time being I am getting by with a mutex around the section where the fitting happens, so that only one thread accesses the Minuit2 fitter at a time.
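
A minimal sketch of that kind of workaround, using only the C++ standard library. The names `fit_mutex`, `guarded_fit` and `run_guarded_demo` are hypothetical and chosen for this illustration; the callable passed to `guarded_fit` stands for whatever performs the fit in the real program (for example a lambda wrapping `test_hist->Fit(gaussian)`), and no ROOT call is made here.

```cpp
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical process-wide lock shared by every worker thread.
inline std::mutex& fit_mutex() {
    static std::mutex m;
    return m;
}

// Runs `fit` while holding the lock, so at most one thread is inside the
// (assumed non-thread-safe) fitting code at any time.
template <typename Fit>
auto guarded_fit(Fit&& fit) -> decltype(fit()) {
    std::lock_guard<std::mutex> lock(fit_mutex());
    return std::forward<Fit>(fit)();
}

// Stand-in driver: each thread calls a dummy "fit" that increments a plain
// (non-atomic) counter; the lock is what keeps the updates race-free.
inline int run_guarded_demo(int n_threads, int fits_per_thread) {
    int completed = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&completed, fits_per_thread] {
            for (int i = 0; i < fits_per_thread; ++i) {
                guarded_fit([&completed] { ++completed; });
            }
        });
    }
    for (auto& w : workers) w.join();
    return completed;
}
```

The trade-off of this design is that the fits run one at a time, so only the work outside the guarded section (reading files, preparing histograms) actually proceeds in parallel.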

@moneta is it the expected behaviour that when the user activates multi-threading the Minuit2 minimizer is not used, even if selected?

No, this is not the expected behaviour. I will investigate the reason

Hi,
I cannot reproduce the problem. Using the macro shown in the 7173 bug, which is attached here, it works fine for me with Minuit2 but not with TMinuit:

.L bug_7173.C
a(0) // run Minuit2 (works fine)
a(1) // run TMinuit (it crashes sometimes)

Can you confirm that it does not work with Minuit2?

However, I cannot run your example code.

Lorenzo

bug_7173.C (1.6 KB)

Hi Lorenzo,
I think that the crashes that I was experiencing until now were caused by something external to ROOT (maybe something wrong with my system).

As a matter of fact, today I tried to run your bug_7173.C again and I got the very same behavior as you describe (Minuit2 works fine and TMinuit sometimes crashes). Then I tried to run my sample code and my original application and they run fine too (when using Minuit2).

I feel a little embarrassed because I have no idea what was interfering with ROOT until now and what has changed since yesterday. If and when I have some clue, I will let you know.

Many thanks to you and etejedor for your help and patience.
Thanks again and greetings from Japan

Good! I am happy it works.
Thank you and best regards
Lorenzo
