Cling JIT session error: Cannot allocate memory

Hello,

I am using a common ATLAS wrapper to generate the workspace for a very large fit within ROOT, via RooStats, HistFactory, etc. I am using ROOT version 6.28/04. The machine I am running on is very beefy (250 GB RAM), but when building the workspace I receive the error below once the memory usage reaches just over 6 GB.

Is this a known issue, or is there something I can do to mitigate it? The error refers to cling, but perhaps this is actually an issue in RooFit or HistFactory, or some technical limitation of my machine that I have not thought of. I have tried running the following command to remove memory limitations for single processes, but with no success: ulimit -S -s unlimited
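
As far as I can tell, ulimit -s only changes the stack-size limit, so maybe that is not the relevant one here. A minimal standalone sketch (plain POSIX getrlimit, nothing ROOT-specific) for checking which limits are actually in effect for a process would be something like:

```cpp
// check_limits.cxx -- sketch: print the soft/hard resource limits that are
// usually relevant here (plain POSIX, nothing ROOT-specific).
#include <sys/resource.h>
#include <cstdio>

static void printLimit(const char *name, int resource)
{
   rlimit rl{};
   if (getrlimit(resource, &rl) == 0) {
      // RLIM_INFINITY shows up as a very large number here
      std::printf("%-12s soft=%llu hard=%llu\n", name,
                  (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
   }
}

int main()
{
   printLimit("RLIMIT_STACK", RLIMIT_STACK); // what `ulimit -s` changes
   printLimit("RLIMIT_AS", RLIMIT_AS);       // total virtual address space
   printLimit("RLIMIT_DATA", RLIMIT_DATA);   // data segment / heap
   return 0;
}
```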

Best regards,
Jacob


cling JIT session error: Cannot allocate memory
[#0] ERROR:ObjectHandling -- RooFactoryWSTool::createArg() ERROR in CINT constructor call to create object
[#0] ERROR:ObjectHandling -- RooFactoryWSTool::processExpression() ERRORS detected, transaction to workspace aborted, no objects committed

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00002b63260cf60c in waitpid () from /lib64/libc.so.6
#1  0x00002b632604cf62 in do_system () from /lib64/libc.so.6
#2  0x00002b632282334c in TUnixSystem::StackTrace() () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libCore.so
#3  0x00002b6322820a65 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libCore.so
#4  <signal handler called>
#5  0x00002b63251a928b in RooAbsArg::setAttribute(char const*, bool) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libRooFitCore.so
#6  0x00002b632422dcc3 in RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeSingleChannelWorkspace(RooStats::HistFactory::Measurement&, RooStats::HistFactory::Channel&) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libHistFactory.so
#7  0x00002b6324236164 in RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeSingleChannelModel(RooStats::HistFactory::Measurement&, RooStats::HistFactory::Channel&) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libHistFactory.so
#8  0x00002b6324240626 in RooStats::HistFactory::MakeModelAndMeasurementFast(RooStats::HistFactory::Measurement&, RooStats::HistFactory::HistoToWorkspaceFactoryFast::Configuration const&) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libHistFactory.so
#9  0x00002b63223c3103 in TRExFit::ToRooStats(bool) const () from /mnt/lustre/projects/epp/general/atlas/jjk31/EFT_Fitting/TRExFitter_GIT/TRExFitter/build/lib/libTRExFitter.so
#10 0x000000000040b8fa in FitExample(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#11 0x0000000000407036 in main ()
===========================================================


The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x00002b63251a928b in RooAbsArg::setAttribute(char const*, bool) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libRooFitCore.so
#6  0x00002b632422dcc3 in RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeSingleChannelWorkspace(RooStats::HistFactory::Measurement&, RooStats::HistFactory::Channel&) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libHistFactory.so
#7  0x00002b6324236164 in RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeSingleChannelModel(RooStats::HistFactory::Measurement&, RooStats::HistFactory::Channel&) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libHistFactory.so
#8  0x00002b6324240626 in RooStats::HistFactory::MakeModelAndMeasurementFast(RooStats::HistFactory::Measurement&, RooStats::HistFactory::HistoToWorkspaceFactoryFast::Configuration const&) () from /cvmfs/atlas.cern.ch/repo/sw/software/0.2/StatAnalysis/0.2.2/InstallArea/x86_64-centos7-gcc11-opt/lib/libHistFactory.so
#9  0x00002b63223c3103 in TRExFit::ToRooStats(bool) const () from /mnt/lustre/projects/epp/general/atlas/jjk31/EFT_Fitting/TRExFitter_GIT/TRExFitter/build/lib/libTRExFitter.so
#10 0x000000000040b8fa in FitExample(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#11 0x0000000000407036 in main ()
===========================================================

Welcome to the ROOT forum!
Maybe @vvassilev or @jonas can help.

Thank you, if anybody has any ideas about the memory limitation here I would be grateful! It seems to happen when the workspace gets “too big”.

I suggest that perhaps @jonas has a look, given that this could well be related to HistFactory.

I’m not an expert at all in the ROOT code, but I will try to dive into the relevant HistFactory functions based on the error above to see if I can work out exactly where the issue occurs.

Hi @JKempster, sorry for the late reply!

First the obvious question: can you maybe share as much of the code to produce this workspace as possible? I have no idea where this is coming from right now, and I would need some inputs to start investigating. A full reproducer would be best, but if you can’t provide that, it would already be good to know what the model looks like (i.e. how many channels, samples, and bins, and what kind of systematics).

Cheers,
Jonas

Hi @jonas ,

Thanks for the response - indeed I can’t really provide a lot of code, but I have done some investigations to work out exactly where the problem is occurring. If I understand correctly, when running a fit across multiple different ‘regions’, HistFactory first builds a workspace and model for each one individually (RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeSingleChannelWorkspace, RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeSingleChannelModel) and then runs a function to combine them together into a single model file (RooStats::HistFactory::HistoToWorkspaceFactoryFast::MakeCombinedModel).

If I reduce the size of the fit to run over only a small number of regions at a time, the workspace production is successful, but when I go above a certain number of regions it fails while producing the single model, with the error shown above.
I generated each smaller workspace individually and wrote a separate macro to pass them into the ‘MakeCombinedModel’ function, and that is successful. My macro crashes, however, when trying to also put all of the associated histograms from those model.root files into a single file alongside the RooWorkspace and Measurement objects.

I haven’t been able to quite track down how these histograms are manipulated/copied around in ROOT when combining models, but I have a suspicion that all of the associated histograms from each region might be read into a single vector and then written out to the final model file. If I try to do this process locally, the macro crashes with a segfault and no particularly helpful error.
However, if I instead do this one histogram at a time in a loop (read, write, read, write, read, write…), the process does not crash (and fortunately is still fast).
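
For illustration, the working one-at-a-time loop looks roughly like the sketch below (file names are placeholders, and I am assuming here that the objects of interest all derive from TH1):

```cpp
// copy_histograms.C -- rough sketch of the "read, write, read, write" loop.
// File names are placeholders; assumes the objects of interest derive from TH1.
#include "TFile.h"
#include "TH1.h"
#include "TKey.h"
#include <memory>

void copy_histograms()
{
   std::unique_ptr<TFile> fin(TFile::Open("region1_model.root", "READ"));
   std::unique_ptr<TFile> fout(TFile::Open("combined_model.root", "UPDATE"));

   // Copy one histogram at a time, so that only a single histogram
   // is held in memory at any moment.
   for (TObject *keyObj : *fin->GetListOfKeys()) {
      auto *key = static_cast<TKey *>(keyObj);
      std::unique_ptr<TObject> obj(key->ReadObj());
      if (auto *h = dynamic_cast<TH1 *>(obj.get())) {
         h->SetDirectory(nullptr); // decouple from any file before writing
         fout->cd();
         h->Write(); // write immediately ...
      }              // ... and free it again before reading the next one
   }
}
```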

This is therefore quite hypothetical, but is it possible that when HistFactory is trying to combine different workspaces (as it also builds them on the fly), it realises it is going to put too many histograms into memory at the same time and kills the process, even before actually showing that a large amount of memory is being used? I have run this on nodes with 256 GB RAM and had the same issue, so I wonder if it is some deeper architecture issue (does RAM have to be discretised into certain-sized stacks/chunks?). For the complexity of the fit I have been running, the number of histograms could be quite large (30 regions, 40 MC samples (many more files, but combined together), 10 bins per region).

Does any of that sound plausible?

Cheers,
Jacob


Hi!

Which ROOT version are you using? The problem you describe sounds like the one fixed by this PR:

The first release with that fix was ROOT 6.28/06. If you are using an older ROOT version, could you try to see if the problem is gone with this newer version?

In general, when reporting crashes, it is best practice to first check whether they are still there with ROOT master, which you can easily get from cvmfs.

Cheers,
Jonas

Hi Jonas,

Thank you for the response. I have been using ROOT 6.28/04 - so a newer version than the one mentioned above. I can also switch to ROOT master and see if that makes a difference, if it would be useful.

Cheers,
Jacob

Hi, if you’re using ROOT 6.28/04, then it doesn’t contain the fix for the excessive file opening. As I said in my reply, it was only fixed in 6.28/06.

So yes, it would be very useful if you could change to ROOT master and see if the problem persists!

Hi Jonas,

Ah! I apologise, I misread your message as “6.28/02”. I will try to update to the newest version and see if that resolves the issue. I will report back here.

Cheers,
Jacob

Good morning,

I ran the same test with ROOT 6.28/04 and unfortunately this did not resolve the issue; the stack trace remains the same.

Cheers,
Jacob

Hi @JKempster,

it’s expected that with 6.28/04 it is still the same. The fix for the file opening problem was only introduced in ROOT 6.28/06:

So to reiterate my question: what happens if you use 6.28/06 or newer?

Cheers,
Jonas

Hi Jonas,

My apologies for getting the versions confused (again). I can confirm that I have now run using ROOT 6.28/06 and unfortunately still see the same crash.

Best regards,
Jacob

FYI this is very likely cling jit can hit VMA limit · Issue #14156 · root-project/root · GitHub

Hi @Axel ,

Thank you for this! Indeed it looks like it could be the same underlying issue. In the other thread they were able to circumvent it by changing the way various calls were made in the code. In this case I think the code leading to the crash is directly part of HistFactory.

I am completely unfamiliar with VMAs etc. @jonas, do you know if particular lines in the functions I pointed to above in the crash could be generating VMAs and leading to this issue if too many are needed? Maybe I can inject some of the vmsize count code used in cling jit can hit VMA limit · Issue #14156 · root-project/root · GitHub into HistFactory and see if I can investigate a little further.
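
Something along these lines (just a sketch: count the lines of /proc/self/maps, which has one line per mapping, and compare against the kernel's vm.max_map_count limit) might already be enough to see whether the mapping count explodes during workspace building:

```cpp
// count_vmas.cxx -- sketch: count this process's memory mappings (VMAs)
// and compare against the kernel limit vm.max_map_count.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
   std::size_t nMaps = 0;
   std::ifstream maps("/proc/self/maps"); // one line per mapping
   for (std::string line; std::getline(maps, line);)
      ++nMaps;

   long maxMapCount = -1;
   std::ifstream limit("/proc/sys/vm/max_map_count");
   limit >> maxMapCount;

   std::cout << "VMAs in use: " << nMaps
             << " (limit vm.max_map_count = " << maxMapCount << ")\n";
   return 0;
}
```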

Cheers,
Jacob

Hi @JKempster,

it’s pretty clear where HistFactory calls the interpreter: in every call to RooWorkspace::factory(), because parsing the factory language and creating the corresponding object on the fly must be done with the JIT.
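
To illustrate with a generic example (not the actual HistFactory code): the factory() call below goes through the interpreter (and hence the JIT), whereas building the same model from explicit C++ objects and importing it does not:

```cpp
// Generic illustration (not the actual HistFactory code): the factory() call
// goes through the interpreter/JIT, the explicit construction does not.
#include "RooGaussian.h"
#include "RooRealVar.h"
#include "RooWorkspace.h"

void factory_vs_import()
{
   RooWorkspace w("w");

   // 1) Parsed by RooFactoryWSTool, object created via the interpreter:
   w.factory("Gaussian::g1(x[-10,10],mu[0,-5,5],sigma[1,0.1,10])");

   // 2) Pure C++, no interpreter involved:
   RooRealVar x("x", "x", -10, 10);
   RooRealVar mu("mu", "mu", 0, -5, 5);
   RooRealVar sigma("sigma", "sigma", 1, 0.1, 10);
   RooGaussian g2("g2", "g2", x, mu, sigma);
   w.import(g2);
}
```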

I suggest fixing this by avoiding all calls to RooWorkspace::factory() in HistFactory:

Once this PR is merged, you should not see your problems anymore with ROOT master.

How important is it for you that this also works in the 6.30 series? In that case I would have to backport the fix to the 6.30 branch for it to appear in the 6.30.04 release. But if you don’t mind waiting for 6.32.00, then I don’t need to do that (backports always carry a slight risk of breaking something).

Cheers,
Jonas

Hi @jonas,

Thank you for this! I look forward to testing the fix. Don’t worry about backporting into the 6.30 branch; I am happy to wait for 6.32.00 :slight_smile:

Cheers,
Jacob