Memory leak check

Dear experts,

I have a task that has to be run in a controlled environment where the CPU and memory resources are limited. Even though my task runs perfectly locally, it does not pass the test on the infrastructure where it should run over large datasets.
The first thing I did was to check my task for memory leaks, so I used valgrind.

Here is the output:
==196513== Memcheck, a memory error detector
==196513== Copyright (C) 2002-2017, and GNU GPL’d, by Julian Seward et al.
==196513== Using Valgrind-3.18.0.GIT and LibVEX; rerun with -h for copyright info
==196513== Command: root -b -q runV2.C
==196513== Parent PID: 41867
==196513==
==196513==
==196513== HEAP SUMMARY:
==196513== in use at exit: 70 bytes in 1 blocks
==196513== total heap usage: 4 allocs, 3 frees, 72,914 bytes allocated
==196513==
==196513== 70 bytes in 1 blocks are still reachable in loss record 1 of 1
==196513== at 0x4840217: operator new[](unsigned long) (vg_replace_malloc.c:579)
==196513== by 0x403849: SetRootSys (rootx.cxx:143)
==196513== by 0x403849: main (rootx.cxx:297)
==196513==
==196513== LEAK SUMMARY:
==196513== definitely lost: 0 bytes in 0 blocks
==196513== indirectly lost: 0 bytes in 0 blocks
==196513== possibly lost: 0 bytes in 0 blocks
==196513== still reachable: 70 bytes in 1 blocks
==196513== suppressed: 0 bytes in 0 blocks
==196513==
==196513== For lists of detected and suppressed errors, rerun with: -s
==196513== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

From this I understand that I really do have a memory leak in my task; however, I do not understand where … Can you give me a hint about what I should do next? Any other suggestions would be very welcome!

Thank you in advance!

Try:

valgrind --suppressions=`root-config --etcdir`/valgrind-root.supp `root-config --bindir`/root.exe -b -n -q -l 'runV2.C++g'

Thank you for your reply! I got a lot of errors at compilation. The ones related to including header files I was able to solve by myself, but there are some errors that I cannot solve.
In my macro I call other macros:

#ifdef __CLING__
#include </home/.../AddTask.C>
#endif
void runV2()
{
AddTask();
}

The error that I get: error: ‘AddTask’ was not declared in this scope; did you mean ‘AliAnalysisTask’?

Do you know how I should solve these errors so that I can include my AddTask macro as it is?

Remove this “#ifdef” condition (it will not trigger when compiling, but you always need this “#include”).

Actually, you could probably also use:
#if defined(__CLING__) || defined(__ACLIC__)
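
To see which branch of such a guard is actually taken, a small diagnostic can help. This is only a sketch: the function name macroMode is made up for illustration, and it assumes (as in ROOT 6) that __CLING__ is defined while the interpreter processes the file and __ACLIC__ is defined while ACLiC compiles it.

```cpp
#include <string>

// Hypothetical helper: reports how this file is being processed.
// __ACLIC__ is checked first because it is the more specific case.
std::string macroMode()
{
#if defined(__ACLIC__)
   return "ACLiC";          // compiled via 'macro.C+' or 'macro.C++g'
#elif defined(__CLING__)
   return "Cling";          // interpreted by the ROOT prompt
#else
   return "plain compiler"; // neither macro defined, e.g. a direct g++ build
#endif
}
```

Printing macroMode() from runV2() once with 'runV2.C' and once with 'runV2.C++g' should show why an "#ifdef __CLING__" block disappears in the compiled case.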

Thank you! It worked!
Here is the output of the valgrind command that you sent me earlier:

==1763894== 
==1763894== HEAP SUMMARY:
==1763894==     in use at exit: 354,755,886 bytes in 566,714 blocks
==1763894==   total heap usage: 3,708,852 allocs, 3,142,138 frees, 4,459,387,290 bytes allocated
==1763894== 
==1763894== LEAK SUMMARY:
==1763894==    definitely lost: 3,730 bytes in 54 blocks
==1763894==    indirectly lost: 307,505 bytes in 1,906 blocks
==1763894==      possibly lost: 186,743 bytes in 691 blocks
==1763894==    still reachable: 347,848,718 bytes in 520,053 blocks
==1763894==                       of which reachable via heuristic:
==1763894==                         newarray           : 14,780,608 bytes in 828 blocks
==1763894==                         multipleinheritance: 5,408 bytes in 15 blocks
==1763894==         suppressed: 6,409,190 bytes in 44,010 blocks
==1763894== Rerun with --leak-check=full to see details of leaked memory
==1763894== 
==1763894== Use --track-origins=yes to see where uninitialised values come from
==1763894== For lists of detected and suppressed errors, rerun with: -s
==1763894== ERROR SUMMARY: 1580895 errors from 1000 contexts (suppressed: 23043 from 390)

What can you tell from it?

Nothing really suspicious.

Try:

valgrind --tool=helgrind --suppressions=`root-config --etcdir`/helgrind-root.supp ...

BTW. Can it be that you have some “strong” limits set in your remote “controlled environment”?
Try (assuming you use “bash”):
ulimit -H -a
ulimit -S -a

This is the output at the end of the execution:

==1765664== 
==1765664== Use --history-level=approx or =none to gain increased speed, at
==1765664== the cost of reduced accuracy of conflicting-access information
==1765664== For lists of detected and suppressed errors, rerun with: -s
==1765664== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)

But there are also some lines printed in the middle of the macro execution:

==1765664== ---Thread-Announcement------------------------------------------
==1765664== 
==1765664== Thread #1 is the program's root thread
==1765664== 
==1765664== ----------------------------------------------------------------
==1765664== 
==1765664== Thread #1: lock order "0x57A0340 before 0x402E968" violated
==1765664== 
==1765664== Observed (incorrect) order is: acquisition of lock at 0x402E968
==1765664==    at 0x4846563: mutex_lock_WRK (hg_intercepts.c:918)
==1765664==    by 0x484A460: pthread_mutex_lock (hg_intercepts.c:934)
==1765664==    by 0x4015592: _dl_open (dl-open.c:786)
==1765664==    by 0x537634B: dlopen_doit (dlopen.c:66)
==1765664==    by 0x52CA837: _dl_catch_exception (dl-error-skeleton.c:208)
==1765664==    by 0x52CA902: _dl_catch_error (dl-error-skeleton.c:227)
==1765664==    by 0x5376B58: _dlerror_run (dlerror.c:170)
==1765664==    by 0x53763D9: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==1765664==    by 0x63F2C34: cling::utils::platform::DLOpen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (in /home/amelia/alice/sw/ubuntu2004_x86-64/ROOT/v6-24-06-local1/lib/libCling.so.6.24.06)
==1765664==    by 0x62DABCE: cling::DynamicLibraryManager::loadLibrary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) (in /home/amelia/alice/sw/ubuntu2004_x86-64/ROOT/v6-24-06-local1/lib/libCling.so.6.24.06)
==1765664==    by 0x62E7BC7: cling::Interpreter::loadFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, cling::Transaction**) (in /home/amelia/alice/sw/ubuntu2004_x86-64/ROOT/v6-24-06-local1/lib/libCling.so.6.24.06)
==1765664==    by 0x63D777A: cling::MetaSema::actOnLCommand(llvm::StringRef, cling::Transaction**) (in /home/amelia/alice/sw/ubuntu2004_x86-64/ROOT/v6-24-06-local1/lib/libCling.so.6.24.06)
==1765664== 
==1765664==  followed by a later acquisition of lock at 0x57A0340
==1765664==    at 0x4846563: mutex_lock_WRK (hg_intercepts.c:918)
==1765664==    by 0x484A460: pthread_mutex_lock (hg_intercepts.c:934)
==1765664==    by 0x579A409: _nss_compat_getpwuid_r (compat-pwd.c:1052)
==1765664==    by 0x524C332: getpwuid_r@@GLIBC_2.2.5 (getXXbyYY_r.c:315)
==1765664==    by 0x421652DD: XrdCl::DefaultEnv::DefaultEnv() (in /home/amelia/alice/sw/ubuntu2004_x86-64/XRootD/v5.4.0-2/lib/libXrdCl.so.3.0.0)
==1765664==    by 0x42166327: XrdCl::DefaultEnv::Initialize() (in /home/amelia/alice/sw/ubuntu2004_x86-64/XRootD/v5.4.0-2/lib/libXrdCl.so.3.0.0)
==1765664==    by 0x420DA2DC: __static_initialization_and_destruction_0(int, int) [clone .constprop.0] (in /home/amelia/alice/sw/ubuntu2004_x86-64/XRootD/v5.4.0-2/lib/libXrdCl.so.3.0.0)
==1765664==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==1765664==    by 0x4011C90: call_init (dl-init.c:30)
==1765664==    by 0x4011C90: _dl_init (dl-init.c:119)
==1765664==    by 0x52CA894: _dl_catch_exception (dl-error-skeleton.c:182)
==1765664==    by 0x401642C: dl_open_worker (dl-open.c:758)
==1765664==    by 0x52CA837: _dl_catch_exception (dl-error-skeleton.c:208)
==1765664== 
==1765664== Required order was established by acquisition of lock at 0x57A0340
==1765664==    at 0x4846563: mutex_lock_WRK (hg_intercepts.c:918)
==1765664==    by 0x484A460: pthread_mutex_lock (hg_intercepts.c:934)
==1765664==    by 0x579A409: _nss_compat_getpwuid_r (compat-pwd.c:1052)
==1765664==    by 0x524C332: getpwuid_r@@GLIBC_2.2.5 (getXXbyYY_r.c:315)
==1765664==    by 0x524B9FA: getpwuid (getXXbyYY.c:135)
==1765664==    by 0x4B787B3: UnixHomedirectory (TUnixSystem.cxx:3938)
==1765664==    by 0x4B787B3: TUnixSystem::UnixHomedirectory(char const*, char*, char*) (TUnixSystem.cxx:3925)
==1765664==    by 0x4A28F7D: TROOT::InitSystem() (TROOT.cxx:1937)
==1765664==    by 0x4A29595: TROOT::TROOT(char const*, char const*, void (**)()) (TROOT.cxx:667)
==1765664==    by 0x4A2B333: TROOTAllocator (TROOT.cxx:334)
==1765664==    by 0x4A2B333: ROOT::Internal::GetROOT1() (TROOT.cxx:376)
==1765664==    by 0x4A1E66F: __static_initialization_and_destruction_0 (TROOT.cxx:584)
==1765664==    by 0x4A1E66F: _GLOBAL__sub_I_TROOT.cxx (TROOT.cxx:3125)
==1765664==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==1765664==    by 0x4011C90: call_init (dl-init.c:30)
==1765664==    by 0x4011C90: _dl_init (dl-init.c:119)
==1765664== 
==1765664==  followed by a later acquisition of lock at 0x402E968
==1765664==    at 0x4846563: mutex_lock_WRK (hg_intercepts.c:918)
==1765664==    by 0x484A460: pthread_mutex_lock (hg_intercepts.c:934)
==1765664==    by 0x4015592: _dl_open (dl-open.c:786)
==1765664==    by 0x52C97E0: do_dlopen (dl-libc.c:96)
==1765664==    by 0x52CA837: _dl_catch_exception (dl-error-skeleton.c:208)
==1765664==    by 0x52CA902: _dl_catch_error (dl-error-skeleton.c:227)
==1765664==    by 0x52C9914: dlerror_run (dl-libc.c:46)
==1765664==    by 0x52C9914: __libc_dlopen_mode (dl-libc.c:195)
==1765664==    by 0x52AD5CB: nss_load_library (nsswitch.c:359)
==1765664==    by 0x52ADE78: __nss_lookup_function (nsswitch.c:467)
==1765664==    by 0x57986EA: init_nss_interface (compat-pwd.c:93)
==1765664==    by 0x57986EA: init_nss_interface (compat-pwd.c:89)
==1765664==    by 0x579A634: _nss_compat_getpwuid_r (compat-pwd.c:1055)
==1765664==    by 0x524C332: getpwuid_r@@GLIBC_2.2.5 (getXXbyYY_r.c:315)
==1765664== 
==1765664==  Lock at 0x57A0340 was first observed
==1765664==    at 0x4846563: mutex_lock_WRK (hg_intercepts.c:918)
==1765664==    by 0x484A460: pthread_mutex_lock (hg_intercepts.c:934)
==1765664==    by 0x579A409: _nss_compat_getpwuid_r (compat-pwd.c:1052)
==1765664==    by 0x524C332: getpwuid_r@@GLIBC_2.2.5 (getXXbyYY_r.c:315)
==1765664==    by 0x524B9FA: getpwuid (getXXbyYY.c:135)
==1765664==    by 0x4B787B3: UnixHomedirectory (TUnixSystem.cxx:3938)
==1765664==    by 0x4B787B3: TUnixSystem::UnixHomedirectory(char const*, char*, char*) (TUnixSystem.cxx:3925)
==1765664==    by 0x4A28F7D: TROOT::InitSystem() (TROOT.cxx:1937)
==1765664==    by 0x4A29595: TROOT::TROOT(char const*, char const*, void (**)()) (TROOT.cxx:667)
==1765664==    by 0x4A2B333: TROOTAllocator (TROOT.cxx:334)
==1765664==    by 0x4A2B333: ROOT::Internal::GetROOT1() (TROOT.cxx:376)
==1765664==    by 0x4A1E66F: __static_initialization_and_destruction_0 (TROOT.cxx:584)
==1765664==    by 0x4A1E66F: _GLOBAL__sub_I_TROOT.cxx (TROOT.cxx:3125)
==1765664==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==1765664==    by 0x4011C90: call_init (dl-init.c:30)
==1765664==    by 0x4011C90: _dl_init (dl-init.c:119)
==1765664==  Address 0x57a0340 is 0 bytes inside data symbol "lock"
==1765664== 
==1765664==  Lock at 0x402E968 was first observed
==1765664==    at 0x4846563: mutex_lock_WRK (hg_intercepts.c:918)
==1765664==    by 0x484A460: pthread_mutex_lock (hg_intercepts.c:934)
==1765664==    by 0x52C93B7: _dl_addr (dl-addr.c:131)
==1765664==    by 0x4B7349D: SetRootSys() (TUnixSystem.cxx:465)
==1765664==    by 0x4B7CD47: Init (TUnixSystem.cxx:612)
==1765664==    by 0x4B7CD47: TUnixSystem::Init() (TUnixSystem.cxx:583)
==1765664==    by 0x4A28F67: TROOT::InitSystem() (TROOT.cxx:1934)
==1765664==    by 0x4A29595: TROOT::TROOT(char const*, char const*, void (**)()) (TROOT.cxx:667)
==1765664==    by 0x4A2B333: TROOTAllocator (TROOT.cxx:334)
==1765664==    by 0x4A2B333: ROOT::Internal::GetROOT1() (TROOT.cxx:376)
==1765664==    by 0x4A1E66F: __static_initialization_and_destruction_0 (TROOT.cxx:584)
==1765664==    by 0x4A1E66F: _GLOBAL__sub_I_TROOT.cxx (TROOT.cxx:3125)
==1765664==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==1765664==    by 0x4011C90: call_init (dl-init.c:30)
==1765664==    by 0x4011C90: _dl_init (dl-init.c:119)
==1765664==    by 0x4001139: ??? (in /lib/x86_64-linux-gnu/ld-2.31.so)
==1765664==  Address 0x402e968 is 2312 bytes inside data symbol "_rtld_local"
==1765664== 
==1765664== 

(If I understand correctly) I cannot touch the environment in which my task has to run. All I can do is prepare the code locally and test it on a couple of data files, and then I have to upload the final version of my task so that it is run automatically on large datasets.

Well, your valgrind outputs do not really show anything wrong in your code.
(@Ailema “ERROR SUMMARY: 2 errors from 1 contexts” … Do you show the output for both errors?)
(@Axel Are you aware of the two reported “lock order violations”?)

Can you run your application on your remote host with the same set of “a couple of data files” as on your local machine?

On both machines, check how much RAM your application uses while running (e.g., run “top -d 1” and observe “VIRT”, “RES”, “SHR”).
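
If you cannot log in to the remote host to run "top", the task can log its own peak memory instead. The following is a minimal sketch, assuming Linux (where getrusage reports ru_maxrss in kilobytes); the function names peakRssKb and logPeakRss are made up for illustration.

```cpp
#include <cstdio>
#include <sys/resource.h>

// Peak resident set size of this process so far, in kilobytes on Linux.
// Returns -1 if getrusage fails.
long peakRssKb()
{
   struct rusage usage;
   if (getrusage(RUSAGE_SELF, &usage) != 0)
      return -1;
   return usage.ru_maxrss;
}

// Print one log line; 'tag' is a free-form label chosen by the caller.
void logPeakRss(const char *tag)
{
   std::printf("[%s] peak RSS: %ld kB\n", tag, peakRssKb());
}
```

Calling logPeakRss at the start, every few thousand events, and at the end of the job gives numbers that can be compared between the local and the remote run, roughly corresponding to the highest "RES" value that "top" would have shown.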

Compare the “ulimit” outputs from your local and remote machines (see “man bash” for details). Note that, as an “ordinary” user, you can change -S “soft limits”, but you cannot increase -H “hard limits”.
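
The same limits can also be read from inside the job with the POSIX getrlimit call, which may be the only option when the remote shell is not accessible. This is a sketch with made-up helper names (formatLimit, printLimit, printMemoryLimits). Note that getrlimit reports these values in bytes, whereas bash "ulimit" shows -v, -d and -s in kilobytes.

```cpp
#include <cstdio>
#include <string>
#include <sys/resource.h>

// Format one limit value; RLIM_INFINITY means "unlimited".
std::string formatLimit(rlim_t value)
{
   if (value == RLIM_INFINITY)
      return "unlimited";
   return std::to_string(static_cast<unsigned long long>(value));
}

// Print the soft (rlim_cur) and hard (rlim_max) value of one resource.
// Returns false if getrlimit fails.
bool printLimit(const char *name, int resource)
{
   struct rlimit lim;
   if (getrlimit(resource, &lim) != 0)
      return false;
   std::printf("%-12s soft=%s hard=%s\n", name,
               formatLimit(lim.rlim_cur).c_str(),
               formatLimit(lim.rlim_max).c_str());
   return true;
}

// Limits most relevant to memory usage.
bool printMemoryLimits()
{
   bool ok = printLimit("RLIMIT_AS", RLIMIT_AS);        // ulimit -v
   ok = printLimit("RLIMIT_DATA", RLIMIT_DATA) && ok;   // ulimit -d
   ok = printLimit("RLIMIT_STACK", RLIMIT_STACK) && ok; // ulimit -s
   return ok;
}
```

Calling printMemoryLimits() once at the beginning of the task, locally and remotely, shows whether the remote environment imposes a lower address-space or data-segment limit.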

Regarding the lock order issue: I really don’t know, and this seems to be deep inside libc. @pcanal do you agree that there’s likely nothing for us to fix?

Yes, it is somewhat concerning. It indeed involves glibc's getpwuid_r taking a lock and then indirectly calling dlopen, which takes its own locks (while xrootd and ROOT indirectly take the same getpwuid_r mutex in their static initialization, which is called from dlopen). We could improve the situation (at the cost of extra complexity in xrootd and ROOT) by delaying those lookups. Given the mitigation described in the next paragraph, it is unlikely to be ‘worth’ the investment.

The problem is mitigated by the fact that in practice most library loading is done during the initialization phase (of the user code) and thus in single-threaded mode (e.g. most libraries will be loaded during script loading).
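
For readers unfamiliar with this kind of report, here is a toy illustration of what a "lock order violation" means. It is not the glibc, xrootd or ROOT code: the two mutexes and all function names are stand-ins invented for this sketch. One code path takes the locks in one order, another path takes them in the opposite order. Run on a single thread this is harmless, which matches the report above (only Thread #1 is involved), but two threads running the two paths concurrently could deadlock.

```cpp
#include <mutex>

std::mutex passwdLock; // stand-in for the lock taken inside getpwuid_r
std::mutex loaderLock; // stand-in for the dynamic loader lock
int counter = 0;

// Order 1: passwd lock first, then loader lock
// (like getpwuid_r loading an NSS module via dlopen).
void lookupThenLoad()
{
   std::lock_guard<std::mutex> first(passwdLock);
   std::lock_guard<std::mutex> second(loaderLock);
   ++counter;
}

// Order 2: loader lock first, then passwd lock
// (like dlopen running a static initializer that calls getpwuid_r).
void loadThenLookup()
{
   std::lock_guard<std::mutex> first(loaderLock);
   std::lock_guard<std::mutex> second(passwdLock);
   ++counter;
}

// Both orders on one thread: no deadlock is possible here,
// but the inconsistent ordering is still observable.
int runBothOrders()
{
   counter = 0;
   lookupThenLoad();
   loadThenLookup();
   return counter;
}
```

I would expect helgrind to flag a program that calls runBothOrders() in the same way as above, since it records the order in which locks are acquired rather than waiting for an actual deadlock.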