
Memory leak relating to TFiles?

In an effort to save processing time in the following project:

test_map_pointing.zip (28.3 KB)

I created the files “makeAntSphericalCosineProductsFiles.cxx” and “makeDeltaTMapFiles.cxx” to generate TFiles containing histograms to be referenced by functions in the source file “src/eventCorrelator.cc”. However, when I run the binary built from “testMapPointing.cxx”, I can see with “htop” that memory usage increases over time (a memory leak).

I figured this was due to how I reference the TFiles, so in the functions that use them I tried explicitly deleting the pointers to the histograms at the end of those functions, and I also tried calling Close() on the TFiles there. Deleting the histograms made no difference to the memory usage. The “testMapPointing” binary also crashed when I added Close() to the functions “fillMapsPair()” and “fillFlatMapsPair()”. The Close() call I have in “getTotalPowerMaps()” does not appear to cause a crash, but it is hard to tell whether it helps, since there still appears to be a memory leak.

Still thinking that the memory leak was due to the handling of these TFiles, I then changed from using (TFile) objects to (TFile *) pointers and tried the suggestions found here. The “testMapPointing” binary still crashed, so I reverted to using (TFile) objects and commented out the lines in the functions “fillMapsPair()”, “fillFlatMapsPair()”, and “getTotalPowerMaps()” that were inspired by the link above.

With how the three functions mentioned above are currently written in “src/eventCorrelator.cc”, does anyone notice an obvious memory leak source? I don’t think it has to do with usage of OpenMP, because this leak still happens when I set “export OMP_NUM_THREADS=1”.
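One pattern that may be relevant here is ROOT's ownership rule: a histogram read from a TFile belongs to that file, and Close() deletes it, so any pointer used afterwards dangles, which would explain the crashes on Close(). The rule can be modeled in plain C++ (all names below are stand-ins for illustration, not ROOT classes):

```cpp
#include <memory>
#include <vector>

// Stand-in for a histogram.
struct Hist { int nbins = 180 * 100; };

// Toy model of ROOT's ownership rule: a file owns every histogram read
// from it and deletes them all on close(), unless a histogram is first
// detached (the analogue of TH1::SetDirectory(nullptr)).
class File {
    std::vector<std::unique_ptr<Hist>> fOwned;
public:
    Hist* get() {                           // like TFile::Get<TH2D>("name")
        fOwned.push_back(std::make_unique<Hist>());
        return fOwned.back().get();
    }
    Hist* detach(Hist* h) {                 // like h->SetDirectory(nullptr)
        for (auto it = fOwned.begin(); it != fOwned.end(); ++it)
            if (it->get() == h) { it->release(); fOwned.erase(it); break; }
        return h;                           // caller now owns h and must delete it
    }
    void close() { fOwned.clear(); }        // like TFile::Close(): deletes owned hists
};
```

Under this model, using a histogram after Close() without detaching it first is use-after-free, while detaching means you must delete it yourself or it leaks.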

I wish I could share more of the files needed to run the attached project, but they are too large to share here and I don’t have a cloud account at CERN to host them. What I do have in the zipped archive are zshenv and zshrc files, which set the paths used in this project; these depend on the projects that can be cloned from here.


ROOT Version: 6.18/00
Platform: Linux (CentOS Linux 7 (Core))
Compiler: gcc


Your zip file only contains two shell scripts and an empty directory. Can you provide a minimal reproducer, or a snapshot of the code, with the relevant part? Did you try to use valgrind on your project to detect the memory leaks? Something like:

valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ./yourprogram

My mistake, I forgot to add the “-r” flag when I zipped the archive together. I updated the initial post so that the code is included. Sorry about that.

I did initially try the following valgrind command:

valgrind --num-callers=30 --suppressions=$ROOTSYS/etc/valgrind-root.supp --log-file=valgrind-testMapPointing-part_54.log --leak-check=full --show-leak-kinds=all -v ./build/testMapPointing 54

But given that the version of valgrind I was using came precompiled, I eventually ended up with the following error at the end:

--69715:0: aspacem Valgrind: FATAL: VG_N_SEGMENTS is too low.
--69715:0: aspacem   Increase it and rebuild.  Exiting now.

I first tried a macro version of the function, and then a binary, but in both cases I got the output above. That is why I’m having a colleague look at it on their machines.

But if someone here could recognize the problem before then, that would be great!

@bellenot, so I tried valgrind again. It still reports running out of memory. With the command

valgrind --num-callers=30 --suppressions=$ROOTSYS/etc/valgrind-root.supp --log-file=valgrind-testMapPointing-part_54.log --leak-check=full --show-leak-kinds=all -v ./build/testMapPointing 54

I got the attached output:
valgrind-testMapPointing-part_54.txt (32.8 KB)

OK, thanks. Maybe @pcanal can take a closer look

Maybe I am too naive here, but have you tried declaring the function as

MakeXXX(TFile& file)

or

MakeXXX(TFile&&)

or inlining the function?

Also, I think you can return a TH2D directly from your functions and use TH1::SetDirectory to associate the TH2D with a file or directory (at least that is what I usually do when I need to shuffle histograms around and associate their writing with a TFile). Apologies if my comments don’t help in your case.

Why? How many files are there? Do you have TTrees? How many histograms? How many bins in total?

Thank you, @RENATO_QUAGLIANI, but the TFiles are referenced in functions found under src/eventCorrelator.cc. The binary testMapPointing refers to these functions, but doesn’t have TFile input of its own.

When I tried deleting the references to the files, I also tried when initially opening them to use TH1::SetDirectory as suggested here, but that didn’t seem to work?

Hi @pcanal,

To clarify your questions, the files being referenced are the ones produced by the binaries makeAntSphericalCosineProductsFiles and makeDeltaTMapFiles. Each of these produces two files containing TH2D histograms, not in TTrees. Two files are produced because one is more finely binned. The more coarsely-binned version of the files contains histograms of dimension 180 x 100 bins, while the more finely binned version of the files contains histograms of dimension 760 x 240 bins.

In the files produced by makeDeltaTMapFiles, there are 672 TH2D histograms. In the files produced by makeAntSphericalCosineProductsFiles, there are 768 TH2D histograms.

The sizes of the produced files are as follows:
85M deltaTMapCoarse.root
859M deltaTMapFine.root
46M antSphericalCosineProductsCoarse.root
449M antSphericalCosineProductsFine.root

Working with one of my collaborators to run a smaller job so that valgrind doesn’t run out of memory, we were able to glean the following output:

valgrind_results.zip (146.1 KB)

In an effort to expedite resolution to the memory leak problem, I have also uploaded the project to GitHub:

Updates from November 4th were motivated by the output from the valgrind log file. However, it looks like the changes didn’t help resolve the issue.

From the valgrind file, what suggestions do people have on how to resolve the issue?

Hi @jwruss,
did you run valgrind with --suppressions=$ROOTSYS/etc/valgrind-root.supp? It seems that many of the warnings come from the interpreter, but those are harmless and should be suppressed by the above suppression file.

Also there are several ROOT warnings and errors being printed, including missing input files and unwritable output files – although they should not cause memory leaks/hogging, it is often trickier to debug problems if they are entangled with other issues.

Most importantly: you should use valgrind --tool=massif rather than valgrind’s default tool (memcheck) to get a log of where memory is allocated.

Unfortunately without the input files we cannot try to reproduce the problem on our side. But maybe you can reproduce the problem running on the same file many times, and just share that one file?

Cheers,
Enrico

Hi Enrico,

Thank you for your response. When trying to run valgrind on my end, from the project directory I have tried the commands

valgrind --num-callers=30 --suppressions=$ROOTSYS/etc/valgrind-root.supp --log-file=valgrind-testMapPointing-part_54.log --leak-check=full --show-leak-kinds=all -v ./build/testMapPointing 54

and

valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --log-file=valgrind-testMapPointing-part_54.log ./build/testMapPointing 54

However, on the CentOS machines I use, valgrind is installed precompiled from a repository, and it ends up exiting saying it is “out of memory”.

I have a collaborator who doesn’t appear to have this problem with valgrind and is trying to help me diagnose it from their end. It seems they have been running valgrind as in the second command

valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --log-file=valgrind-testMapPointing-part_54.log ./build/testMapPointing 54


I have asked them to try running it like the first command to see if any more information can be gleaned. I have also just forwarded along your message. Hopefully our suggestions will help resolve this. If we figure anything out, I will post it here.

Best,
John

Hi Enrico,

Although not ideal, I have found a way to debug in parallel with my collaborator. I was able to download the relevant files to my work machine, but when it runs any of the above commands, that is about all the machine can handle. I have tried running it this way using the command

valgrind --num-callers=30 --suppressions=$ROOTSYS/etc/valgrind-root.supp --log-file=valgrind-testMapPointing-part_54.log --leak-check=full --show-leak-kinds=all -v ./build/testMapPointing 54

but the output is not as informative as I had hoped. I tried uploading a zipped version of it here for reference, but it was still too big. The problem is that the lines it reports as possible leak sources are too high level: they don’t point down into the functions called by other functions. For example:

==3717== 2,388,940 (9,664 direct, 2,379,276 indirect) bytes in 4 blocks are definitely lost in loss record 31,618 of 31,797
==3717==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3717==    by 0x50F078D: UCorrelator::fillStrategyWithKey(FilterStrategy*, char const*) (UCFilters.cc:74)
==3717==    by 0x486E613: makePeakUnnormalizedFlatInterferometricMap(int, double, double, AnitaPol::EAnitaPol, TString, int, int) (eventCorrelator.cc:1178)
==3717==    by 0x10CE9B: addPolTree(int, bool) (testMapPointing.cxx:110)
==3717==    by 0x10E341: main (testMapPointing.cxx:290)
==3717== 
==3717== 2,652,760 (9,664 direct, 2,643,096 indirect) bytes in 4 blocks are definitely lost in loss record 31,619 of 31,797
==3717==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3717==    by 0x50F078D: UCorrelator::fillStrategyWithKey(FilterStrategy*, char const*) (UCFilters.cc:74)
==3717==    by 0x486AF5C: makePeakUnnormalizedInterferometricMap(int, double, double, AnitaPol::EAnitaPol, bool, TString, int, int) (eventCorrelator.cc:743)
==3717==    by 0x10D0C6: addPolTree(int, bool) (testMapPointing.cxx:133)
==3717==    by 0x10E341: main (testMapPointing.cxx:290)
==3717== 

I am now looking to rerun the above command with --tool=massif added, but I don’t think that alone will trace the leaks further down to specific lines of code.

From this thread it looks like I might want to use --track-origins=yes, or from this thread I would want to use --trace-children=yes? They seem like they may also be applicable to wanting to see more information, but I’m not sure. Have you had experience using these flags before?

Hi,
valgrind seems fairly convinced that an allocation happening at UCFilters.cc:74 leaks. What happens at that line?

--track-origins=yes is useful in case of accesses to uninitialized memory, which is not your case.

--tool=massif is a different valgrind tool – rather than looking for leaks or bad memory accesses, it records where all memory allocations come from, and its report tells you exactly what fraction of the memory used was allocated by what line.

If you think there should be more…ROOT in those callstacks, maybe it would help to switch to a debug build of ROOT/your program.

Cheers,
Enrico

Hi Enrico,

I will bring up UCFilters.cc:74 with my collaborator who manages the repository this source file is in. It looks to be a `new` allocation that may never be deleted?
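If so, I wonder whether holding the allocation in a std::unique_ptr would fix it. A generic sketch of the pattern (stand-in names, not the actual UCFilters.cc code):

```cpp
#include <memory>

// Stand-in for the real class (FilterStrategy etc. are not shown here).
struct Strategy {
    static int liveCount;        // track live instances to demonstrate the leak
    Strategy()  { ++liveCount; }
    ~Strategy() { --liveCount; }
};
int Strategy::liveCount = 0;

// Leaky pattern: a fresh `new` on every call with no matching delete.
void fillLeaky() {
    Strategy* s = new Strategy;  // never deleted: leaks once per call
    (void)s;
}

// Fixed pattern: the unique_ptr frees the object when the function returns.
void fillFixed() {
    auto s = std::make_unique<Strategy>();
    (void)s;                     // destroyed automatically at end of scope
}
```

In the leaky version, an event loop that calls the function once per event accumulates one undeleted object per event, which matches the “definitely lost … in 4 blocks” pattern in the valgrind output.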

To switch to a debug build of ROOT/my program, I presume I will have to update the CMakeLists.txt in this repository. Any suggestions on how to update this file to do that?

Best,
John

Apart from UCFilters.cc:74, I bet there’s memory hogging, not a memory leak. If you’re lucky we can quickly diagnose this using the following trick:

echo 'Root.ObjectStat: 1' > .rootrc

Then edit the sources of your program such that it calls gObjectTable->Print() regularly (#include "TObjectTable.h"). Build it, run it, share the output!

Cheers, Axel.

And about debug symbols: for code you build with cmake, you need to change cmake .. (or the corresponding cmake configuration command) to cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo (or Debug instead of RelWithDebInfo if you want more “precise” debugging and don’t care about performance). You can also get debug builds of ROOT e.g. from the LCG releases on lxplus.

What @Axel suggests might be a good alternative to valgrind --tool=massif.

Cheers,
Enrico

Thank you for these suggestions! I will implement them and get back to you.

Best,
John

Hi Enrico,

I’m working on implementing @Axel’s suggestion, but I wanted to check with you if the debug setting has already been set in my CMakeLists.txt. Near the top of the file, I have the line set(CMAKE_BUILD_TYPE Debug). That should mean that when I run cmake .., it should be equivalent to cmake -DCMAKE_BUILD_TYPE=Debug ..?

Best,
John