File read cache fills memory


ROOT Version: 6.16.00
Platform: CentOS 7
Compiler: GCC 8.2.0


Hello,

I’m having an issue where, in the process of reading my files, they get cached in memory and wind up using all available memory. I’m only experiencing this on CentOS 7. I have tried on lxplus, on a local CentOS 7 machine (with both local and remote file access), and on a CentOS 7 VM managed by my IT department, all with the same result. Each file gets cached until I have no memory left and processing grinds to a halt.

I have also tried this on macOS and Windows/Ubuntu (WSL). The issue is not present on those machines.

I tried using tree->SetCacheSize(0), but that didn’t seem to have any effect.
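
For context, the pattern is roughly the following (a minimal sketch, not the attached macro itself; the file, tree, and function names here are placeholders):

// Minimal sketch (placeholder names): read a TTree with its read cache disabled.
#include "TFile.h"
#include "TTree.h"

void ReadNoCache(const char* fname = "data.root")
{
   TFile* f = TFile::Open(fname, "READ");
   TTree* tree = (TTree*)f->Get("tree");   // placeholder tree name
   tree->SetCacheSize(0);                  // disable the TTree read (basket) cache
   Long64_t nEntries = tree->GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i) {
      tree->GetEntry(i);                   // branches are read directly, without a TTreeCache
   }
   f->Close();
   delete f;
}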

I have attached a short script that replicates the problem using a data file in my public folder.

LeakMemory.cpp (986 Bytes)

Thank you for your help,
Chad Lantz

Try: LeakMemory.cpp (1.5 KB)

I ran this improved macro on a CentOS 7 / x86_64 / gcc 4.8.5 machine with ROOT 6.18/00 and 6.16/00 from CVMFS, and I don’t see any problem. My machine does not have AFS, so I downloaded your data file and then tried:

root -b -l -q 'LeakMemory.cpp("ZDCBeamTestRun171.root")'
root -b -l -q 'LeakMemory.cpp++("ZDCBeamTestRun171.root")'

I just ran it. This still produces the memory issue.

Can you try the standard ROOT version compiled with gcc 4.8.5 (on CentOS 7)?

Hi,
I just ran your macro on lxplus using the ROOT v6.16 release distributed with LCG (source /cvmfs/sft.cern.ch/lcg/views/LCG_95a/x86_64-centos7-gcc8-opt/setup.sh). That’s a build with gcc 8.2, so it should be pretty much the same as your setup.

Unfortunately I can’t reproduce the issue: memory usage as reported by the macro stays constant, although the numbers reported (28.4/28.6 GB) do not seem correct (time root -l -b -q LeakMemory.cpp reports a peak memory usage of 143616 kB).

I also made the macro compilable (just added a main that calls LeakMemory() and a couple of missing headers; a sketch of such a wrapper follows the summary below) and ran the executable, compiled with debug symbols, through valgrind (valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ./LeakMemory). This is the summary (no clear signs of memory leaks):

==28429== LEAK SUMMARY:
==28429==    definitely lost: 0 bytes in 0 blocks
==28429==    indirectly lost: 0 bytes in 0 blocks
==28429==      possibly lost: 384 bytes in 3 blocks
==28429==    still reachable: 47,185,133 bytes in 69,357 blocks
==28429==                       of which reachable via heuristic:
==28429==                         newarray           : 31,040 bytes in 40 blocks
==28429==                         multipleinheritance: 928 bytes in 2 blocks
==28429==         suppressed: 143,413 bytes in 1,385 blocks
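
For reference, a wrapper along these lines is enough (a sketch, assuming LeakMemory() has a default file argument; the exact headers to add depend on what the macro uses, and the compile command is just one common way to build against ROOT):

// Sketch: appended to LeakMemory.cpp so it can be built as a standalone executable, e.g.
//   g++ -g -O2 LeakMemory.cpp $(root-config --cflags --libs) -o LeakMemory
int main()
{
   LeakMemory();   // assumes the macro provides a default file name
   return 0;
}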

Can you provide step by step instructions to reproduce the issue on lxplus?

Cheers,
Enrico

Hi,

Sorry for the delay. It turns out the information in my header was wrong with regard to the compiler, at least on lxplus: I was originally loading a view compiled with GCC 6.2 (source /cvmfs/sft.cern.ch/lcg/views/setupViews.sh LCG_90 x86_64-centos7-gcc62-opt). My local version was compiled by my IT department with GCC 8.2.

I should note that I found this issue first on my local CentOS 7 machine, where the problem is more evident, but I found similar behavior on lxplus.

I use free -h to check current memory use, run my macro, then use free -h again. What I found is that buff/cache increases by my file size and free decreases by the same amount. On my local machine, once free gets close to zero my read speed slows dramatically. On lxplus there seems to be much more complex memory management in place, so the problem is less pronounced.

Here is an example from my local machine. You can see that the first time I run the macro my buff/cache increases, but the second time I run it there is no increase. I know that’s the cache doing its job, but the issue comes when it’s full: I have many of these files to process and I really don’t need them cached.

[clantz2@phenix-03 JZCaPA]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        860M        4.7G         46M        2.1G        6.3G
Swap:          4.0G          0B        4.0G
[clantz2@phenix-03 JZCaPA]$ root -l LeakMemory.cpp 
root [0] 
Processing LeakMemory.cpp...
Event   100, RAM: 3.0/ 7.6GB
Event 15600, RAM: 4.5/ 7.6GB
root [1]
root [1] .q
[clantz2@phenix-03 JZCaPA]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        862M        3.2G         46M        3.6G        6.3G
Swap:          4.0G          0B        4.0G
[clantz2@phenix-03 JZCaPA]$ root -l LeakMemory.cpp 
root [0] 
Processing LeakMemory.cpp...
Event 15600, RAM: 4.5/ 7.6GB
root [1] .q
[clantz2@phenix-03 JZCaPA]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        864M        3.2G         46M        3.6G        6.3G
Swap:          4.0G          0B        4.0G

I have reproduced the issue with your view, Enrico. First I had to keep logging into lxplus to find a machine that showed some free memory; it seems they are often fully utilized. Once I was on one with some free memory, I executed the following:

[clantz@lxplus725 ~]$ source /cvmfs/sft.cern.ch/lcg/views/LCG_95a/x86_64-centos7-gcc8-opt/setup.sh
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
(the same block of warnings is printed three more times)
[clantz@lxplus725 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            28G        7.4G         11G        193M        9.3G         20G
Swap:            9G        405M        9.6G
[clantz@lxplus725 ~]$ root -l LeakMemory.cpp
root [0]
Processing LeakMemory.cpp...
Event   100, RAM:17.2/28.6GB
Event 15500, RAM:19.3/28.6GB
root [1]
root [1] .q
[clantz@lxplus725 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            28G        7.3G        9.3G        193M         12G         20G
Swap:            9G        405M        9.6G
[clantz@lxplus725 ~]$ root -l LeakMemory.cpp
root [0]
Processing LeakMemory.cpp...
Event 15500, RAM:19.3/28.6GB
root [1] .q
[clantz@lxplus725 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            28G        7.2G        9.3G        193M         12G         20G
Swap:            9G        405M        9.6G

I’m not sure what these warning messages are, but the behavior is similar: buff/cache increases on the first run and does not increase on the second. I’ve also noticed that lxplus seems to free memory on demand once it has been filled, so filling the buffer doesn’t really have a negative effect.

That’s expected.

That’s not expected: if the file is not memory mapped, or if the kernel needs to swap out pages to make space for new ones, processing will be slower, but it should not be dramatically slower. It might be due to a specific configuration of your local machine.

As far as I can tell the behavior on lxplus is not problematic or pathological.

Let’s ping @Axel and @pcanal (both currently traveling) in case they want to add a more authoritative opinion.

If you believe the slowdown you see is a bug in ROOT and you have a reproducer that we can play with, please open an issue on Jira. Otherwise I can’t think of any knobs that can be turned at the ROOT level to mitigate the problem: as far as I know, file caching is performed by the kernel no matter how you read, transparently with respect to the reading process.
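
For illustration only (this is plain POSIX, not a ROOT feature, and just a sketch): at the operating-system level a process can hint the kernel that the cached pages of a file it has finished reading are no longer needed, via posix_fadvise with POSIX_FADV_DONTNEED, along these lines:

// Sketch: advise the kernel to drop its page-cache entries for a file after reading it.
// Plain POSIX; the file name is just an example taken from this thread.
#include <fcntl.h>    // open, posix_fadvise, POSIX_FADV_DONTNEED
#include <unistd.h>   // close
#include <cstdio>     // perror, fprintf

int main()
{
   const char* path = "ZDCBeamTestRun171.root";
   int fd = open(path, O_RDONLY);
   if (fd < 0) { std::perror("open"); return 1; }

   // ... read/process the file here (the kernel fills its page cache as you read) ...

   // Tell the kernel the cached pages for this file can be dropped.
   int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
   if (rc != 0)
      std::fprintf(stderr, "posix_fadvise failed: %d\n", rc);

   close(fd);
   return 0;
}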

Cheers,
Enrico

Hi Enrico,

Thank you. That was actually where I was headed next. Because lxplus is clearly freeing memory on demand, caching isn’t really an issue there. I just don’t have the expertise to know whether this was something I could take care of in ROOT. I’ll contact my IT department to see if they can manage our memory similarly to the way lxplus does.

Thank you Enrico and Wile_E_Coyote for your help.

It seems to me that you have a very slow connection to your “disks”. So, as long as your data file is already in the cache (in RAM), your program runs fast, but when your operating system needs to “download” your data file (into its RAM cache), you suffer from the “raw performance” of your connection.

Note: if you use an AFS client on your machine, make sure that its own cache is large enough to hold your data files (otherwise it will need to re-download them each time you need them). You may want to inspect the /etc/openafs/cacheinfo file (the default value is even below 49 MB; see man cacheinfo, and also check whether the AFS client is configured to use the disk cache or the memory cache).
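
For reference, cacheinfo is a single line of the form afs-mount-point:cache-directory:cache-size-in-1kB-blocks, so a stock entry looks roughly like this (the paths and size below are just an illustrative example):

/afs:/usr/vice/cache:50000

(50000 one-kilobyte blocks is about 49 MB, in line with the small default mentioned above.)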

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.