Using RDataFrame results in large cache memory consumption

Dear All,

I’m using RDataFrame to process a large number of files (~1000) with a large number of events per file (basically an entire MC campaign). During processing, I observe a constant, rapid growth of cache memory consumption. To give you an example, consider the following output, which shows the memory configuration prior to the execution of the program:

              total        used        free      shared  buff/cache   available
Mem:           251G        4.0G        247G         18M        162M        247G
Swap:            0B          0B          0B

Just a couple of minutes later, the configuration changes dramatically (multi-threading is disabled):

              total        used        free      shared  buff/cache   available
Mem:           251G        4.1G         41G         18M        206G        246G
Swap:            0B          0B          0B

This continues until the cache consumes all memory, which kills the connection to our institute’s server. I have no idea why this happens. I’ve read in this forum that the compression of the branches in the input ROOT files plays a role, so I took a look at the tree and saw that some branches have a compression factor of more than 50! Those files are not generated by myself, so there’s nothing I can do about that. But even excluding those columns when initializing the RDF does not get rid of the problem.
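For reference, here is a minimal sketch of the kind of setup described above; the tree name "Events", the column names and the file paths are made-up placeholders, not taken from the actual analysis:

#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>
#include <iostream>
#include <string>
#include <vector>

int main() {
   ROOT::DisableImplicitMT(); // MT is disabled, as in the measurements above

   // ~1000 input files of the MC campaign (placeholder paths)
   std::vector<std::string> files;
   for (int i = 0; i < 1000; ++i)
      files.push_back("mc_campaign/file_" + std::to_string(i) + ".root");

   // The third constructor argument restricts the default column list,
   // i.e. the "excluding those columns when initializing the RDF" attempt.
   ROOT::RDataFrame df("Events", files, {"pt", "eta"});

   auto meanPt = df.Mean("pt");
   std::cout << *meanPt << '\n'; // the event loop runs here
   return 0;
}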

As you can see, I’m desperate. Any help is highly appreciated :).

Best regards,
Christof


ROOT Version: 6.20/06
Platform: x86_64-centos7
Compiler: gcc8


Try creating a small swap file (just a couple of GB in size).
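For example, a rough sketch (size and path are placeholders, root privileges required):

fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile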

Hi,
I think there is a misunderstanding: buff/cache memory can typically be reclaimed by the operating system whenever needed. It is just caching file contents for faster access (because unused RAM is wasted RAM), but whenever RAM is required by user applications for actual work, this cache is freed.

I wouldn’t suggest using a swap file; it will kill performance. I would rather try to understand exactly what the reason is for the connection dying: maybe a timeout in the ssh client/server or something similar?

Or maybe it is indeed an out-of-memory problem, but it doesn’t look like it from the output of free. You can check the maximum resident memory usage of the program, e.g. with /usr/bin/time.
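For example (the program name below is a placeholder; on Linux /usr/bin/time is GNU time, and its -v output includes the maximum resident set size):

/usr/bin/time -v ./my_rdf_analysis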

Cheers,
Enrico

P.S.
I see you are using ROOT v6.20.06. There have been large improvements in the amount of memory used by just-in-time compilation in RDataFrame since then; if possible, I would suggest upgrading to v6.24 or (in a few weeks) v6.26.
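As an aside, to illustrate what "just-in-time compilation in RDataFrame" refers to: string expressions passed to e.g. Filter and Define are compiled at run time by the interpreter, while typed callables are compiled together with the rest of the program. A minimal sketch (tree, file and branch names are made up, and the callable's argument type must match the actual branch type):

#include <ROOT/RDataFrame.hxx>
#include <iostream>

void jit_vs_typed() {
   ROOT::RDataFrame df("Events", "file.root");

   // Jitted: the string expression is compiled at run time.
   auto nJit = df.Filter("pt > 30").Count();

   // Typed: the callable is compiled ahead of time, so no run-time
   // jitting is needed for the cut itself.
   auto nTyped = df.Filter([](float pt) { return pt > 30.f; }, {"pt"}).Count();

   std::cout << *nJit << " " << *nTyped << '\n'; // one event loop, both results
}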

P.P.S.
To further clarify: the increase in usage of the filesystem cache is expected, it is a Linux feature. If there is free RAM, whenever you read a file its contents are cached in memory so that a second access will be much faster. If you read ~1000 large files one after the other, it makes sense that a lot of content ends up in the cache. But as with most caches, that is just a performance optimization, and whenever the operating system needs memory for something else it will evict content from the cache as needed.
