Macro stopped without error when looping over large number of files/events using TChain

Wile_E_Coyote · June 25, 2019, 6:27pm

You must first make sure that ACLiC is able to precompile it:

root [0] .L analysis.C++

You cannot use “helgrind-root.supp” instead of “valgrind-root.supp” (you need to find this file, maybe it is not in the “root-config --etcdir” subdirectory but somewhere else).

weishi10141993 · June 25, 2019, 6:39pm

Doing this

gives me these
(screenshot)

Wile_E_Coyote · June 25, 2019, 6:41pm

So, it seems that your system administrators do not allow you to access “/cvmfs/sft-nightlies.cern.ch” (you will need to talk to them).
Maybe @axel knows another place which does not need “nightlies”.

weishi10141993 · June 25, 2019, 7:54pm

Looks like it. I can see the directory that Axel pointed me to, but I have no access to it. I have contacted the admin now.

At the moment, compiling the macro under 5.34/38 reports many errors; such errors are not seen when compiling under 6.10/09. I also can’t find any valgrind-root.supp so far using:
find / -name "valgrind-root.supp"

Axel · June 26, 2019, 6:04am

Phew, very difficult. Forget 5.34. Instead, even without help from your admins you should be able to run

/cvmfs/sft.cern.ch/lcg/contrib/gentoo/startprefix

and then simply use ROOT - it’s 6.19/01. That should work as long as you only depend on ROOT and not on other libraries.

Cheers, Axel.

weishi10141993 · June 26, 2019, 6:48pm

Hi,

So the ROOT 6.19 is working now. But not valgrind, it says I need to install glibc’s debuginfo package on this machine?
(screenshot)

Checking the distribution under this environment below, what debug package should I install under this distribution?

gentoo cutflow_macros $ lsb_release -a
LSB Version: n/a
Distributor ID: Gentoo
Description: Gentoo Base System release 2.6
Release: 2.6
Codename: n/a

weishi10141993 · June 27, 2019, 12:19am

Ok, now I copied all files to CERN cluster (lxplus7.cern.ch) where CentOS 7 is running. I used ROOT v6.16.00 by doing ‘. /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.16.00/x86_64-centos7-gcc48-opt/bin/thisroot.sh’

At the time of posting, the macro is still looping over events (at ~7 million), but here is the output seen so far: Valgrind_Message_Summary_ROOT_v6.16_00.txt (621.5 KB)

I can see there are still many spurious valgrind warnings, but are there any hints from your perspective?

Axel · June 27, 2019, 4:56am

This seems to be the same screenshot as your previous one - is that intentional?

OK so copying is the way to go. You should have taken v6.18 - it has a much reduced number of valgrind warnings. Apologies this is such a rocky ride - it really shouldn’t be: we failed to address the valgrind warnings for way too long.

Once you have the warnings available with v6.18 please post them and we should be able to point you to where things go wrong.

Cheers, Axel.

Wile_E_Coyote · June 27, 2019, 6:32am

I don’t think that you can get “valgrind” running on this “gentoo prefix” yourself. The error that you get suggests that there are vital components missing in this setup (and you cannot install / add them on your system, they would have to be installed in the “gentoo prefix” repository).

So, on CentOS 7, simply try the standard ready-to-use binary distribution, provided by the ROOT team, which you just need to download and unpack.

weishi10141993 · June 27, 2019, 8:45pm

Sorry, I meant to put the txt file instead of the png. It is now updated in post #27.

weishi10141993 · June 27, 2019, 8:46pm

The university cluster admin gave me access to a test cluster where CentOS 7 is running, and with ROOT v6.18.00, here is the output: Valgrind_Message_Summary_ROOT_v6.18_00_University_Cluster.txt (469.3 KB)

Interestingly, the same program running on lxplus7 (also with same ROOT v6.18.00) seems to be able to loop over all 46 million events, here is the valgrind output from lxplus7: Valgrind_Message_Summary_ROOT_v6.18_00_CERN_LXPLUS.txt (408.0 KB)

Anyway, it’s more interesting to know why the macro can’t finish on the university cluster, since all input files are usually stored there.

Axel · June 28, 2019, 9:17am

Hi,

See https://stackoverflow.com/questions/15697410/sigxcpu-error-in-c-program : it’s your uni cluster saying “Weishi, you have used too much CPU time and we will now terminate your program!”

In the end it’s not a memory problem - I forgot this as a possible cause. But at least now you have the proof that your analysis does everything perfectly fine when it comes to memory!

Cheers, Axel.

Wile_E_Coyote · June 28, 2019, 9:49am

In order to make sure that it is really just a “cpu time” problem, check your current “hard and soft limits”:

[bash]$ ulimit -H -a
[bash]$ ulimit -S -a

[tcsh]$ limit -h
[tcsh]$ limit
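
The same limits can also be read from inside the program. Below is a minimal sketch, assuming a POSIX system (as on CentOS 7). The function names `FormatLimit` and `PrintCpuTimeLimits` are made up for illustration and are not part of ROOT:

```cpp
#include <sys/resource.h>  // getrlimit, RLIMIT_CPU, RLIM_INFINITY
#include <cstdio>          // std::printf, std::perror
#include <string>          // std::string, std::to_string

// Hypothetical helper: turn one limit value into readable text.
// RLIM_INFINITY means that no limit is set.
std::string FormatLimit(rlim_t value) {
    if (value == RLIM_INFINITY) return "unlimited";
    return std::to_string(static_cast<unsigned long long>(value)) + " s";
}

// Hypothetical helper: print the soft and hard CPU-time limits
// of the current process. Returns true if they could be read.
bool PrintCpuTimeLimits() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CPU, &lim) != 0) {
        std::perror("getrlimit");
        return false;
    }
    std::printf("CPU time soft limit: %s\n", FormatLimit(lim.rlim_cur).c_str());
    std::printf("CPU time hard limit: %s\n", FormatLimit(lim.rlim_max).c_str());
    return true;
}
```

Calling `PrintCpuTimeLimits()` at the start of the macro would put the limits into the job log, next to the rest of the output.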

weishi10141993 · June 28, 2019, 6:37pm

This is what it says:
(screenshot)
So the hard cpu time limit is 1500s (=25mins), soft limit is 1200s (=20mins).

weishi10141993 · June 28, 2019, 7:11pm

Is there any relevant system notification or header file in ROOT that I can add to my program to report this kind of CPU limit/memory issue?

Currently the program just stops without any error message.
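
One possible approach, independent of ROOT, is to catch the signal with plain POSIX calls. On Linux the kernel sends SIGXCPU when the soft CPU limit is reached, and SIGKILL (which cannot be caught) at the hard limit. The sketch below is only an assumption about how this could be done. The names `gCpuLimitHit`, `OnCpuLimit` and `InstallCpuLimitHandler` are hypothetical, and how this interacts with ROOT's own signal handling has not been checked in this thread:

```cpp
#include <csignal>   // sigaction, sigemptyset, SIGXCPU, std::sig_atomic_t
#include <cstring>   // std::memset
#include <unistd.h>  // write, STDERR_FILENO

// Hypothetical flag: set by the handler, polled by the event loop.
volatile std::sig_atomic_t gCpuLimitHit = 0;

// Signal handler. Only async-signal-safe calls are used here,
// so the message goes out through write() and not through printf/cout.
extern "C" void OnCpuLimit(int) {
    gCpuLimitHit = 1;
    static const char msg[] = "Soft CPU time limit reached (SIGXCPU)\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;  // nothing useful can be done if the write fails
}

// Install the handler for SIGXCPU. Returns true on success.
bool InstallCpuLimitHandler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = OnCpuLimit;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGXCPU, &sa, nullptr) == 0;
}
```

The macro would call `InstallCpuLimitHandler()` once before the event loop and check `if (gCpuLimitHit) break;` inside the loop, so that it can write its output and exit before the hard limit is reached. With the limits reported above, that leaves up to 300 s of CPU time between the soft limit (1200 s) and the hard limit (1500 s).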
