ROOT 6.22 very slow startup

Hello!

I am having trouble with ROOT 6.22 on a remote machine: it is very slow to start up (both from the command line and when reading a ROOT file with code compiled against the ROOT libraries). There were some old topics about a similar issue with ROOT 6, but no clear solution was provided in them.

I’m using ROOT 6.22 from Git (the 6.22 branch) and I compiled it with CMake 3.18.1 and the following options:
set(_root_enabled_options asimage builtin_afterimage builtin_cfitsio builtin_clang builtin_fftw3 builtin_ftgl builtin_gl2ps builtin_glew builtin_llvm builtin_lz4 builtin_tbb builtin_vdt builtin_xrootd builtin_xxhash builtin_zstd clad dataframe exceptions fftw3 fitsio gdml http imt mathmore mlp opengl pyroot roofit runtime_cxxmodules shared ssl tmva tmva-cpu tmva-pymva spectrum vdt x11 xml xrootd)
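For reference, these are the usual ROOT CMake switches, enabled at configure time with -D<option>=ON flags; roughly like this (the paths here are just placeholders):

cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -Dbuiltin_llvm=ON -Druntime_cxxmodules=ON -Dxrootd=ON /path/to/root-source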

The (remote) machine on which I compiled is a 64-bit Linux server, but I don’t know which distribution it runs (and I don’t have root access).

When using root-6.22:

time root -l -q

real 0m40.167s
user 0m0.149s
sys 0m0.674s

time root -l -q <root_file.root>

Attaching file <root_file.root> as _file0…
(TFile *) 0x3421970

real 1m58.145s
user 0m0.456s
sys 0m1.288s

However, if I use root-5.34:

time root -l -q
root [0]
real 0m0.506s
user 0m0.060s
sys 0m0.070s

time root -l -q <root_file.root>
root [0]
Attaching file <root_file.root> as _file0…
Warning in TClass::TClass: no dictionary for class ROOT::TIOFeatures is available

real 0m11.889s
user 0m0.218s
sys 0m0.258s

The ROOT file I used for this timing test was made with ROOT 6.22.
Are there any clues as to what I could try in order to improve the situation?

Thanks,
Guillaume

If it’s useful, here is some additional information:

  • I have 5 directories in ROOT_INCLUDE_PATH (all of them exist; see the quick check after this list).
  • The ROOT 5.34 and ROOT 6.22 build directories are on the same disk.
  • I tried to compile ROOT 6.22 with and without Python 3 installed; this had no impact.
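A quick way to check those ROOT_INCLUDE_PATH entries (and how fast the lookups are), assuming the usual colon-separated format:

for d in $(echo "$ROOT_INCLUDE_PATH" | tr ':' ' '); do time ls -ld "$d"; done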

Hi,

Something is terribly broken. Please attach the (zipped!) for-axel.txt from

strace -o for-axel.txt -r root.exe -l -q

(Yes, the .exe is intentional, on Linux.)

And please share the output of lsb_release -a.

Cheers, Axel.

Hi Axel,

Thank you, I’m producing the strace and will post it later. In the meantime, the OS version is:

LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.2.1511 (Core)
Release: 7.2.1511
Codename: Core

Here is the strace (I had a look at it; there is a huge number of “(No such file or directory)” entries, but I don’t really know what should be expected).

For this strace I modified my configuration a bit (to remove Geant4 and some other work directories from LD_LIBRARY_PATH, PATH, and ROOT_INCLUDE_PATH). ROOT_INCLUDE_PATH now includes only one directory (my home directory).

for-axel-2.zip (87.9 KB)

I also tried to generate the strace in a configuration without declaring anything in ROOT_INCLUDE_PATH (in this case, producing the strace was much faster, but the time to start root-6.22 doesn’t change much, ~40 s):

for-axel-3.zip (87.9 KB)

Thanks,
Guillaume

Thanks. What does time ls -l /home/pronost/cassert show?

And sorry - wrong strace flag :frowning: Can you re-run with strace -T -ttt -o for-axel.txt root.exe -l -q ?

OK, I will redo the strace then.
For the ls -l:

time ls -l /home/pronost/cassert
ls: cannot access /home/pronost/cassert: No such file or directory

real 0m0.003s
user 0m0.000s
sys 0m0.001s

Here are the straces (with the same configurations as the ones shown above):
for-axel-2.zip (124.0 KB)
for-axel-3.zip (125.2 KB)

This time the strace production was very fast; I’m not sure why. The time for root -l -q was also slightly better immediately after the strace (2 s and 7 s), but it was back to 48 s (configuration 2) and 1 min 15 s (configuration 3) one minute later. The main difference between now and before is that I restarted my computer just before connecting to the remote machine (I don’t think this has any impact, however).

Hi Axel, do you have any idea from the strace?

No, if it works with the strace but not without, then it’s really hard to tell! We need an strace of a slow run. Next option: run root.exe -l -q -e 'gDebug=7' and note where it “sits and waits” (I need the context before and after). Just press Enter a couple of times to “mark” where there was no progress. Maybe that gives a hint…

Hi Axel,

OK, thank you for your help. Indeed, the issue likely comes from some local condition (disk usage, for example). I tried to generate some straces; the time they take varies a lot: some were very fast (~4 s), some were slower (~15 s, ~50 s, ~75 s).

Globally, when it’s slow, everything is slower, but it seems the slowest parts are:

1608251712.907689 mprotect(0x7f26dcd2e000, 4096, PROT_READ) = 0 <0.000013>
1608251716.845676 mprotect(0x7f26dfb25000, 2215936, PROT_READ) = 0 <0.000033>
1608251717.926582 brk(0) = 0x2539000 <0.000011>

1608251723.482701 stat("/disk02/usr6/pronost/software/root-6.22.00-build/etc//cling/lib/clang/5.0.0/include/stdint.h", {st_mode=S_IFREG|0644, st_size=23387, ...}) = 0 <0.000035>
1608251725.162676 futex(0x7f26dfd6ed9c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000011>

1608251744.917222 brk(0x3501000) = 0x3501000 <0.000011>
1608251752.825368 brk(0) = 0x3501000 <0.000012>
1608251752.825420 brk(0x3522000) = 0x3522000 <0.000014>
1608251761.189044 stat("/home/pronost", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0 <0.000020>

1608251761.442344 read(3, "# This file contains a set of RO"..., 4194304) = 906 <0.031608>
1608251763.112581 read(3, "", 4194304) = 0 <0.000077>
1608251763.620644 stat("input_line_10", 0x7ffd6c2560b0) = -1 ENOENT (No such file or directory) <0.000011>
1608251768.466338 brk(0) = 0x3656000 <0.000011>
1608251768.466389 brk(0x367e000) = 0x367e000 <0.000015>
1608251769.423531 brk(0) = 0x367e000 <0.000011>
1608251769.423578 brk(0x369f000) = 0x369f000 <0.000014>
1608251771.151592 getcwd("/home/pronost", 8192) = 14 <0.000015>

1608251775.668213 munmap(0x7f26e274f000, 4096) = 0 <0.000022>
1608251781.977775 stat("/disk02/usr6/pronost/software/root-6.22.00-build/lib/libMathCore.so", {st_mode=S_IFREG|0755, st_size=2581654, ...}) = 0 <0.000066>

1608251781.979890 rt_sigaction(SIGINT, {0x7f26e1ee91f0, [], SA_RESTORER|SA_RESTART, 0x7f26e11f6370}, {SIG_DFL, [], 0}, 8) = 0 <0.000010>
1608251786.299322 ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 <0.000014>

strace_slow_75s.zip (131.8 KB)
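For what it’s worth, gaps like the ones above can be pulled out of the -ttt timestamps automatically; a rough sketch, assuming the strace output was saved as for-axel.txt as in the earlier command:

awk 'NR > 1 { gap = $1 - prev; if (gap > 1) printf "%7.3f s before: %s\n", gap, $0 } { prev = $1 }' for-axel.txt

This prints every system call that started more than one second after the previous one, together with the size of the gap.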

Anyway, I will check with the computing team to see if they have any idea about what the issue could be.

It seems the issue was linked to the file system used on the disk on which I installed ROOT 6; moving the ROOT 6 directory back to my home directory seems to solve the situation. I don’t really understand why ROOT 5 was working fine, however.

Which file system was that?

And btw, none of the system calls shows high running times (the numbers in <...>). Whatever happened in the 8 seconds between

1608251752.825420 brk(0x3522000)        = 0x3522000 <0.000014>
1608251761.189044 stat("/home/pronost", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0 <0.000020>

is impossible to tell from the strace log. It might be worth looking into the kernel logs, but I understand you don’t have access to those.

Sorry for the late answer. I think the file system used on these disks is “Lustre”. I’m not familiar with it, so I don’t know whether this is a known issue or not.

For a very long time, I faced a similar problem on a SLURM batch cluster (based on CentOS 7 / x86_64) with all my relevant files (libraries, executables and data) on Lustre.
The administrators finally said that the problems appeared for some specific configurations of jobs running on batch nodes (they did not explain it further, but they said that the problems were “intensified” by the fact that my libraries and executables were on Lustre).
It seemed to them that the underlying problem existed somewhere on the border between the Linux kernel and the Lustre client and was related to the Linux virtual memory management.
They were trying to find a permanent solution and, as a temporary fix, they advised me to add the “--exclusive=user” option to the “sbatch” command (or the “#SBATCH --exclusive=user” directive in the job script). I actually never tried it, as I have not needed to work there since.
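For illustration only, such a directive would sit at the top of the job script, roughly like this (the job name and macro are just placeholders):

#!/bin/bash
#SBATCH --exclusive=user
#SBATCH --job-name=reprocess
root.exe -b -l -q my_macro.C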

Indeed the issue seems similar, but in my case it’s affecting everything, not only jobs.

The administrators advised me that if the binaries need to access many shared libraries, and/or access them frequently, I had better move the binaries and shared libraries to my home directory (which uses a standard file system). If the shared libraries are few and not loaded frequently, it could be fine to leave them on the Lustre file system. (I guess this is the main explanation for why ROOT 5 was behaving fine and ROOT 6 was not.)
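As a quick check of which file system a given directory actually sits on (using the build path from the strace above and the home directory as examples), something like

df -hT /disk02/usr6/pronost/software/root-6.22.00-build
df -hT "$HOME"

should report “lustre” in the Type column for the Lustre-mounted disk.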

Note that a “job” is just some shell script / executable running on some remote machine, so in my case too you could say it was “affecting everything” (on some specific node at some specific time).

These problems were not generated by my “jobs”. There were many cases where several of my “jobs” ran simultaneously on some node without problems.
However, if one of my “jobs” had problems on some node, then my other “jobs” that were started on the same node at the same time also exhibited problems (my “jobs” running on other nodes were not affected).

So, these problems depended on the actual usage of particular remote nodes.

I think it is the same in your case (e.g. in one of the previous posts here, you reported that sometimes the output from strace was very fast).

In my case too, the administrators advised me to move my libraries and executables to my home directory. Unfortunately, the disk quota (set by them) on that file system did not allow it (and it was not really meant for “intensive” operations).

BTW, in my case the administrators also repeatedly tried, for a very long time, to blame ROOT for these problems. My “jobs” were actually reprocessing some ROOT files, so they were reading / writing ROOT files from / to Lustre. I always objected, as these files were not really big (for each “job”, just one file was read and another one written, and any intermediate files were put on a RAM-disk scratch file system).

I understand, indeed this is similar.

In my case, the administrators offered to increase my disk quota so I could move the ROOT binaries / shared libraries there, so I don’t have any trouble now.
