Performance effects in a ROOT-based data acquisition

Hi ROOTers, I would appreciate your help/advice on a topic related to performance.

I have a data acquisition system which reads data from a PCIe digitizer card (via a kernel module provided by the vendor). The program runs on a high-performance PC (a 3990X with 64 cores and 256 GiB of RAM) and uses the vendor’s API to copy data from the digitizer’s internal buffer into the computer’s RAM.

Now the problem. I did some tests (acquiring 48 blocks of 256 MiB from the digitizer, i.e. filling 12 GiB of RAM) and noticed something weird. If I compile my program using “make” and then run the test, I always get the same data transfer speed (48 ms copy time per 256 MiB block). However, if I compile the program with “make -j6” and run the test, I sometimes get 48 ms and sometimes 53 ms, so there is jitter. This is a big issue because I really need to stay below 50 ms every time (i.e. sustain roughly 5 GiB/s) so that I do not lose any data from the digitizer (which runs at 2.5 GSPS with 2 bytes per sample).

I was scratching my head over how the number of compilation cores could affect runtime performance at all, which sounds very weird. But then I noticed something: if I compile using “make” and then wait a couple of minutes before running my test, I get the same jitter (48–53 ms) as when I compile with “make -j6” and run it straight away.

So now I am speculating that, somehow, “make” is warming up one of the cores, which then runs at full speed, whereas “make -j6” distributes the load among more cores. Does this make sense? Is there a way to make this jitter disappear (I tried “nice”, to no avail)? Or does it maybe have to do with kernel task switching / caches / … ?

I am using a TThread for the data acquisition thread. Is there any way to “warm it up” before starting the data acquisition, or to give it a high priority?
Maybe related: multithreading - Do C++ std::threads have a warm-up period? - Stack Overflow

Thanks in advance!


   ------------------------------------------------------------------
  | Welcome to ROOT 6.25/01                        https://root.cern |
  | (c) 1995-2021, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on May 27 2021, 15:48:46                 |
  | From heads/meta_nullptr@v6-25-01-1092-gf684721d6d                |
  | With                                                             |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q'       |
   ------------------------------------------------------------------

Run your DAQ process using “taskset”.
You may also really want to “reserve” one (or more) of your cores exclusively for this particular process (run with “taskset”), using the “isolcpus” kernel parameter passed to the kernel at boot via the boot loader or set in the GRUB configuration file.
Well, “cset” and / or “numactl” may also be relevant here.
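
For reference, a minimal sketch of doing the same pinning from inside the program rather than via taskset (Linux-specific; the CPU range 0–3 is just an example and should match whatever cores you reserve):

    // Minimal sketch: pin the calling process to CPUs 0-3 from inside the
    // program, i.e. the programmatic equivalent of
    // "taskset --cpu-list 0-3 ./myexecutable" (Linux-specific).
    #include <sched.h>  // sched_setaffinity, CPU_SET (needs _GNU_SOURCE; g++ defines it)
    #include <cstdio>

    int main()
    {
       cpu_set_t set;
       CPU_ZERO(&set);
       for (int cpu = 0; cpu <= 3; ++cpu)  // example range; adapt to your isolcpus setting
          CPU_SET(cpu, &set);

       // pid 0 = the calling process/thread; threads created afterwards inherit the mask
       if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          std::perror("sched_setaffinity");
          return 1;
       }
       std::printf("process pinned to CPUs 0-3\n");
       return 0;
    }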


Amazing, thanks Wile for the reply! I did not know about the taskset tool. This seems to be working:

taskset --cpu-list 0-3 ./myexecutable

I repeated it 5 times, and I always got 48 ms without jitter. Then, I executed it without taskset, and I got the jitter again.

The funny thing is that if I execute it again with taskset (very soon after the run without it), the jitter is still there! If I then wait a couple of minutes and execute it again with taskset, the jitter disappears. It seems a bit like magic to me, because I have no clue how taskset works, how it can depend on the process executed just before, or what the interaction with the PCIe kernel module is.

You really need to “reserve” some cores for this process (so that other processes will not be allowed to use them).


To add a bit:

  • in the production system, make sure to turn off Intel Turboboost or other CPU throttling mechanisms (usually done from the BIOS settings) to avoid sudden changes in CPU frequency
  • how many cores you use to compile the program won’t change its performance (but compilation flags will, a lot, e.g. make sure you compile the program with -O2)
  • what might cause jitter is the kernel scheduling threads on and off CPUs as needed, the CPU caches (shared among CPUs) getting filled with data used by other programs, and similar effects. Make sure your benchmark is run on a machine at rest (no other applications running), or at least on a machine with a load similar to what you will have in production. taskset will reduce the number of context switches by pinning a given process to a given CPU, but as far as I know it won’t prevent the kernel from scheduling other processes on that CPU (scheduling your process out) if it really needs to

About the benchmark itself: there is a program start-up time during which the program loads shared libraries into memory, etc., which might be much larger the first time you execute the program (after the first execution the necessary files will be warm in the filesystem cache). This might change the runtime of your test if it’s very short-lived, but it won’t matter for long-running executions. Also, probably obvious, but make sure you are measuring wall-clock time and not CPU time.
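
For completeness, a minimal sketch of that last distinction with ROOT’s TStopwatch; DoTransfer() is only a stand-in for the real vendor call:

    // Minimal sketch: wall-clock vs CPU time with ROOT's TStopwatch.
    #include "TStopwatch.h"
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Placeholder: pretend the PCIe -> RAM transfer takes ~48 ms.
    void DoTransfer()
    {
       std::this_thread::sleep_for(std::chrono::milliseconds(48));
    }

    void benchmarkTransfer()
    {
       TStopwatch timer;
       timer.Start();
       DoTransfer();
       timer.Stop();
       // RealTime() is wall-clock time; CpuTime() only counts time actually spent
       // on a CPU, so it can be much smaller if the call sleeps or waits on the device.
       std::printf("real: %.3f s, cpu: %.3f s\n", timer.RealTime(), timer.CpuTime());
    }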

P.S.
Ah, and you can give your process a high priority with nice! The kernel will try not to schedule out processes with a low nice level.
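
A minimal sketch of doing this from inside the program with the setpriority() call (a negative nice value needs root or the CAP_SYS_NICE capability; -10 is just an example):

    // Minimal sketch: raise the priority of the calling process by lowering its
    // nice level. Negative values require root or CAP_SYS_NICE.
    #include <sys/resource.h>  // setpriority, PRIO_PROCESS
    #include <cstdio>

    bool raisePriority(int niceLevel = -10)  // example value
    {
       // who = 0 means "the calling process"
       if (setpriority(PRIO_PROCESS, 0, niceLevel) != 0) {
          std::perror("setpriority");
          return false;
       }
       return true;
    }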

P.P.S.
Of course, if you can, the best solution is to specifically reserve some cores for your task, as @Wile_E_Coyote mentioned.


Thanks Enrico for the insight.

  • It’s an AMD CPU; I’ll check whether it has some throttling enabled in the BIOS.
  • Yes, I am compiling with the -O2 flag. Actually, the 48 ms are independent of the -O2 flag; I checked -O0, -O1, -O2, -O3, all the same, because it is a call to an external library from the vendor’s API (which does the transfer from the PCIe device to the RAM). Obviously the rest of the program runs faster with -O2, but the jitter does not come from there.
  • OK, I now understand the cause of the jitter better, thanks. Yes, the machine is at rest: there is just one user logged in, and there are some background processes from Ubuntu 20 (like the update-manager) that use a tiny amount of memory, but not much compared to the whole RAM.
  • I only benchmark the time of a particular function (transfer memory from PCIe device to RAM), so the load time should not affect it.
  • I am using ROOT’s TStopwatch for benchmarking, as follows: readTime.Start(); device->TransferMemory(...); readTime.Stop(); and then readTime.RealTime() gives me the 48 ms (or 52 ms if I do not use taskset and there is jitter).
  • Before using taskset, I had tried to give a maximum nice priority, but the jitter was still there.
  • I will follow the advice of reserving some cores, thanks for the advice.

OK, I have now reserved cores with isolcpus=0-3 in the GRUB configuration, and then run with taskset --cpu-list 0-3.

However, I checked, and all three of my threads (see TRentrantRWlock thread lock, program freezes - #6 by ferhue) are running on the same CPU, namely number 0.

Is there a way to set the CPU affinity of a TThread, similarly to what can be done with a std::thread (https://stackoverflow.com/a/57620568/7471760)? → I created a PR: Allow specifying pthread CPU affinity by ferdymercury · Pull Request #8557 · root-project/root · GitHub
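
Independently of that PR, a thread can also pin itself from inside its own thread function using the GNU pthread_setaffinity_np() extension; a minimal sketch (the CPU number is just an example):

    // Minimal sketch: a thread pins itself to one CPU (here CPU 1). Call this at
    // the start of the DAQ thread function, before the acquisition loop begins.
    #include <pthread.h>  // pthread_setaffinity_np (needs _GNU_SOURCE; g++ defines it)
    #include <sched.h>    // cpu_set_t, CPU_SET
    #include <cstdio>

    bool pinThisThread(int cpu = 1)  // example CPU; pick one inside the isolcpus range
    {
       cpu_set_t set;
       CPU_ZERO(&set);
       CPU_SET(cpu, &set);
       // Returns an error number directly (does not set errno).
       const int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
       if (err != 0) {
          std::fprintf(stderr, "pthread_setaffinity_np failed with error %d\n", err);
          return false;
       }
       return true;
    }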

Alright, using my patch, I now create my TThread on CPU 1, whereas the other default threads (I guess “Linux” and the “MainWindow GUI”) stay on CPU 0.

Right now I get very consistent results: all readouts take 48.2 ± 0.2 milliseconds.

However, I later updated Linux from 5.8.0-53 to 5.8.0-55, and with the newer version it always takes 51 milliseconds instead (which is not great, because anything above 50 ms means I will lose data). I tried -57 and -59 with the same effect.

Is there any way I can give feedback to the kernel developers or explore what parts of the PCIe communication changed to see what is delaying the copy_to_RAM function? Thanks for the support.

EDIT: very occasionally I also see some 51 ms readouts on kernel -53, but less often, so I am not sure what’s going on.

It seems to me that … the best idea would be to create a kernel device driver that would create a FIFO / pipe (it could simply store the DAQ data in it, with a RAM space for at least 50 “events”, which would be something like 12.5 GiB RAM, i.e., 2.5 seconds of “acquisition”).

Well, if needed, you could “sacrifice,” e.g., 10% of your RAM for the FIFO / pipe. This would be something like 25 GiB RAM so that it could keep up to 100 “events” (256 MiB each), i.e., up to 5 seconds “acquisition” time (a new “event” each 50 ms).

An ordinary user process (e.g., with a dedicated “thread” in a ROOT application) could then read this FIFO / pipe and process the “events” it finds asynchronously (so there would be no need for any games with “reserving” cores, though one could increase the “priority” of this process).
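
The kernel-driver part is beyond a quick sketch, but the user-space side of such a FIFO is essentially a single-producer / single-consumer ring of preallocated blocks; a rough illustration (all class and method names here are made up, sizes taken from the numbers above):

    // Rough illustration of the FIFO idea in user space: a single-producer /
    // single-consumer ring of preallocated 256 MiB blocks. The DAQ thread fills
    // the next free block and publishes it; a reader consumes filled blocks
    // asynchronously.
    #include <atomic>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kBlockSize = 256UL * 1024 * 1024;  // one "event" = 256 MiB
    constexpr std::size_t kNBlocks   = 50;                   // ~12.5 GiB, i.e. ~2.5 s of data

    class EventRing {
    public:
       EventRing() : fBlocks(kNBlocks, std::vector<char>(kBlockSize)) {}

       // Producer: next free block, or nullptr if the consumer fell behind (data loss).
       char *NextFreeBlock()
       {
          if (fHead.load(std::memory_order_relaxed) - fTail.load(std::memory_order_acquire) >= kNBlocks)
             return nullptr;
          return fBlocks[fHead.load(std::memory_order_relaxed) % kNBlocks].data();
       }
       void PublishBlock() { fHead.fetch_add(1, std::memory_order_release); }

       // Consumer: oldest filled block, or nullptr if nothing is ready yet.
       const char *NextFilledBlock()
       {
          if (fTail.load(std::memory_order_relaxed) == fHead.load(std::memory_order_acquire))
             return nullptr;
          return fBlocks[fTail.load(std::memory_order_relaxed) % kNBlocks].data();
       }
       void ReleaseBlock() { fTail.fetch_add(1, std::memory_order_release); }

    private:
       std::vector<std::vector<char>> fBlocks;  // allocated once, up front
       std::atomic<std::size_t> fHead{0};       // blocks published by the producer
       std::atomic<std::size_t> fTail{0};       // blocks released by the consumer
    };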


Thanks for the reply and the nice suggestions. The kernel module is programmed by the device vendor (struck.de/), so I am not sure I should mess around with it too much. My original idea was to just use their C API, which provides a function ReadMemory(…) that directly transfers 256 MiB from the PCIe device to an int* pointer in the PC’s memory.

On the other hand, I opted not to use an intermediate FIFO buffer, but rather to fill a RAM buffer of predefined size sequentially until it is full. The reason is that my acquisitions during ‘beam on’ are short (always below 30 seconds); thus 30 s * 5 GB/s = 150 GB, which fits well within the 256 GB of RAM. The idea is not to lose any data during the measurement until the preselected buffer size (acquisition time) is reached. Processing of the “50ms events” is done offline, so there is no need for many threads, but I do want one simultaneous GUI thread that just shows in real time how many events have been acquired and what the readoutTime was, to keep track that nothing is going wrong during the acquisition.
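
A rough sketch of that layout, with ReadMemory() as a stub for the vendor call and all other names made up: the acquisition thread fills one big preallocated buffer block by block and only publishes an atomic counter, which the GUI thread can poll without touching the data.

    // Rough sketch: sequential filling of one big preallocated RAM buffer.
    #include <atomic>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t kBlockWords = 256UL * 1024 * 1024 / sizeof(int);  // 256 MiB of samples
    constexpr std::size_t kMaxBlocks  = 600;  // ~150 GiB, ~30 s at 50 ms/block
                                              // (sized for the 256 GiB machine; shrink for testing)

    // Stub: in the real program this is the vendor's PCIe -> RAM copy; the
    // "returns elapsed ms" signature is invented for this sketch.
    double ReadMemory(int * /*dest*/) { return 48.0; }

    std::vector<int>         gBuffer(kBlockWords * kMaxBlocks);  // preallocated before "beam on"
    std::atomic<std::size_t> gBlocksAcquired{0};                 // polled by the GUI thread
    std::atomic<double>      gLastReadoutMs{0.0};                // last transfer time, for display

    // Acquisition thread: fill the buffer block by block until it is full.
    void AcquisitionLoop()
    {
       for (std::size_t i = 0; i < kMaxBlocks; ++i) {
          int *dest = gBuffer.data() + i * kBlockWords;
          const double ms = ReadMemory(dest);
          gLastReadoutMs.store(ms, std::memory_order_relaxed);
          gBlocksAcquired.store(i + 1, std::memory_order_release);
       }
    }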

I also just found out that I was losing 10 ms (and up to 100 ms) each iteration due to the somewhat erratic behavior of the Emit signal system / TThreadTimer, see TThreadTimer behavior · Issue #8582 · root-project/root · GitHub
If I do not find a solution, I am considering switching from TThread to std::thread coupled with a signal-slot library like GitHub - palacaze/sigslot: A simple C++14 signal-slots implementation, which might reduce this lag and jitter. That would hopefully be enough to stay within the 50 ms per iteration, and it is more flexible than modifying the kernel device driver.
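
For what it’s worth, a rough sketch of how that combination could look (structure and names are illustrative only; note that sigslot invokes the connected slots synchronously, in the emitting thread):

    // Rough sketch: std::thread plus the palacaze/sigslot library instead of
    // TThread + Emit().
    #include <sigslot/signal.hpp>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main()
    {
       // Emitted by the DAQ thread after each block: (block index, readout time in ms).
       sigslot::signal<int, double> blockDone;

       // "GUI" side: here just a slot that prints the progress.
       blockDone.connect([](int block, double ms) {
          std::printf("block %d acquired in %.1f ms\n", block, ms);
       });

       std::thread daq([&blockDone] {
          for (int i = 0; i < 5; ++i) {
             std::this_thread::sleep_for(std::chrono::milliseconds(48));  // stands in for the transfer
             blockDone(i, 48.0);  // synchronous call into the connected slots
          }
       });
       daq.join();
       return 0;
    }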

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.