Count entries for large file list using TThreadExecutor

Dear ROOT experts,

There are several cases where I need to open a large number of files to check that they are not corrupt and contain a specific object like a tree or histogram (e.g. for job output), or to count the total number of entries in the TTree per file. The final result should be a map/dictionary between the file name and the number of events (negative if corrupt).

This is typically quite a light & fast task, but often the number of files is large, O(100)–O(1000), and sometimes opening remote files on GRID (via XRootD) can be slow. I would like to speed up this task by parallelizing in a simple C++ macro, which in the end is called from python via ROOT.

I naively implemented a macro using TThreadExecutor, roughly as follows:

  
  ROOT::EnableThreadSafety(); // make ROOT safe for concurrent file access
  ROOT::TThreadExecutor pool;
  
  auto workItem = [&](const std::string& filename) {
      // Returns the entry count of `treename`, or a negative value if the file is corrupt.
      return getNEventsFromFile(filename, treename);
  };
  
  std::map<std::string, int> nevtsMap;
  std::vector<int> results = pool.Map(workItem, filenames);
  for (size_t i = 0; i < filenames.size(); ++i) {
      nevtsMap[filenames[i]] = results[i];
  }

Attached to this post is a minimal reproducible example. This naive implementation seems to work as expected; however, I am wondering if I can optimize the threading. For example, one might want to change the number of files per chunk, depending on the number of files, or on whether the files are local (fast) or remote (slow).

Could you give some advice on the following points, please?

  1. The worker item only opens/closes a ROOT file and returns a single integer. It does not need to pass around ROOT objects, so is using TThreadExecutor still “okay” for this task, or would it be better to use standard C++ tools for multithreading?
  2. How does TThreadExecutor::Map decide the number of chunks by default? Is it the number of available cores, or the number of passed arguments?
  3. Is there a good way to control the chunk granularity with TThreadExecutor in this use case? I naively tried to use MapReduce, which allows the user to set the number of chunks, but it requires a reduction function, while I still need the integer for each filename.
  4. To overcome the last point above, is there a thread-safe method to fill a map with the integers in the worker item or reduction function?

Thanks!
Izaak


Minimal reproducible example:
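
For reference, a minimal sketch of what such a macro could look like (the body of getNEventsFromFile here only illustrates the behaviour described above; it is not the attached code):

  #include <map>
  #include <memory>
  #include <string>
  #include <vector>
  #include "TFile.h"
  #include "TROOT.h"
  #include "TTree.h"
  #include "ROOT/TThreadExecutor.hxx"
  
  // Return the entry count of `treename` in `filename`,
  // or a negative value if the file is corrupt or the tree is missing.
  int getNEventsFromFile(const std::string& filename, const std::string& treename) {
      std::unique_ptr<TFile> file(TFile::Open(filename.c_str()));
      if (!file || file->IsZombie())
          return -1; // could not open, or corrupt
      auto* tree = file->Get<TTree>(treename.c_str());
      if (!tree)
          return -2; // tree not found
      return static_cast<int>(tree->GetEntries());
  }
  
  std::map<std::string, int> countEntries(std::vector<std::string> filenames,
                                          const std::string& treename) {
      ROOT::EnableThreadSafety(); // required before concurrent file access
      ROOT::TThreadExecutor pool;
      auto workItem = [&](const std::string& filename) {
          return getNEventsFromFile(filename, treename);
      };
      std::vector<int> results = pool.Map(workItem, filenames);
      std::map<std::string, int> nevtsMap;
      for (size_t i = 0; i < filenames.size(); ++i)
          nevtsMap[filenames[i]] = results[i];
      return nevtsMap;
  }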


ROOT Version: ROOT 6.30/04
Platform: Built for macosxarm64 on Feb 10 2024, 14:55:51
Compiler: Apple clang version 15.0.0 (clang-1500.1.0.2.5)


Hello @IzaakWN,

Here is a TL;DR version of what I would recommend:
Create a vector of zeroes, with one entry per file you want to test, e.g.

std::vector<int> results(nFile, 0); // int, so corrupt files can be flagged with negative values

and then run Foreach with an integer sequence as argument:

  auto workItem = [&](unsigned int i) {
      results[i] = getNEventsFromFile(filenames[i], treename);
  };
  pool.Foreach(workItem, ROOT::TSeqU(nFile)); // TSeqU(nFile) generates the indices 0 ... nFile-1

It’s crucial that you size that vector up front (here by filling it with zeroes), because you cannot reallocate during the MT execution, but with the above, you should be good to run multithreaded.
If you feel like it, you can set the nChunks argument of the Foreach to avoid overloading the task scheduler, but 1000 tasks should still be OK, so one file per task (the default) might work fine.
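
For instance (a sketch; the factor 4 is an arbitrary choice, and GetPoolSize() returns the number of worker threads):

  // Bundle the nFile work items into ~4 tasks per thread instead of one task per file:
  pool.Foreach(workItem, ROOT::TSeqU(nFile), 4 * pool.GetPoolSize());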

The longer version:

  1. The worker item only opens/closes a ROOT file and returns a single integer. It does not need to pass around ROOT objects, so is using TThreadExecutor still “okay” for this task, or would it be better to use standard C++ tools for multithreading?

It’s okay to use it, but C++ threads or other kinds of parallel_for implementations would be equally fine.
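
For illustration, here is a minimal plain-C++ sketch (assuming the same getNEventsFromFile, filenames and treename as above, and that ROOT::EnableThreadSafety() has been called); a shared atomic index replaces the executor's scheduling:

  #include <atomic>
  #include <thread>
  
  std::vector<int> results(filenames.size(), 0);
  std::atomic<std::size_t> next{0};
  auto worker = [&]() {
      // Each thread repeatedly claims the next unprocessed index.
      for (std::size_t i = next++; i < filenames.size(); i = next++)
          results[i] = getNEventsFromFile(filenames[i], treename);
  };
  std::vector<std::thread> threads;
  for (unsigned int t = 0; t < std::thread::hardware_concurrency(); ++t)
      threads.emplace_back(worker);
  for (auto& t : threads)
      t.join();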

  2. How does TThreadExecutor::Map decide the number of chunks by default? Is it the number of available cores, or the number of passed arguments?

MapReduce has an nChunks argument, so you can choose, actually. Note, however, that the number of chunks is not the number of threads.
Behind the scenes, it uses TBB and creates nChunks tasks, which are picked up by the threads as they become available. (This is typically what you want: keeping the threads alive and just assigning new work to them.)
For the number of threads, ROOT and TBB use the number of cores, unless you are in a container with cgroups enabled, you initialise TBB to a different number of threads before you start using ROOT, or you asked ROOT for fewer threads.
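
For example, the executor's constructor accepts an explicit thread count:

  ROOT::TThreadExecutor pool(8); // use 8 worker threads instead of one per core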
See also the docs

  3. Is there a good way to control the chunk granularity with TThreadExecutor in this use case? I naively tried to use MapReduce, which allows the user to set the number of chunks, but it requires a reduction function, while I still need the integer for each filename.

The Map function also has an nChunks argument, but for some reason it’s private. That doesn’t make much sense to me, but well …
You could of course write a reduction function that just concatenates its inputs, so in the end it doesn’t reduce anything at all, but the Foreach above seems easier to me. :sweat_smile:
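
Such a pass-through reduction could look like this (a sketch: the work item is changed to return (filename, count) pairs so the result doesn't depend on chunk ordering, and nChunks is whatever chunk count you like):

  using Entry = std::pair<std::string, int>;
  auto workItem = [&](const std::string& filename) {
      return std::vector<Entry>{{filename, getNEventsFromFile(filename, treename)}};
  };
  // "Reduction" that merely flattens the per-chunk vectors into one.
  auto concat = [](const std::vector<std::vector<Entry>>& parts) {
      std::vector<Entry> all;
      for (const auto& p : parts)
          all.insert(all.end(), p.begin(), p.end());
      return all;
  };
  auto pairs = pool.MapReduce(workItem, filenames, concat, nChunks);
  std::map<std::string, int> nevtsMap(pairs.begin(), pairs.end());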

  4. To overcome the last point above, is there a thread-safe method to fill a map with the integers in the worker item or reduction function?

In general, the answer is that you need a lock, or you need to use a different data structure. That’s why I suggested preallocating a vector, so each thread can safely write into its own index.
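
If you do want to fill the map directly, a lock-based sketch (reusing the pool and names from the first post) would be:

  #include <mutex>
  
  std::map<std::string, int> nevtsMap;
  std::mutex mapMutex;
  auto workItem = [&](const std::string& filename) {
      const int n = getNEventsFromFile(filename, treename); // do the slow part unlocked
      std::lock_guard<std::mutex> lock(mapMutex);           // serialise only the insertion
      nevtsMap[filename] = n;
  };
  pool.Foreach(workItem, filenames);

The map insertion is cheap compared to opening the file, so the lock contention stays negligible.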


Hi @StephanH,

Thanks for the quick reply and the detailed explanation!

Okay, filling a shared vector with Foreach seems straightforward. I’ll give that a try!

Yes, that’s why I tried switching from Map to MapReduce… It would make sense to me to make the Map overload with the nChunks argument public. :sweat_smile:

Cheers,
Izaak

See this feature request. We will discuss if it makes sense to pick this up.

