RDataFrame entry loss when using multi-threading in version 6.18.00

Dear experts,

I recently moved to the new version 6.18.00 and noticed something strange that didn’t happen in 6.16.00.

ROOT::EnableImplicitMT();
ROOT::RDataFrame df{"NOMINAL", "./merge/3.root"};
df.Snapshot("t", "./test.root", {"jet_pt"});

The number of entries in the new test.root file is different from that in the original 3.root file. This only happens in version 6.18.00 (not in 6.16.00), only when implicit MT is turned on, and not for all files. Which files are affected also seems to differ between my Linux desktop and my Mac laptop.

I’m sorry that I can’t provide a reproducer here. That file is still in use and I’m not sure if I can make it public.

Thanks!

Best,
Kevin


ROOT Version: 6.18.00
Platform: OsX 10.14
Compiler: clang100


Hi,
thank you for the report! I’m not aware of any race condition in Snapshot that might cause this, nor of any changes between 6.16 and 6.18 that might cause a regression of such importance.

It would be very, very useful if you could share reproducer+input file privately with me so I can debug what’s going on.

Cheers,
Enrico

I just shared two of the files with you.

The reproducer is pretty much what’s in the original post. But could you please try both of the files separately? It might be machine/platform dependent. And please let me know if you can’t reproduce the problem.

Thanks!

Hi,
I’m afraid I can’t reproduce the problem.

Here’s my tentative repro (I tried reading one file at a time as well as both in a chain):

#include <ROOT/RDataFrame.hxx>
#include <TFile.h>
#include <TTree.h>
#include <iostream>

int main()
{
   const char *fname1 = "data/1.root";
   const char *fname2 = "data/2.root";

   {
      TFile f1(fname1);
      const auto entries1 = f1.Get<TTree>("NOMINAL")->GetEntries();
      std::cout << "input file 1: " << entries1 << std::endl;
      TFile f2(fname2);
      const auto entries2 = f2.Get<TTree>("NOMINAL")->GetEntries();
      std::cout << "input file 2: " << entries2 << std::endl;
      std::cout << "total: " << entries1 + entries2 << std::endl;
   }

   ROOT::EnableImplicitMT();
   ROOT::RDataFrame df{"NOMINAL", {fname1, fname2}};
   auto c = df.Count();
   df.Snapshot("t", "test.root", {"jet_pt"});
   std::cout << "read: " << *c << std::endl;
 
   {
      TFile out_f("test.root");
      std::cout << "written on file: " << out_f.Get<TTree>("t")->GetEntries() << std::endl;
   }

   return 0;
}

All numbers printed match. I tested ROOT master (debug build) as well as v6.18 on lxplus (sourced with . /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.18.00/x86_64-centos7-gcc48-opt/bin/thisroot.sh).

Any idea? Without a repro, there is no way to debug/fix this.

Cheers,
Enrico

Hi,

I didn’t try this on lxplus before. I just did it and as you said, it can’t be reproduced on lxplus.

However, I ran root under time and noticed that the (user + sys) / real time ratio is only about 15% (<100%), even with IMT on. Locally on my desktop it’s >100%. I don’t know whether this is due to the hard drive or to EOS performance, but I suspect this ratio matters if the problem is a data race.
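
For reference, the ratio mentioned above is just CPU time divided by wall-clock time. A minimal standalone sketch of that arithmetic follows; the function name cpu_utilisation and the sample timings in the note below are made up for illustration and are not measurements from this thread.

```cpp
#include <stdexcept>

// Ratio of CPU time (user + sys) to wall-clock time (real), as reported by `time`.
// A value near 1.0 (100%) means roughly one core was kept busy on average;
// values well above 1.0 indicate several threads working in parallel;
// values well below 1.0 suggest the process mostly waited (e.g. on I/O).
// Hypothetical helper, not part of ROOT.
double cpu_utilisation(double user_s, double sys_s, double real_s)
{
   if (real_s <= 0.0)
      throw std::invalid_argument("real time must be positive");
   return (user_s + sys_s) / real_s;
}
```

For example, cpu_utilisation(1.2, 0.3, 10.0) gives 0.15, the I/O-bound regime seen on lxplus, while cpu_utilisation(30.0, 2.0, 10.0) gives 3.2, i.e. several cores busy at once.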

Although this can be a little annoying, may I suggest you run it locally (preferably on a machine with SSD)? Thank you!

Best,
Kevin

Hi,
It wasn’t clear from my message, but the test on ROOT master was performed on my workstation, using 8 cores and reading from SSD. There shouldn’t be any major differences in behavior between v6.18 and master, but it would be great if you could check that you also see this behavior with ROOT master.

Indeed on lxplus most of the time is spent waiting for EOS.

Cheers,
Enrico

Hi,

I couldn’t build ROOT master (there was an error in the cmake configuration) so instead I built the latest v6-18-00-patches. The problem still exists.

I guess this problem is really machine dependent… (And your workstation is much more powerful than my computer :smiley:)

Best,
Kevin

Alright, the best I can do is build v6-18-00-patches on my laptop and hope I see it :crossed_fingers:

What does my reproducer print when you compile it and run it?

Do you only see the problem on your laptop? How many cores does it have? SSD or spinning disk?

Pinging @Axel and @dpiparo in case they have an idea for reproducing this on our side.

Cheers,
Enrico

This is the result when I run it on my laptop:

input file 1: 612031
input file 2: 1956041
total: 2568072
read: 2557591
written on file: 2557591

(And even when I comment out the Snapshot and only leave the Count there the “read” number is still the same. Sorry that I didn’t realize this before.)
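
To put those numbers in perspective, here is a small standalone sketch (no ROOT needed) that works out how many entries went missing in the output above. The struct and function names are invented for this illustration.

```cpp
#include <cstdint>

// Entry counts copied from the reproducer output above.
// The names below are hypothetical helpers, not ROOT API.
struct EntryCounts {
   std::int64_t file1;
   std::int64_t file2;
   std::int64_t read; // what Count() reported with implicit MT on
};

// Entries expected from the inputs minus entries actually processed.
std::int64_t missing_entries(const EntryCounts &c)
{
   return (c.file1 + c.file2) - c.read;
}

// Missing entries as a fraction of the expected total.
double missing_fraction(const EntryCounts &c)
{
   const auto total = c.file1 + c.file2;
   return total > 0 ? static_cast<double>(missing_entries(c)) / static_cast<double>(total) : 0.0;
}
```

With EntryCounts{612031, 1956041, 2557591}, missing_entries returns 10481, which is about 0.4% of the 2568072 input entries.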

I think I have this problem also with a version I compiled on Mar 20 from heads/master@v6-16-00-rc1-1339-g2ebe61ce28.

Can this be related to the IO optimization?

The problem also occurs on an old desktop (the results might be slightly different, though). I think the CPU is an i5-5257U, so 2 cores and 4 threads. The disk is an SSD. Thanks!

Best,
Kevin

Alright, I see it on my laptop! Scary! Working on this, please ping me again here if you don’t hear back in a week or so.

Cheers,
Enrico

Thank you!

I also just want to mention briefly, for future reference, that I posted a related issue about a warning seen with the same input file.

Hi,
so with some digging it turned out this was a critical bug in TTreeProcessorMT; here is the relevant Jira ticket I opened.

This PR should fix it. It will be included in the next v6.18 patch release.

Thank you very much for reporting :smile:

I remain at your disposal for any further questions. The warnings are not linked to this bug, and in fact I did not see any warnings when running on the data you shared.

The fix was merged into v6-18-00-patches and master.

You can test it by building ROOT or, from tomorrow, by using the nightly builds of ROOT available on cvmfs, e.g. source /cvmfs/sft.cern.ch/lcg/views/dev3/Fri/x86_64-centos7-gcc8-opt/setup.sh (only from tomorrow, because the current Friday nightly is still the one from last week).

Cheers,
Enrico

You can also try right now on Linux with the prefix stack I use to test ROOT (not an official service). It has ROOT built nightly and deployed to CVMFS. Works on any Linux distribution with kernel 2.6.32 and above:

$ /cvmfs/sft.cern.ch/lcg/contrib/gentoo/startprefix 
Entering Gentoo Prefix /cvmfs/sft.cern.ch/lcg/contrib/gentoo/linux/x86_64
$ root --version
ROOT Version: 6.19/01
Built for linuxx8664gcc on Jul 11 2019, 00:01:00
From git-r3/HEAD@v6-19-01-427-g0d60f6de6b

It’s great to know it has been fixed! Thank you! I’ll test it probably this weekend.

I read the Jira ticket, and it seems that this bug only affects v6.18.00?

And for that warning issue, @eguiraud could you please try this on lxplus and see if you can reproduce the warning? It works in my case. Thank you!

#include "ROOT/RDataFrame.hxx"

void test4() {
  // merge2 folder contains 1.root and 2.root
  ROOT::RDataFrame df{"NOMINAL", "./merge2/*.root"};
  df.Snapshot("t", "./test4.root", {"jet_pt"});
}

Best,
Kevin

Or grab nightly binaries from https://root.cern/download/nightly/?C=M;O=A (so many options :slight_smile:)

The bug existed in master and v6.18; it has been fixed in both.
