Process stuck when using XROOTD + Streaming from the grid + RDataFrame

Hi,

The code below:

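import ROOT

# This import, combined with the remote file and RDataFrame below, is what hangs
from XRootD import client

# File streamed from the grid over the root:// protocol
filename = 'root://xrootd.echo.stfc.ac.uk/lhcb:user/lhcb/user/a/acampove/GangaFiles_11.18_Wednesday_27_October_2021/2011_skimmed.root'

df = ROOT.RDataFrame('gen/truth', filename)
df = df.Define('x', 'B_plus_PX')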

is a minimal reproducer. It reads data from the grid, so you might need a grid certificate to run it. It seems to get stuck when this particular combination is used:

  • The XRootD import line above.
  • Data in the grid.
  • RDataFrame.

This caused our grid jobs to hang. When a job is stuck for around an hour (i.e. the process waits in status S), it gets killed and its log files are not saved. That made this particular problem extremely time-consuming to track down.
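
One possible way to keep a log from a job that hangs like this is to run the payload in a child process with a timeout shorter than the grid limit. The following is only a sketch using the Python standard library; the function name run_with_timeout, the timeout values and the log paths are assumptions made for illustration and are not part of our actual job setup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, log_path):
    """Run cmd, write its output to log_path, return True if it finished in time.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        output, finished = result.stdout, True
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; exc.output holds
        # whatever was captured up to that point (it may be None).
        output, finished = exc.output or b"", False

    # Write the log ourselves, so it exists even if the payload hung.
    with open(log_path, "wb") as log:
        log.write(output)
        if not finished:
            log.write(f"\n[watchdog] timed out after {timeout_s} s\n".encode())
    return finished


if __name__ == "__main__":
    # Stand-in payload; a real job would run the analysis script instead.
    ok = run_with_timeout([sys.executable, "-c", "print('done')"], 30, "job.log")
    print("finished in time:", ok)
```

With something like this, the wrapper rather than the grid decides when to give up, so the partial output is already on disk when the job ends.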

Now that we know the source of the problem, we will just drop the XRootD import line, so a solution is not urgently needed. In any case, we thought we would let you know.

Cheers.



ROOT Version: 6.24/06
Platform: x86_64-centos7-gcc10-opt
Compiler: gcc10


Or just set up LCG_101 with x86_64-centos7-gcc10-opt.

Hi @rooter_03 ,
thank you for the report. This is not a known issue.
If I understand correctly, removing the "from XRootD import client" line is a workaround? That’s very surprising.

Also, do you need RDataFrame, or is TFile enough to reproduce the issue? RDataFrame does not do anything special: it uses TChain and TFile to access the files under the hood. For example, do these work or hang?

import ROOT
from XRootD import client

filename = 'root://xrootd.echo.stfc.ac.uk/lhcb:user/lhcb/user/a/acampove/GangaFiles_11.18_Wednesday_27_October_2021/2011_skimmed.root'

f = ROOT.TFile.Open(filename)
t = f.Get('gen/truth')
print(t.GetEntries())

or

import ROOT
from XRootD import client

filename = 'root://xrootd.echo.stfc.ac.uk/lhcb:user/lhcb/user/a/acampove/GangaFiles_11.18_Wednesday_27_October_2021/2011_skimmed.root'

c = ROOT.TChain('gen/truth')
c.Add(filename)
print(c.GetEntries())

Cheers,
Enrico

Hi @eguiraud ,

Surprisingly, TFile is enough to reproduce the issue, as long as you do not call TFile::Close. If this function is called, the problem does not happen. The problem does happen with TChain too. Maybe it has to do with the file not being closed explicitly.
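
To make sure Close is always called, even when an exception is raised in between, a small context manager could be used. This is a minimal sketch: closing_root_file is a hypothetical helper name, and we have not checked that it avoids the hang in every case. It is written against any object with a Close() method so that it can be tried without ROOT installed.

```python
from contextlib import contextmanager


@contextmanager
def closing_root_file(open_file, path):
    """Open a file with open_file(path) and always call Close() on it.

    Hypothetical helper: open_file is any callable returning an object
    with a Close() method, for example ROOT.TFile.Open.
    """
    f = open_file(path)
    try:
        yield f
    finally:
        # The C++ method is exposed as Close(), not close(), so
        # contextlib.closing cannot be used directly.
        if f:
            f.Close()


# Intended use with PyROOT (not executed here):
#   import ROOT
#   with closing_root_file(ROOT.TFile.Open, filename) as f:
#       print(f.Get('gen/truth').GetEntries())
```

The explicit helper is used here because this thread is about ROOT 6.24, and the sketch does not assume that TFile itself can be used in a with statement in that version.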

A colleague found out that the problem goes away if the lines:

import ROOT

from XRootD import client

are swapped to:

from XRootD import client

import ROOT
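
Since the import order is easy to change by accident (for example when an import sorter reorders the lines), a check along these lines could flag it early. This is only a sketch: imported_before is a hypothetical helper, and it assumes that the key order of sys.modules reflects the order in which the top-level modules were first imported, which only holds at the Python level and only if nobody has removed and re-added entries.

```python
import sys


def imported_before(first, second, modules=None):
    """Return True if module `first` was imported before module `second`.

    Hypothetical helper. It relies on dicts preserving insertion order,
    so the key order of sys.modules follows first-import order. Returns
    False if either module has not been imported.
    """
    names = list(sys.modules if modules is None else modules)
    if first not in names or second not in names:
        return False
    return names.index(first) < names.index(second)


# Intended use in a job script, after both imports (not executed here):
#   if not imported_before("XRootD", "ROOT"):
#       raise ImportError("import XRootD before ROOT to avoid the hang")
```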

Cheers.

Alright, thanks.
@chrisburr mentioned they already know the cause and will open a bug report about this soon – we’ll keep you posted.

Cheers,
Enrico

EDIT:

here it is: xrootd/xrootd issue #1552 on GitHub, "Deadlock in Python bindings when used with PyROOT".
