Using files with SWAN/Spark

Hi!

I’m currently trying to add a column to a distributed RDF in Spark with values based on a class from a custom shared library, but the class needs some parameters from a file that’s in my EOS space. Opening the file and using the class work perfectly fine in SWAN without the distributed part, and I can use either a relative path to the file or the /eos/user/ path.

The problem comes when using ROOT.RDF.Experimental.Distributed.initialize. The idea is to initialize the parameters for the class first to a global variable (jecl1) that can then be used with the column function (L1_correction()). Here’s an example of the code that’s initialized (classes such as JetCorrectorParameters are distributed with df._headnode.backend.distribute_header() and df._headnode.backend.distribute_shared_libraries() where df is the Spark RDF):

```python
def init():
    dist_code = """ 
#ifndef CORRECTIONS_C
#define CORRECTIONS_C
JetCorrectorParameters *l1;
FactorizedJetCorrector *jecl1;
vector<JetCorrectorParameters> v1;

void initCorrections() {
    const char *s1 = "/eos/user/n/ntoikka/SWAN_projects/corrections/Summer19UL18_V5_MC/Summer19UL18_V5_MC_L1FastJet_AK4PFchs.txt";
    l1 = new JetCorrectorParameters(s1);
    v1.push_back(*l1);
    jecl1 = new FactorizedJetCorrector(v1);
    
}

// L1 corrections
ROOT::RVec<double> L1_correction(ROOT::RVec<double> pT, ROOT::RVec<double> eta, ROOT::RVec<double> area, double rho) {
    ROOT::RVec<double> correction(pT.size());

    for (unsigned int i = 0; i < pT.size(); i++) {
        jecl1->setJetEta(eta[i]);
        jecl1->setJetPt(pT[i]);
        jecl1->setRho(rho);
        jecl1->setJetA(area[i]);
        correction[i] = jecl1->getCorrection();
    }

    return correction;
}

#endif
"""

    ROOT.gInterpreter.Declare(dist_code)
    ROOT.initCorrections()

initialize(init)
```

Running this as a cell in SWAN works, but when I do df1 = df.Define("L1correction", "L1_correction(Jet_pt, Jet_eta, Jet_area, Jet_rho)") and run it, the distributed executors can’t find the file s1. The problem there is somewhat clear: in the variable s1 I should use the complete path with root://eosuser.cern.ch/ before the /eos/ part, but if I do so the cell doesn’t run locally in SWAN as the non-distributed part doesn’t recognize that path and cannot open the file. How can I either get SWAN to recognize the root://eosuser.cern.ch/ path or get the executors to use just the /eos/ path?

An alternative solution is to just move the code from initCorrections() into L1_correction() and use the complete path, but this runs quite slowly, as the file is then opened unnecessarily many times.

Thanks :slight_smile:


ROOT Version: 6.27
Platform: SWAN K8s
Compiler: gcc11


Maybe @etejedor can help

Hi @toicca ,
Thanks for reaching out!

in the variable s1 I should use the complete path with root://eosuser.cern.ch/ before the /eos/ part, but if I do so the cell doesn’t run locally in SWAN as the non-distributed part doesn’t recognize that path and cannot open the file.

This is very surprising to me. The SWAN session should have your credentials set, so that opening files via xrootd (i.e. with the root:// prefix) should work. In general, this is my recommendation also for the functions that you declare via ROOT.gInterpreter.Declare inside the workers.

One piece of missing information is the implementation of your class JetCorrectorParameters and how it opens the txt file, which might give us better insight into why it cannot properly use xrootd.

Cheers,
Vincenzo

Thanks for the response!

Yes, the problem seems to be with SWAN and the root:// prefix. Some things work with it (often RDF related), such as

import fnmatch
from os import listdir

files = list(fnmatch.filter(listdir(
    "../../../data/20UL18JMENano_106X_upgrade2018_realistic_v16_L1v1-v1/30000"), "*.root"))

chain = ROOT.TChain("Events")
for file in files:
    chain.Add("root://eosuser.cern.ch//eos/user/n/ntoikka/data/20UL18JMENano_106X_upgrade2018_realistic_v16_L1v1-v1/30000/"+file)

df = RDataFrame(chain, sparkcontext=sc, npartitions=128)

but if I try to use "root://eosuser.cern.ch//eos/user/n/ntoikka/data/20UL18JMENano_106X_upgrade2018_realistic_v16_L1v1-v1/30000" in the files definition, Python gives me “No such file or directory”. I can also use the custom classes in PyROOT, for example

s = "../Summer19UL18_V5_MC/Summer19UL18_V5_MC_L1FastJet_AK4PFchs.txt"
jetCorr = ROOT.JetCorrectorParameters(s)
v = ROOT.std.vector["JetCorrectorParameters"]()
v.push_back(jetCorr)
l1 = ROOT.FactorizedJetCorrector(v)

works fine (with the relative or /eos/ path, but not with the prefix), so I don’t think that the problem is with the classes.

My SWAN setup is pretty standard with Bleeding Edge software stack. I give some more memory and overhead to the executors than default and redefine spark.executorEnv.ROOT_INCLUDE_PATH as /cvmfs/sft-nightlies.cern.ch/lcg/views/devswan/Thu/x86_64-centos7-gcc11-opt/include/Geant4:/cvmfs/sft.cern.ch/lcg/releases/jsonmcpp/3.10.5-f26c3/x86_64-centos7-gcc11-opt/include:/cvmfs/sft-nightlies.cern.ch/lcg/views/devswan/Thu/x86_64-centos7-gcc11-opt/src/cpp:/cvmfs/sft-nightlies.cern.ch/lcg/views/devswan/Thu/x86_64-centos7-gcc11-opt/include:/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc11-opt/include/python3.9:/cvmfs/sft-nightlies.cern.ch/lcg/latest/R/4.1.2-f9ee4/x86_64-centos7-gcc11-opt/lib64/R/include:/cvmfs/sft-nightlies.cern.ch/lcg/latest/R/4.1.2-f9ee4/x86_64-centos7-gcc11-opt/lib64/R/library/RInside/include:/cvmfs/sft-nightlies.cern.ch/lcg/latest/R/4.1.2-f9ee4/x86_64-centos7-gcc11-opt/lib64/R/library/Rcpp/include, which is to just make everything work (got it from a previous ticket). Could spark.executorEnv.ROOT_INCLUDE_PATH be the problem?

Dear @toicca ,
I think there might be some misunderstanding about the nature of files on EOS. The files that are stored on EOS are on some remote server, which is neither the machine where your SWAN session is running nor any of the workers of the Spark cluster you are connecting to.

When you read a file with an absolute path like /eos/user/..., you are implicitly making use of a feature of EOS, that is the exposure of a POSIX-like interface that mimics a local filesystem. So, when using Python modules that deal with OS files like os or fnmatch, those will be able to see the files under your user namespace because implicitly they are using the POSIX layer provided by EOS. But again, in reality those files are not actually on the machine from where you are running the Python script.

When you write a path to a file on EOS with the root:// prefix, you are accessing the file with the xrootd protocol. Standard Python libraries like os and fnmatch do not support the xrootd protocol, thus the errors you see.
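To make the split between the two access modes concrete, here is a minimal sketch of how one might keep using os or fnmatch on the POSIX side and only build root:// URLs for the parts that ROOT opens. The helper names `to_xrootd_url` and `root_file_urls` are hypothetical, and the server name eosuser.cern.ch is assumed from the paths used earlier in this thread.

```python
import fnmatch
import posixpath

# Assumption: /eos/user is served by eosuser.cern.ch, as in the paths in this thread.
EOS_USER_SERVER = "eosuser.cern.ch"


def to_xrootd_url(posix_path, server=EOS_USER_SERVER):
    """Turn an absolute /eos/... path into a root:// URL (hypothetical helper)."""
    if not posix_path.startswith("/eos/"):
        raise ValueError(f"not an absolute EOS path: {posix_path!r}")
    # posix_path already starts with "/", so the result has a double slash after the host.
    return f"root://{server}/{posix_path}"


def root_file_urls(eos_dir, names, server=EOS_USER_SERVER):
    """Keep only the *.root names and build one root:// URL per file."""
    selected = sorted(fnmatch.filter(names, "*.root"))
    return [to_xrootd_url(posixpath.join(eos_dir, name), server) for name in selected]
```

The `names` argument would typically come from `os.listdir(eos_dir)`, which goes through the POSIX layer, while the returned URLs are what you would pass to `chain.Add(...)`.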

When you develop your own class, either in Python or in C++, if you want to be able to open files with the root:// prefix, you need to develop the I/O part of the class in such a way that it supports the protocol. Refer to the xrootd documentation for more info about this.

ROOT natively supports any kind of I/O of ROOT files via the xrootd protocol. This is the reason why doing something like

chain = ROOT.TChain()
chain.Add("root://eosuser.cern.ch//eos/..../myfile.root")

will work. Implicitly the TChain will open a TFile that has an I/O layer capable of reading the file via xrootd.

You should also know that, as far as reading any file from EOS goes, using the root:// prefix, and thus the xrootd protocol, gives better performance than reading the files through the POSIX interface.

Bottom line, I suggest you always use the root:// prefix when reading files from EOS. In your case, you have a .txt file which is not a ROOT file. Probably the easiest thing you can do is download the file through xrootd, via something like xrdcp, inside the initialization function. Thereafter, you will have the .txt file in the working directory and you should be able to use it with your custom class.
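As a rough sketch of that xrdcp step, something along these lines could be called at the start of the initialization function. The helper names `xrdcp_command` and `fetch_to_cwd` are hypothetical, and the sketch assumes that the xrdcp executable and valid credentials are available wherever the function runs, which I have not verified for the Spark workers.

```python
import posixpath
import subprocess


def xrdcp_command(url, dest):
    """Build the xrdcp call; -f overwrites an existing destination file."""
    return ["xrdcp", "-f", url, dest]


def fetch_to_cwd(url, run=subprocess.run):
    """Copy a remote file into the current working directory and return its local name.

    `run` is injectable only so the function can be exercised without a real xrdcp.
    """
    local_name = posixpath.basename(url)
    # check=True makes subprocess.run raise CalledProcessError if xrdcp fails.
    run(xrdcp_command(url, local_name), check=True)
    return local_name
```

Inside init(), you would call `fetch_to_cwd("root://eosuser.cern.ch//eos/user/n/ntoikka/SWAN_projects/corrections/Summer19UL18_V5_MC/Summer19UL18_V5_MC_L1FastJet_AK4PFchs.txt")` before `ROOT.gInterpreter.Declare` and use the returned local file name for s1 instead of the /eos/ path.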

Let me know if anything isn’t clear.
Cheers,
Vincenzo


Thank you! This was very helpful and cleared things up. I’ll have to look into xrootd.
