Dear Marta,
Thank you for looking into the issue and finding a solution [*].
The df._headnode.backend.distribute_unique_paths call uploaded the file to the worker nodes, and I could verify it in the /tmp/spark-…/userFiles… directory.
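For the record, this is roughly how I checked on the executors (just a minimal sketch, assuming sc is the SparkContext that SWAN provides):

def list_user_files(_):
    # runs on an executor; getRootDirectory() resolves to the
    # /tmp/spark-.../userFiles-... scratch area on that node
    import os
    from pyspark import SparkFiles
    d = SparkFiles.getRootDirectory()
    return [(d, os.listdir(d))]

print(sc.parallelize(range(1), 1).mapPartitions(list_user_files).collect())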
And with the order of commands you suggested (construct the RDF, upload, then call init) it worked, at least with the simple .h file containing the CountCharacters function.
With the real-physics .h, it worked with ProcessLine instead of Declare (otherwise I run into the vector problem reported in [**]).
[*] Hrare/analysis/config/functions.h at main · mariadalfonso/Hrare · GitHub
[**] RVec and std::vector in ROOT 6.30/04
I will try Spark on k8s and SWAN-Dask as well.
Maria
P.S. Below I paste my notebook snippet again for reference:
def initSpark():
    from pathlib import Path
    from pyspark import SparkFiles
    import ROOT  # make sure ROOT is available when this runs on the executors
    print('loadUserCode.h')
    # files shipped via distribute_unique_paths land in this scratch directory
    localdir = SparkFiles.getRootDirectory()
    lib_path = Path(localdir) / "functions.h"
    # lib_path = Path(localdir) / "myLibrary.h"
    # ROOT.gInterpreter.Declare(f'#include "{lib_path}"')
    ROOT.gInterpreter.ProcessLine(f'#include "{lib_path}"')
def makeRDF(files):
    .....
    elif AF=="cern-spark":
        df = RDataFrame("Events", files, sparkcontext=sc, npartitions=NPARTITIONS)  # at CERN
        # upload the header to the workers' scratch area
        df._headnode.backend.distribute_unique_paths(
            [
                "/eos/user/d/dalfonso/SWAN_projects/Hrare/JULY_exp/config/functions.h",
                # "/eos/user/d/dalfonso/SWAN_projects/Hrare/JULY_exp/myLibrary.h",
            ]
        )
        sc.addPyFile("/eos/user/d/dalfonso/SWAN_projects/Hrare/JULY_exp/utilsAna.py")
        print(sc.environment)
        # register initSpark to run on the workers before the event loop
        ROOT.RDF.Experimental.Distributed.initialize(initSpark)
    ....
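And then any action on the RDF triggers the distributed event loop, with the registered initSpark run on the workers beforehand, e.g. roughly (the histogram/branch names here are just placeholders):

df = makeRDF(files)
h = df.Histo1D(("h", "h", 50, 0, 10), "nMuon")  # "nMuon" is a placeholder branch
h.Draw()  # accessing the result triggers the computation on the cluster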