Dear Marta,
Thank you for looking into the issue and finding a solution [*].
The df._headnode.backend.distribute_unique_paths call uploaded the file to the worker nodes, and I could verify it in the /tmp/spark-…/userFiles… directory.
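For the record, this is roughly how I checked on the executors (just a minimal sketch, assuming sc is the SparkContext that SWAN provides):

def list_user_files(_):
    # runs on an executor; getRootDirectory() resolves to the
    # /tmp/spark-.../userFiles-... scratch area on that node
    import os
    from pyspark import SparkFiles
    d = SparkFiles.getRootDirectory()
    return [(d, os.listdir(d))]

print(sc.parallelize(range(1), 1).mapPartitions(list_user_files).collect())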
And with the order of commands you suggested (construct the RDF, upload, then call init) it worked, at least with the simple .h file containing the CountCharacters function.
With the real-physics .h, it worked with ProcessLine instead of Declare (otherwise I run into the vector problem reported in [**]).
[*] Hrare/analysis/config/functions.h at main · mariadalfonso/Hrare · GitHub
[**] RVec and std::vector in ROOT 6.30/04
I will try Spark on k8s and SWAN-Dask as well.
Maria
P.S. Below I paste my notebook snippet again for reference:
def initSpark():
    from pathlib import Path
    from pyspark import SparkFiles
    import ROOT  # make sure ROOT is available when this runs on the executors
    print('loadUserCode.h')
    # files shipped via distribute_unique_paths land in this scratch directory
    localdir = SparkFiles.getRootDirectory()
    lib_path = Path(localdir) / "functions.h"
    # lib_path = Path(localdir) / "myLibrary.h"
    # ROOT.gInterpreter.Declare(f'#include "{lib_path}"')
    ROOT.gInterpreter.ProcessLine(f'#include "{lib_path}"')
def makeRDF(files):
    .....
    elif AF=="cern-spark":
        df = RDataFrame("Events", files, sparkcontext=sc, npartitions=NPARTITIONS)  # at CERN
        # upload the header to the workers' scratch area
        df._headnode.backend.distribute_unique_paths(
            [
                "/eos/user/d/dalfonso/SWAN_projects/Hrare/JULY_exp/config/functions.h",
                # "/eos/user/d/dalfonso/SWAN_projects/Hrare/JULY_exp/myLibrary.h",
            ]
        )
        sc.addPyFile("/eos/user/d/dalfonso/SWAN_projects/Hrare/JULY_exp/utilsAna.py")
        print(sc.environment)
        # register initSpark to run on the workers before the event loop
        ROOT.RDF.Experimental.Distributed.initialize(initSpark)
    ....
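And then any action on the RDF triggers the distributed event loop, with the registered initSpark run on the workers beforehand, e.g. roughly (the histogram/branch names here are just placeholders):

df = makeRDF(files)
h = df.Histo1D(("h", "h", 50, 0, 10), "nMuon")  # "nMuon" is a placeholder branch
h.Draw()  # accessing the result triggers the computation on the cluster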