RDataFrame is going distributed!

Hi,
The experimental distributed RDataFrame Python module aims at making it easy to take an existing RDataFrame application and distribute it to a cluster. The module has interactive workflows in mind (think about running an entire analysis from a Jupyter notebook without needing to wait in a queue or processing partial results from the distributed tasks in a separate step). This kind of workflow is quite common in industry, Apache Spark and Dask are prime examples of software tools in this context. Naturally, it is not completely aligned with the job-queue system proposed by HTCondor or Slurm. While Apache Spark completely neglects that model, Dask tries to be a bit more flexible with its dask-jobqueue module which was mentioned in a previous comment.

That being said, distributed RDataFrame is designed with modularity in mind, so that we can support running on Spark, Dask or really any computing framework, as long as we can use its concrete API to call our abstractions (i.e. running the RDataFrame computation graph on a range of entries from the original dataset, then merging the partial results).
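To make the abstraction concrete, here is a minimal, self-contained sketch of the idea in plain Python (not the actual ROOT API): split the dataset into contiguous entry ranges, run the same "computation graph" on each range in parallel, and merge the partial results. The names `split_ranges` and `run_graph` are hypothetical illustrations; the real module would hand these tasks to Spark, Dask, or another backend instead of a local process pool.

```python
from concurrent.futures import ProcessPoolExecutor

def split_ranges(n_entries, n_tasks):
    """Divide [0, n_entries) into roughly equal contiguous entry ranges."""
    step, rem = divmod(n_entries, n_tasks)
    ranges, start = [], 0
    for i in range(n_tasks):
        stop = start + step + (1 if i < rem else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def run_graph(entry_range):
    """Stand-in for the RDataFrame computation graph: a filter + count
    executed on one entry range, producing a partial result."""
    start, stop = entry_range
    return sum(1 for entry in range(start, stop) if entry % 2 == 0)

if __name__ == "__main__":
    # Map the graph over the ranges, then merge (here: sum) the partials.
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_counts = list(pool.map(run_graph, split_ranges(1000, 4)))
    print(sum(partial_counts))
```

Any backend that can map a function over a list of ranges and gather the results can slot into this pattern, which is why the design stays framework-agnostic.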

We already plan to test this new module with Dask + HTCondor, so stay tuned for that :smiley:

Cheers,
Vincenzo
