RDataFrame is going distributed!

Yes! Totally! That’s the goal: keep the analysis code exactly the same and switch between single-threaded, multi-threaded, and distributed execution with minimal configuration boilerplate. RDataFrame is the ideal ROOT interface for this because it is so high-level that we can transparently change how the event loop is run under the hood.
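For instance, here is a minimal sketch of what that switch looks like (assuming the experimental distRDF Dask backend of recent ROOT releases; the tree/file names and scheduler address are placeholders, and the exact module path may change while the feature is experimental):

```python
import ROOT

# Local execution: single-threaded by default, multi-threaded with one line.
ROOT.EnableImplicitMT()  # optional: run the event loop on all local cores
df = ROOT.RDataFrame("mytree", "myfile.root")

# Distributed execution: only the construction changes, the analysis
# code below stays identical.
# from dask.distributed import Client
# client = Client("scheduler-address:8786")  # placeholder address
# RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
# df = RDataFrame("mytree", "myfile.root", daskclient=client)

# The same analysis code runs in all three execution modes.
h = df.Filter("x > 0").Define("y", "x * x").Histo1D("y")
h.Draw()
```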

To add my two cents to Vincenzo’s excellent reply:

Why Spark at all? Because users with a CERN account have access to a Spark cluster via SWAN.
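The connection there is roughly as follows (a sketch assuming the experimental Spark backend; on SWAN the SparkContext is created for you when you attach the notebook to a cluster, elsewhere you would build it yourself):

```python
import ROOT
from pyspark import SparkContext

# On SWAN this context is provided after connecting to the cluster.
sc = SparkContext.getOrCreate()

RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
df = RDataFrame("mytree", "myfile.root", sparkcontext=sc)

# From here on, the usual RDataFrame analysis code.
print(df.Count().GetValue())
```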

What about HTCondor? As mentioned above, Dask lets you attach to an existing HTCondor cluster.
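In practice that usually goes through dask-jobqueue; a sketch, assuming dask-jobqueue is installed and with placeholder resource values:

```python
import ROOT
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

# Request worker jobs from the HTCondor pool (resource values are placeholders).
cluster = HTCondorCluster(cores=1, memory="2GB", disk="1GB")
cluster.scale(jobs=10)  # submit 10 worker jobs
client = Client(cluster)

# Hand the Dask client to the experimental distributed RDataFrame backend.
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
df = RDataFrame("mytree", "myfile.root", daskclient=client)
print(df.Count().GetValue())
```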

Jupyter notebook vs. shell submission: distRDF supports both. We keep an eye on making interactive workflows pleasant (we expect to see a lot of those in the future), but you can put the exact same code in a Python script and run it as usual.

What about Ganga? We kept interfaces and backends decoupled precisely so that we can address this kind of future extension if there is demand for it.

I hope this clarifies a few points. Thank you very much for the feedback, that’s exactly why we advertise our experimental features! :slight_smile:
