RDataFrame is going distributed!

Hi,

Thanks for your reply. The problem I normally have (and, given that I have been doing physics analyses for years, it is probably a problem that many others in my situation have) is moving from a piece of code that does the job locally on a laptop to something that scales up to a cluster.

Ideally, we would do something like:

ROOT.EnableCluster('Condor', queue='short')
df.Filter("var>3")

and the analysis would run on 30 machines in the computing cluster without us having to do anything extra. In practice we have to take care of all of the following ourselves:

  1. Set up the environment.
  2. Where the script to run is and whether to copy input files from one place to another.
  3. The names of the log files where the output should go.
  4. Memory requirements, CPU requirements, etc.
  5. Which machine should process which part of the data. E.g. I have 1000 files. Each machine should take ~30 files on average, but the files have different sizes, so some machines would take 100 files and others only 5. I have to figure out how to split things.
  6. Resubmission. If some files were not processed, I need to figure out which ones and resubmit only for those files.
  7. Checking outputs. Make sure that the events after the processing are the same as the ones in the input. If a selection was involved we need to check if the efficiencies make sense.
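
To make point 5 concrete, this is roughly the kind of bookkeeping I mean. It is only a minimal sketch of one possible approach (a greedy "largest file first" split), not something ROOT or Condor provides. The function name `split_by_size` and the file names and sizes are made up for illustration.

```python
import heapq

def split_by_size(file_sizes, n_jobs):
    """Greedy split: give each file (largest first) to the job with the smallest total so far."""
    # Heap entries are (total_size_assigned, job_index); the lightest job is always on top.
    heap = [(0, j) for j in range(n_jobs)]
    heapq.heapify(heap)
    jobs = [[] for _ in range(n_jobs)]
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, j = heapq.heappop(heap)
        jobs[j].append(name)
        heapq.heappush(heap, (total + size, j))
    return jobs

# Hypothetical input: file name -> size (any unit, as long as it is consistent).
sizes = {"a.root": 100, "b.root": 90, "c.root": 20,
         "d.root": 10, "e.root": 10, "f.root": 10}
jobs = split_by_size(sizes, n_jobs=2)
loads = [sum(sizes[f] for f in job) for job in jobs]
print(jobs, loads)
```

The greedy split is not guaranteed to be optimal, but it is usually good enough to avoid one job getting all the big files. And this is exactly the sort of code I would rather not have to write myself.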

On top of that:

A. Doing this correctly requires a lot of code, time and effort, and that time is ours.
B. We mostly process stuff in computing clusters through the shell. If we process stuff through a web browser in a Jupyter notebook, I assume that more software will need to be installed by whoever manages the computing cluster in our institutes. That person might not be happy to put in all the extra work to get that software in place.
C. Working in a shell rather than in a browser GUI has a lot of advantages. Although it might seem daunting at the beginning, once you learn the commands and get through the learning curve, shells provide a lot of flexibility. And although a GUI looks pretty, I would rather use a tool that gets the job done and saves me time.
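
Going back to point 6 in the first list (resubmission), this is the sort of helper each of us ends up writing by hand. It is a minimal sketch under my own assumptions: every input file is expected to produce one output file named after it in a single output directory. The function name `files_to_resubmit` and the `_out.root` suffix are hypothetical.

```python
from pathlib import Path

def files_to_resubmit(input_files, output_dir, suffix="_out.root"):
    """Return the inputs whose expected output file is missing or empty."""
    output_dir = Path(output_dir)
    missing = []
    for inp in input_files:
        # Assumed naming convention: data_001.root -> data_001_out.root
        expected = output_dir / (Path(inp).stem + suffix)
        if not expected.is_file() or expected.stat().st_size == 0:
            missing.append(inp)
    return missing
```

A real check would also have to open each output and compare event counts (point 7), which depends on the analysis and is not shown here.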

Finally, remember this. Our goal (yours too) is to do work that has an impact. If you write code that we do not need/cannot use, your code will be forgotten and your work will be wasted.

Cheers.
