Re: Scalability of RDataFrames on 16+ cores

etejedor · August 26, 2021, 1:37pm

Very interesting discussion!

We were aware of PyRDF, but not that it (or something similar) was on its way to becoming part of the official ROOT distribution.

We abandoned the name PyRDF and now just call it distributed RDataFrame, or DistRDF. It’s the same thing, though, and it has made it into ROOT, in experimental mode for now:

It has backends for Spark and Dask; Dask is perhaps the easiest option to set up on a set of VMs in the cloud. There is also a backend for AWS Lambda, but it’s still in “incubation” mode (it hasn’t made it into ROOT yet).
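To give an idea of what the Dask option looks like, here is a minimal sketch. It assumes a ROOT build that ships the experimental distributed module with the Dask backend, plus the `dask.distributed` package. The function name, the worker counts and the default number of partitions are made up for illustration, and since the module is experimental the exact names may differ between ROOT versions.

```python
def make_distributed_df(treename, filenames, npartitions=8):
    """Build a distributed RDataFrame on top of a Dask client (sketch).

    Imports are done lazily so that this module can be loaded on a
    machine that has neither ROOT nor Dask installed.
    """
    import ROOT
    from dask.distributed import Client, LocalCluster

    # Assumption: a local cluster for testing. On a set of cloud VMs you
    # would instead pass the address of a running Dask scheduler to Client.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
    client = Client(cluster)

    # Experimental namespace of distributed RDataFrame (DistRDF).
    RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

    # npartitions controls into how many ranges the input dataset is split.
    df = RDataFrame(treename, filenames, daskclient=client, npartitions=npartitions)
    return df, client
```

From there the usual RDataFrame calls (`Filter`, `Define`, `Histo1D`, ...) are booked on `df` as in the local case, and the computation is sent to the workers when a result is first accessed.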

I do think that comparing on-premise systems with QaaS systems (Query-as-a-Service, such as BigQuery and Athena) yields interesting insights, though. First, it illustrates how easy scale-out is with those systems: the user does not need to do anything.

I agree, there must be a whole team of engineers behind these serverless solutions making sure that the right decisions are taken to scale your queries. It’s a little bit as if we created an RDataFrame cloud service where you just give us the main program and the dataset you’d like to process, and we decide how many resources it runs on and how many partitions of the input dataset to create (which is highly non-trivial).
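Just to illustrate what “creating partitions” means at the simplest level, here is a toy sketch that splits a number of entries into contiguous, near-equal ranges. This is not the actual DistRDF logic, and the function name is made up. The hard part is precisely what it leaves out: choosing the number of partitions and taking file and cluster boundaries into account.

```python
def split_entries(n_entries, n_partitions):
    """Split [0, n_entries) into contiguous (begin, end) ranges of near-equal size."""
    if n_entries < 0 or n_partitions <= 0:
        raise ValueError("need n_entries >= 0 and n_partitions > 0")
    if n_entries == 0:
        return []

    # Never create more partitions than there are entries.
    n_partitions = min(n_partitions, n_entries)
    base, remainder = divmod(n_entries, n_partitions)

    ranges = []
    begin = 0
    for i in range(n_partitions):
        # The first `remainder` partitions get one extra entry.
        size = base + 1 if i < remainder else base
        ranges.append((begin, begin + size))
        begin += size
    return ranges
```

For instance, `split_entries(10, 3)` gives `[(0, 4), (4, 7), (7, 10)]`. Each range can then be handed to one task.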

The other one is that a QaaS system has the potential to be cheaper than a self-hosted solution, at least if the self-hosted resources are not shared with other users. With the QaaS pricing model, you pay exactly for the resources you are using.

I believe it really depends on the use case. If you use them extensively, these fully managed products can become quite expensive in the end, and you might be better off with your own cluster of VMs that you manage yourself. You can set up auto-scaling rules for those, so that you don’t pay for unused resources all the time anyway.
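A back-of-the-envelope way to see where the break-even lies is sketched below. All prices and sizes are hypothetical placeholders, not actual provider prices, and the model is deliberately crude: pay per TB scanned on the QaaS side, pay per VM-hour on the self-hosted side.

```python
def qaas_monthly_cost(queries_per_month, tb_scanned_per_query, price_per_tb):
    """Pay-per-query model: cost grows linearly with the data scanned."""
    return queries_per_month * tb_scanned_per_query * price_per_tb


def cluster_monthly_cost(n_vms, price_per_vm_hour, hours_on_per_month):
    """Self-managed cluster: cost depends on how long the VMs are up.

    With auto-scaling, hours_on_per_month is less than the full month.
    """
    return n_vms * price_per_vm_hour * hours_on_per_month


def break_even_queries(n_vms, price_per_vm_hour, hours_on_per_month,
                       tb_scanned_per_query, price_per_tb):
    """Number of queries per month above which the cluster is cheaper."""
    cluster = cluster_monthly_cost(n_vms, price_per_vm_hour, hours_on_per_month)
    return cluster / (tb_scanned_per_query * price_per_tb)


# Hypothetical numbers: 4 VMs at 0.5 per hour, up 200 hours per month,
# versus queries scanning 1 TB each at 5.0 per TB.
threshold = break_even_queries(4, 0.5, 200, 1.0, 5.0)
print(threshold)  # 80.0
```

With these made-up numbers the cluster costs 400 per month, so it wins as soon as you run more than 80 such queries per month. Below that, pay-per-query is cheaper.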

Of course, there are practical reasons that may speak against BigQuery and Athena, but they may illustrate the benefits that a self-built QaaS solution in a private cloud could have…

I didn’t find any reference to AWS Redshift in the paper, so I was wondering whether you tested it. AWS Athena is mostly meant for short exploratory work with rather lightweight queries (e.g. log analysis). Redshift, on the other hand, is the OLAP product (columnar store) of AWS, and it is potentially much faster. It’s not serverless, true, but it might be worth trying too, unless some limitation rules it out.