This is a quick follow-up on the discussion on Scalability of RDataFrames on 16+ cores. Since then, we have rerun many of the experiments and added some new ones. In particular, we have rerun all experiments with ROOT, this time with version 6.24.02. With that version, thanks to your fix, the scalability problems that we had run into originally are completely gone. In fact, (strong) scalability is almost ideal, at least down to a low single-digit number of seconds (see Figure 1). This makes it clearly the fastest self-managed system by quite a margin. Only on very large data sets (see Figure 2), the cloud-based systems, which are scale-out systems able to use many more resources, are able to get lower running times.
We also added the RDataFrames interface to the comparison of query languages/query interfaces. We are aware that there are very good reasons for the current design, but it may still be interesting for you to see how some of the typical HEP query patterns are expressed in general-purpose query languages and how we think that compares to RDataFrames.
The updated version of the study is now available on arXiv. Any thoughts or feedback is highly appreciated.
Thank you for letting us know about these results, and congrats to @eguiraud for his work on optimizing the performance of RDF!
Only on very large data sets (see Figure 2), the cloud-based systems, which are scale-out systems able to use many more resources, are able to get lower running times.
I understand the results here cannot be directly compared because the tests for the different approaches are running on different HW (which is normal since you are comparing multiple clouds). Also in serverless products such as AWS Athena you don’t really know on which resources your query is running. Nevertheless, I think the value of those results is to see how far these systems can go in terms of performance and scalability for large datasets. It would be nice to compare one day with distributed RDataFrame (now experimental)!
This is great input! We were aware of PyRDF but not about the fact that it/something similar was on the way to make it part of the official ROOT distribution. Indeed, it would be interesting to extend our analysis to scale-out experiments. All systems from our comparison support that; ROOT was the only one with no official/production-ready support. Our experiments are fully automated (and public) and the driver scripts contain at least stubs for distributed set-ups, so extending them should be relatively easy.
I do think that comparing on-premise systems with QaaS systems (Query-as-a-Service, such as BigQuery and Athena) has interesting insights, though. First, it illustrates how easy scale-out is with those systems – the user does not need to do anything. The other one is that a QaaS system has the potential to be cheaper than a self-hosted solution, at least, if the self-hosted resources are not shared with other users. With the QaaS pricing model, you pay exactly for the resources you are using; this may be more than for the pure execution time on your own resources but, unlike those, idle time is completely free. So if you don’t make very efficient use of your own resources, QaaS may be cheaper overall. Of course, there are practical reasons that may speak against BigQuery and Athena, but they may illustrate the benefits that a self-built QaaS solution in a private cloud could have…
Hi Ingo,
I read the new version of the paper – first of all congratulations and kudos to all authors, it’s an impressive amount of work. And thank you for addressing our comments about typical HEP dataset sizes!
The competitive performance of RDF is especially interesting considering that optimized RDF for those benchmarks is around 3x faster than what you use (but also more verbose and clunky).
However we plan to reduce that gap in the future (jit-compiling code with optimizations that are currently not applied) so RDF times will go further down in 1 or 2 ROOT releases.
Two questions:
about the split of Q6 in Q6a and Q6b: real HEP analyses often produce not 1 or 2, but dozens (up to thousands) of output histograms. One important feature of RDataFrame (or HEP analysis frameworks in general) is that it makes it simple to produce all the histograms in a single pass over the data. The split of Q6 in Q6a and Q6b makes me think that some of the other systems cannot. If their scaling with the number of output histograms (or output data aggregations in general) is much worse than that of RDataFrame that is a major drawback, and something a comparison paper such as this should probably mention
the paper mentions
“For RDataFrames, we put the input files on the local disk, which gives a similar performance as the typically used xrootd network storage protocol”
it’s surprising for me that reading from SSD is as fast as reading over the network, am I missing something, or do you actually expect similar throughput? Especially at high core counts and with the simplest queries I would expect network I/O to choke before SSD I/O does.
We were aware of PyRDF but not about the fact that it/something similar was on the way to make it part of the official ROOT distribution
We abandoned the name of PyRDF and now we just call it distributed RDataFrame or DistRDF But it’s the same thing, and it made it into ROOT in experimental mode for now:
It has backends for Spark and Dask, perhaps Dask could be the easiest option to set up on a set of VMs in the cloud. There is also a backend for AWS Lambda but it’s still in “incubation” mode (didn’t make it yet into ROOT).
I do think that comparing on-premise systems with QaaS systems (Query-as-a-Service, such as BigQuery and Athena) has interesting insights, though. First, it illustrates how easy scale-out is with those systems – the user does not need to do anything.
I agree, there must be a whole team of engineers behind these serverless solutions so that they make the right decisions to scale your queries. It’s a little bit as if we created an RDataFrame cloud service where you just give us the main program and the dataset you’d like to process and we decide on how many resources it will run and how many partitions of the input dataset to create (which is highly non-trivial).
The other one is that a QaaS system has the potential to be cheaper than a self-hosted solution, at least, if the self-hosted resources are not shared with other users. With the QaaS pricing model, you pay exactly for the resources you are using
I believe it really depends on the use case. If you use them extensively, these fully-managed products can become quite expensive in the end, and you might be better off with your own cluster of VMs that you manage - and you can set up auto-scaling rules for those, so that you don’t pay all the time for unused resources anyway.
Of course, there are practical reasons that may speak against BigQuery and Athena, but they may illustrate the benefits that a self-built QaaS solution in a private cloud could have…
I didn’t find in the paper any reference to AWS Redshift so I was wondering whether you tested it? AWS Athena is mostly for short exploratory work, rather lightweight queries (e.g. log analysis). On the other hand, Redshift is the OLAP product (columnar store) of AWS and it is potentially much faster. It’s not serverless, true, but might be worth trying too unless there is some limitation.
This is an interesting point. I was not aware that this number could grow into the thousands. I do believe that many general-purposes are relatively weak in that aspect, though. I expect the solution that most systems offer are temporary tables into which the users manually materializes the result of the expensive computations and then runs “summarization queries” against those.
In the two document-oriented languages from the comparison, JSONiq and SQL++, you can construct complex objects as results. In JSONiq, a joint Q6 could look like this:
let $q6:= (: main computation :)
return {
"histogram1": (: use $q6 to compute histogram :),
"histogram2": (: use $q6 again to compute another one :)
}
I don’t know what RumbleDB or SQL++ would do with something like that but general-purpose query planners can find out to automatically materialize reused results in these situations.
It would definitely be interesting to extend our study, both in terms of expressiveness and performance, to that aspect.
From the specs and pricing of m5, you can see that using 100Gbps networking (m5n) is only marginally more expensive than using SSDs (m5d). These SSDs can sustain 6.6GB/s (see earlier post), whereas the bandwidth from networking can reach at least 8GB/s in practice (see this benchmark using c5n instances, which also have 100Gbps networking). Bandwidth can thus not be the problem. I have not measured what the xrootd protocol could deliver in practice, but as I have described in this post, reading from S3 was only at most 2x slower than reading from local SSDs. The reason why I believe the xrootd protocol could be better than S3 is that S3 does not support the HTTPS Ranges parameter for more than one range, so every basket for every column requires a single request (whereas I believe xrootd allows to retrieve several ranges in one request).
@etejedor: Thanks a lot for elaborating! I agree with your discussion on potential advantages of QaaS vs VMs: it depends on the concrete case. (BTW, BigQuery also has the option to choose flat-rate pricing, which is yet another option that is cheaper for frequent use.)
AWS Redshift is indeed a promising candidate. Unfortunately, we became aware of it only late in the study. Apart from the potential performance benefits that you mention, it has also an interesting query language: PartiQL (what a fitting name for this context!), which is based on SQL++, which in turn fared pretty well in our comparison. So I agree: another direction in which our work could be extended