Re: Scalability of RDataFrames on 16+ cores

ingomueller.net · August 26, 2021, 2:02pm

This is an interesting point. I was not aware that this number could grow into the thousands. I do believe that many general-purpose systems are relatively weak in that respect, though. I expect that the solution most systems offer is temporary tables: users manually materialize the result of the expensive computation into such a table and then run “summarization queries” against it.

In the two document-oriented languages from the comparison, JSONiq and SQL++, you can construct complex objects as results. In JSONiq, a joint Q6 could look like this:

let $q6 := (: main computation :)
return {
    "histogram1": (: use $q6 to compute histogram :),
    "histogram2": (: use $q6 again to compute another one :)
}

I don’t know what RumbleDB or SQL++ would do with something like that, but general-purpose query planners can work out that reused results should be materialized automatically in these situations.
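To make the idea of reusing a materialized result concrete, here is a minimal Python sketch of the same pattern. It is not what RumbleDB or any SQL++ engine does internally; `main_computation`, `histogram`, the event fields, and the bin width are all hypothetical stand-ins I made up for illustration.

```python
from collections import Counter

calls = 0  # counts how often the expensive step runs


def main_computation(events):
    """Hypothetical stand-in for the expensive part of Q6."""
    global calls
    calls += 1
    return [(e["pt"], e["mass"]) for e in events if e["pt"] > 10.0]


def histogram(values, bin_width):
    """Count values per bin, keyed by the lower bin edge."""
    return Counter(bin_width * int(v // bin_width) for v in values)


# Made-up toy events, only there to exercise the code.
events = [
    {"pt": 12.0, "mass": 91.0},
    {"pt": 8.0, "mass": 85.0},
    {"pt": 25.0, "mass": 93.0},
    {"pt": 31.0, "mass": 78.0},
]

q6 = main_computation(events)  # materialized once, like `let $q6 := ...`
result = {
    "histogram1": histogram([pt for pt, _ in q6], 10.0),
    "histogram2": histogram([mass for _, mass in q6], 10.0),
}
```

Both histograms are derived from `q6`, so the expensive step runs only once. A query planner that detects the reuse would in effect do the same thing without the user having to ask for it.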

It would definitely be interesting to extend our study, both in terms of expressiveness and performance, to that aspect.

From the specs and pricing of m5, you can see that using 100Gbps networking (m5n) is only marginally more expensive than using SSDs (m5d). These SSDs can sustain 6.6GB/s (see earlier post), whereas the network bandwidth can reach at least 8GB/s in practice (see this benchmark using c5n instances, which also have 100Gbps networking). Bandwidth thus cannot be the problem. I have not measured what the xrootd protocol could deliver in practice, but as I have described in this post, reading from S3 was at most 2x slower than reading from local SSDs. The reason why I believe the xrootd protocol could be better than S3 is that S3 does not support more than one range in the HTTP Range header, so every basket of every column requires its own request (whereas I believe xrootd allows retrieving several ranges in one request).
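As a rough illustration of why the number of ranges per request matters, the following Python sketch counts the requests needed to fetch a set of baskets. The basket layout and the batch size of 8 are assumptions made up for this example and are not measured values or actual xrootd limits; `range_header` and `plan_requests` are hypothetical helpers, not part of any library.

```python
def range_header(ranges):
    """Format (offset, length) pairs as an HTTP Range header value."""
    return "bytes=" + ",".join(
        f"{off}-{off + length - 1}" for off, length in ranges
    )


def plan_requests(baskets, max_ranges_per_request):
    """Group baskets into requests of at most max_ranges_per_request ranges."""
    return [
        baskets[i:i + max_ranges_per_request]
        for i in range(0, len(baskets), max_ranges_per_request)
    ]


# Hypothetical layout: 3 columns with 4 baskets of 1000 bytes each.
baskets = [
    (col * 100_000 + b * 1000, 1000) for col in range(3) for b in range(4)
]

single = plan_requests(baskets, 1)  # one range per request, as with S3
multi = plan_requests(baskets, 8)   # assumed batch size for a multi-range protocol

single_headers = [range_header(r) for r in single]
```

With this toy layout, the single-range plan needs one request per basket, while the batched plan needs far fewer round trips for the same bytes. Whether that translates into a real speed-up would have to be measured.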

Cheers,
Ingo