Best practices for reading ROOT files from S3

Dear ROOT experts,

I am currently carrying out performance benchmarks with ROOT (based on the opendata-benchmark as previously discussed here) and I am considering reading the input files from (Amazon’s) S3 instead of local disks. The main reason is that I want to process files that are too large to fit on the local disks of the machines; another reason is that all other systems in the comparison also process their input directly off remote storage.

I am now wondering what the best practices are for that setup in terms of performance. Unfortunately, my current attempt leads to significantly worse performance (3x worse for the simplest query, about 20% worse for the most complex ones).

I guess one parameter is the protocol. For now, I am opening the files via s3https://endpoint/bucket/file.parquet, but I saw that other protocols exist. Is there any difference in terms of performance? I have briefly compared s3https with s3http and did not see a significant difference.
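For reference, this is roughly how I open the files (a minimal sketch; the endpoint, bucket and object names are placeholders, and I assume the S3 credentials are configured outside the code):

#include "TFile.h"
#include <memory>

void open_s3()
{
   // Same object opened through the two protocol handlers I compared;
   // only the URL scheme differs.
   std::unique_ptr<TFile> f1{TFile::Open("s3https://endpoint/bucket/file.root")};
   std::unique_ptr<TFile> f2{TFile::Open("s3http://endpoint/bucket/file.root")};
   if (f1)
      f1->ls(); // quick sanity check that the remote open worked
}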

Another set of parameters might be related to the ROOT files. In particular, I am wondering whether the basket size might play a role. It is clear that S3 has a much, much higher latency than a local SSD, so it is important to access large amounts of data with each request. Unfortunately, Amazon’s implementation of S3 does not allow accessing several (HTTP) ranges in the same request, so I suppose each branch (and each basket?) requires an individual request. The branches in the current file have basket sizes (as per the Basket Size shown by TTree::Print) between 100 KB and 1.4 MB, which is rather small for S3. Would it make sense to increase the basket size, and if yes, to what size(s) and how?
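For context, this is how I look at the basket sizes (a minimal sketch; the file, tree and branch names are placeholders):

#include "TFile.h"
#include "TTree.h"
#include <iostream>
#include <memory>

void print_basket_sizes()
{
   std::unique_ptr<TFile> f{TFile::Open("file.root")};
   auto *t = f->Get<TTree>("Events");
   t->Print(); // prints the "Basket Size" per branch
   // or, for a single branch:
   std::cout << t->GetBranch("some_branch")->GetBasketSize() << '\n';
}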

Is there anything else that I can try? Is this even a good idea at all? What alternatives do I have for processing files from remote storage?

Thanks a lot in advance,
Ingo

I have made some progress and can now answer some of my earlier questions. In particular, I have found TTree::OptimizeBaskets and managed to run it on my file, configuring it with 1 GB of maximum memory. This results in a basket size of exactly 12039934 bytes for each of the 85 branches, which indeed sums up to about 1 GB and makes the baskets at least about 10x larger than before. However, this had essentially no impact on performance.
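For the record, the rewriting went roughly like this (a sketch, not the exact script; file and tree names are placeholders, and the entry-by-entry copy via CloneTree(0)/CopyEntries is just one way to make the new basket sizes take effect in a new file):

#include "TFile.h"
#include "TTree.h"
#include <memory>

void rewrite_with_larger_baskets()
{
   std::unique_ptr<TFile> in{TFile::Open("input.root")};
   auto *tree = in->Get<TTree>("Events");

   // Recompute the per-branch basket sizes so that all baskets together
   // fit into ~1 GB of memory.
   tree->OptimizeBaskets(1000000000);

   // Write a copy whose baskets are created with the new sizes
   // (entry-by-entry copy, not a "fast" raw-basket clone).
   std::unique_ptr<TFile> out{TFile::Open("output.root", "RECREATE")};
   TTree *copy = tree->CloneTree(0);
   copy->CopyEntries(tree);
   out->Write();
}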

I have also changed from m5d.xlarge instances (in AWS EC2) to m5dn.xlarge, which have “up to 25 Gbps” of networking. That networking stack should provide a bandwidth similar to that of the locally attached SSDs, so bandwidth should not be the problem. However, this did not change the performance either, suggesting that bandwidth had not been the bottleneck before.

I am still interested in suggestions for getting better performance – or a confirmation that I have done everything right and that this is the best performance I can expect.

Cheers,
Ingo

Hi Ingo,
@Axel or @pcanal probably know best, but replies might be slower during August.

The “proper” (and also the standard) way to access ROOT files remotely is via XRootD, which e.g. knows how to read only the branches that are required. I don’t know how smart HTTP access is compared to access via XRootD.
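Just to illustrate what reading only the required branches looks like from the user side (a sketch; host, path, tree and branch names are placeholders):

#include "TFile.h"
#include "TTree.h"
#include <memory>

void read_one_branch_remotely()
{
   std::unique_ptr<TFile> f{TFile::Open("root://host.example.org//path/file.root")};
   auto *t = f->Get<TTree>("Events");
   t->SetBranchStatus("*", false);      // deactivate all branches...
   t->SetBranchStatus("MET_pt", true);  // ...except the one we actually need
   float met = 0.f;
   t->SetBranchAddress("MET_pt", &met);
   for (Long64_t i = 0; i < t->GetEntries(); ++i)
      t->GetEntry(i); // only the baskets of active branches are requested
}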

Cheers,
Enrico

Hey Enrico,

Thanks for the quick reply! Maybe I’ll try to set up some xrootd service and test the performance. (I have tried to query the files from root://eospublic.cern.ch//eos over the internet, but unsurprisingly, that was really slow.)

I also don’t know about the “ROOT on S3” implementation details, but “Parquet on S3”, where you also often access a potentially small subset of columns (aka branches), can be implemented such that only the necessary bytes are transferred via HTTP from S3. My guess is that “ROOT on S3” is implemented similarly; I might try to measure the amount of transferred bytes to confirm.
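If I get to it, I would probably check the counters that TFile keeps (a sketch; the URL is a placeholder):

#include "TFile.h"
#include <iostream>
#include <memory>

void report_io(const char *url = "s3https://endpoint/bucket/file.root")
{
   std::unique_ptr<TFile> f{TFile::Open(url)};
   // ... run the query / event loop here ...
   std::cout << "bytes read: " << f->GetBytesRead()
             << ", read calls: " << f->GetReadCalls() << '\n';
}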

Cheers,
Ingo

Does tree->GetAutoFlush() return 0 for any of the trees you are reading? At the moment that wrongly disables pre-fetching and it happens with many NanoAODs (there is a PR up to fix this).

No, tree->GetAutoFlush() returns 98329 on the original file and some files I derived from it.



Hi Ingo,

As Enrico mentioned, xrootd is really the bread and butter of data access in HEP and is thus the most optimized/scrutinized path. Is there any chance you can install your own xrootd server ‘close enough’ that you get the same nominal high network bandwidth?

With your new files (with baskets in the 10 MB range), I would expect the access to be bandwidth limited (when compared to accessing the file from an SSD), since there would be ‘one’ message per 10 MB … unless the s3/davix implementation is splitting the request into smaller chunks.

To simplify the debugging, you can try something like:

auto f = TFile::Open(filename, "READ");
char *c = new char[very_large_amount];

and measure and/or debug the behavior of

f->ReadBuffer(c, 0, size_smaller_than_very_large_amount);
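For example, the raw throughput could be measured with something like the following (TStopwatch and the 256 MB read size are just illustrative choices, assuming the file is at least that large):

#include "TFile.h"
#include "TStopwatch.h"
#include <iostream>

void time_raw_read(const char *filename)
{
   auto f = TFile::Open(filename, "READ");
   const Int_t nbytes = 256 * 1024 * 1024; // arbitrary large read size
   char *c = new char[nbytes];

   TStopwatch sw;
   sw.Start();
   f->ReadBuffer(c, 0, nbytes); // single sequential read starting at offset 0
   sw.Stop();
   std::cout << nbytes / (1024. * 1024.) / sw.RealTime() << " MB/s\n";

   delete[] c;
   delete f;
}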

Cheers,
Philippe.

Hey Philippe,

Thanks a lot for the reply! I had briefly looked into setting up an xrootd server but then dropped the idea due to time constraints. Is there a guide you can recommend off the top of your head that I could follow?

Your suggestion looks similar to root-readspeed. Maybe that would be a more complete and trusted method to measure raw read speed from both local SSD and S3 (plus xrootd, if I get it to work)?

Cheers,
Ingo

root-readspeed is indeed another very useful data point. However, your description of the symptoms already points to a (very likely) problem in the handling of an S3 server, so we ought to debug that. (On the other hand, if root-readspeed does not reproduce the problem, then we would have to take another look at the benchmark code you used.)