Optimize ZFS record size for Root files

Nicola_Mori · February 25, 2021, 7:54am

I’m setting up a large file server for storing experiment data as ROOT files using ZFS. I’d like to optimize the read operations on these files (they are typically written once and read many times), and one of the parameters I can tune is the record size. If I understood correctly, its optimal value depends on the typical size of the fetch operations done on the files, so I was wondering whether ROOT fetches data from disk with a fixed or typical fetch size. In our software we don’t tune the ROOT I/O operations in any particular way; the usual read operation is a simple

TFile *inputFile = TFile::Open("myFile.root");

(or the plain equivalent using a TChain); the file size ranges from tens to thousands of MB. In this context, is there an optimal record size for my ZFS dataset that can benefit the readout of the files?

Thanks in advance for any hint anyone might provide.

bellenot · February 25, 2021, 7:59am

Maybe @pcanal or @jblomer can give some hints

Nicola_Mori · March 4, 2021, 9:56am

@pcanal @jblomer any hint? Thanks.

jblomer · March 4, 2021, 5:21pm

Apologies for the delay! Without further knowledge, from the provided range I’d suggest the smallest value of a few tens of MB. Typically, ROOT files are read sparsely, so small block sizes should be beneficial. How big the read requests really are depends on the I/O buffer size of the application and on how the file was written. If you have sample applications, TTreePerfStats can be used to determine the key figures. Perhaps @pcanal has more experience specifically with ZFS.

Nicola_Mori · March 4, 2021, 5:27pm

@jblomer no need to apologize, and many thanks for the answer! I don’t clearly understand the sentence “I’d suggest the smallest value of a few tens of MB”: the default ZFS record size is 128 KB, and the typical tuning suggestions that can be found on the web range from 16 KB for database applications to 1 MB for file streaming. So if you mean setting it to a few tens of MB, then I don’t know whether this would be appropriate (I’d say it would waste a lot of bandwidth if the readout is not strictly sequential).
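To put a rough number on that bandwidth concern, here is a minimal sketch. It assumes that ZFS has to read every record touched by a request in full, and it ignores caching, compression and prefetching, so it is a worst-case estimate rather than a measurement. The function names `bytesFetched` and `readAmplification` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes that must come off the disk to serve one read of `requestBytes`
// starting at byte `offset`, ASSUMING every touched record of size
// `recordBytes` is read in full (no cache hits, no compression).
std::uint64_t bytesFetched(std::uint64_t offset, std::uint64_t requestBytes,
                           std::uint64_t recordBytes) {
    assert(recordBytes > 0 && requestBytes > 0);
    const std::uint64_t firstRecord = offset / recordBytes;
    const std::uint64_t lastRecord = (offset + requestBytes - 1) / recordBytes;
    return (lastRecord - firstRecord + 1) * recordBytes;
}

// Read amplification: bytes fetched from disk per byte actually requested.
double readAmplification(std::uint64_t offset, std::uint64_t requestBytes,
                         std::uint64_t recordBytes) {
    return static_cast<double>(bytesFetched(offset, requestBytes, recordBytes)) /
           static_cast<double>(requestBytes);
}
```

Under these assumptions, an isolated, record-aligned 32 KB request costs 4 times its size with 128 KB records, 32 times with 1 MB records and 512 times with 16 MB records. Sequential reads recover most of that overhead, because neighbouring requests fall into records that were already fetched.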

jblomer · March 8, 2021, 8:27am

Oh, I had misunderstood: I thought the smallest possible block size was of the order of 10 MB.

The standard TTree basket size is 32 kB, but in practice individual read requests are often combined and are therefore larger. I’d suggest 128 kB to start with, but in fact I’d be very interested myself to know if you see a difference in performance for, say, 32 kB or 1 MB (I made a note to myself: that’s something we should benchmark). Another point to consider is the hardware block size. I think the file system block size should ideally be a multiple of the block size of the drive.
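The alignment point can be checked with a few lines of code. This is only a sketch: the helper name `isAlignedRecordSize` is hypothetical, the power-of-two requirement reflects my understanding of what ZFS accepts for record sizes, and the sector size (commonly 512 or 4096 bytes) has to be looked up for the actual drives.

```cpp
#include <cstdint>

// True if `recordBytes` is a power of two and a whole multiple of the
// drive's sector size `sectorBytes`. Both checks are assumptions about
// what makes a sensible record size, not an official ZFS validation.
bool isAlignedRecordSize(std::uint64_t recordBytes, std::uint64_t sectorBytes) {
    if (recordBytes == 0 || sectorBytes == 0) {
        return false;
    }
    const bool powerOfTwo = (recordBytes & (recordBytes - 1)) == 0;
    const bool multipleOfSector = (recordBytes % sectorBytes) == 0;
    return powerOfTwo && multipleOfSector;
}
```

With 4096-byte sectors, the candidate sizes of 32 kB, 128 kB and 1 MB all pass this check, so alignment alone does not favour one of them over the others.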

Nicola_Mori · March 8, 2021, 5:43pm

I’ve run a small test with a simple workload (sequentially reading a tree whose branches are single PODs) using TTreePerfStats for 32K, 128K and 1M record sizes. I didn’t see any major difference, but 128K seems to be slightly more performant than the other sizes:

rec. size	Disk time	Disk IO
32K	35.6 s	167.9 MB/s
128K	29.8 s	198.7 MB/s
1M	31.0 s	191.1 MB/s

I don’t know how to interpret these numbers precisely: e.g. the higher IO for 128K w.r.t. 32K could be due to overhead coming from the increased record size, but the lower disk time suggests a real performance gain. Anyway, the 128K default seems to work at least for this simple test, so I think I’ll stick with it.
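One way to cross-check the table is to multiply disk time by disk IO for each run, which should give the volume of data read if "Disk IO" is simply bytes read divided by disk time (an assumption on my side about how TTreePerfStats reports it). The sketch below does that arithmetic; the struct `Run` and the helper names are made up for this illustration, and the numbers are copied from the table above.

```cpp
#include <cassert>

// One row of the measurement table.
struct Run {
    const char* recordSize;
    double diskTimeS;   // "Disk time" column, in seconds
    double diskIoMBps;  // "Disk IO" column, in MB/s
};

constexpr Run kRuns[] = {
    {"32K", 35.6, 167.9},
    {"128K", 29.8, 198.7},
    {"1M", 31.0, 191.1},
};

// Volume read in MB, ASSUMING Disk IO = volume / disk time.
double totalMB(const Run& r) { return r.diskTimeS * r.diskIoMBps; }

// How much longer (in percent) run `r` spent on disk than run `ref`.
double slowdownPercent(const Run& r, const Run& ref) {
    assert(ref.diskTimeS > 0.0);
    return 100.0 * (r.diskTimeS - ref.diskTimeS) / ref.diskTimeS;
}
```

The products come out at about 5977 MB, 5921 MB and 5924 MB, i.e. the same volume within 1%. Under the stated assumption the higher rate at 128K is therefore not extra data being counted but the same data delivered in less time: the 32K run spent about 19% longer on disk than the 128K run, and the 1M run about 4% longer.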

@jblomer if you have any insight, suggestion for other tests or performance figures then I’ll be glad to work more on this. And thanks for the support!
