Read speeds with RDataFrame in python

Hello again experts,

I have a question regarding read speeds/disk usage when using RDataFrame in python. Apologies ahead of time that this is quite technical. I’m hoping to hear about any similar experiences others may have had as opposed to expecting a specific solution :slight_smile:

I have an 8 core/16 thread modern desktop CPU that I’ve been running RDataFrame jobs on. I’ve copied CMS NanoAOD data and simulation sets to the machine on a 6 TB HDD that runs at 5400 RPM with a cache of (I think) 256 MB, connected via SATA (6Gbps), and formatted as NTFS. I read from this drive and write histograms, snapshots, etc to an SSD.

If I benchmark the HDD, I get speeds of ~180 MB/s, and if I perform a real reading task (copying from the drive to another), I get 80-100 MB/s.

I’ve been having some unexpected speed issues when reading large groups of NanoAOD files from this HDD: the allotted CPU threads jump to 100% and then, after 15-30 seconds, fall to 0-50% of their processing capacity for the remainder of the run (so, not a CPU bottleneck). When viewing these in htop, I can see the status of each sub-process switching between ‘R’ (running) and ‘D’ (disk sleep), with sometimes all of the processes sitting in the ‘D’ state for a second or two. Viewing iotop shows that the actual read speeds are only 5-10 MB/s. This value also goes lower depending upon which set I’m reading (I haven’t figured out a correlation there just yet).

I’ve tried using both fewer and more threads, and running concurrent jobs (on different simulation sets but on the same drive), but the highest read speeds I’ve reached are around 10 MB/s.

I also moved one set (~370 GB total) to a different HDD (1 TB, 7200 RPM, 64 MB cache, Ext4) and had the same experience. If I run that concurrently with a set on the 6 TB drive, I can get total read speeds of ~15 MB/s.

So I believe the drives are healthy and I know the CPU has more processing power to use but I’m not sure what’s causing the slowdown. I plan on moving some files to an SSD for testing (just need to make some room :slight_smile:). Has anyone had any similar experiences though? Maybe this is a python issue?

Thanks!
Lucas

ROOT Version: v6.20
Platform: Ubuntu 18.04 LTS
Compiler: Not Provided



Hi,
I don’t know what might be the reason why the reading speeds are so low, but in order to gain some insight it would be interesting to know:

  • how the same code performs on an SSD
  • how turning multi-threading on/off (ROOT::EnableImplicitMT or not) affects speeds
  • how fast plain TTree is w.r.t. single-thread RDF on a very simple task, something like summing up the values of a few branches for all entries (see the sketch below)
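
A minimal sketch of that last comparison, in Python, assuming local files with an Events tree and a MET_pt branch (file and branch names here are hypothetical placeholders):

import time
import ROOT

fileNames = ["nano_1.root", "nano_2.root"]  # hypothetical local files

chain = ROOT.TChain("Events")
for fileName in fileNames:
    chain.Add(fileName)

# Plain TChain loop: sum one branch over all entries
start = time.time()
total = 0.0
for event in chain:
    total += event.MET_pt
print("TChain loop: %.1f s, sum = %f" % (time.time() - start, total))

# The same task with single-thread RDF (no EnableImplicitMT call)
start = time.time()
df = ROOT.RDataFrame(chain)
s = df.Sum("MET_pt").GetValue()  # GetValue() triggers the event loop
print("RDF: %.1f s, sum = %f" % (time.time() - start, s))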

Cheers,
Enrico

Hi Enrico,

Okay so here are some tests.

I took a subset of one of the samples so I could iterate on tests a bit quicker. I copied 32 GB of files to the SSD and then 100 GB. I kept a list of the equivalent files on the HDD so that when I run a test on the SSD and on the HDD, they compare the same data.

The read speeds are very approximate. I’m pretty sure iotop just polls to see how much I/O has happened since its last poll, so I wouldn’t pay attention to anything other than the order of magnitude.


Test 1: MT off
HDD:

  • Read speed: Up to 150 kB/s (looks like it’s working in small chunks because I can see it read and write back and forth quickly)
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 1.19 min

SSD:

  • Read speed: Same as HDD
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 1.20 min

Test 2: EnableImplicitMT(1)
HDD:

  • Read speed: Up to 300 kB/s
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 1.29 min

SSD:

  • Read speed: Same as HDD
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 1.29 min

Test 3: EnableImplicitMT(4)
HDD:

  • Read speed: Up to 300 kB/s
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 0.56 min

SSD:

  • Read speed: Up to 500 kB/s
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 0.56 min

At this point, I realized we might need a bigger part of the data to get a better test so I jumped from 32 GB to 100 GB.

Test 4: MT off - 100 GB
HDD:

  • Read speed: Up to 150 kB/s
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 3.57 min

SSD:

  • Read speed: Same as HDD
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 3.55 min

Test 5: EnableImplicitMT(1) - 100 GB
HDD:

  • Read speed: Up to 150 kB/s
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 3.81 min

SSD:

  • Read speed: Same as HDD
  • Disk sleep: No
  • Approx. thread utilization: 100 %
  • Processing time: 3.83 min

Test 6: EnableImplicitMT(4) - 100 GB
HDD:

  • Read speed: 2-4 MB/s
  • Disk sleep: YES
  • Approx. thread utilization: 10-20 % with a single thread increasing to 100 % for 1-2 min before dropping back down
  • Processing time: 5.82 min

SSD:

  • Read speed: 10-20 MB/s
  • Disk sleep: No
  • Approx. thread utilization: 90-100 %
  • Processing time: 1.49 min

So clearly there’s some sort of connection between multithreading, the size of a group of files (I’ve been using TChain to connect these), and the “disk sleep” (my understanding is that ‘D’ means the process is blocked in uninterruptible sleep: the scheduler won’t run it again until the disk delivers the data it asked for).

I’m wondering why the disk is taking so long to respond and why adding multiple threads makes the process read data in larger chunks.

Maybe of interest is how I’m passing the files to RDataFrame. I do:

eventsChain = ROOT.TChain(eventsTreeName) 
for fileName in fileNames:
    eventsChain.Add(fileName)

BaseDataFrame = ROOT.RDataFrame(eventsChain) 
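
For what it’s worth, I believe the equivalent without an explicit TChain is to pass the tree name and the file list directly to the constructor (using an std::vector of strings to be safe with this PyROOT version); this should behave the same, it’s just more compact:

fileNames_v = ROOT.std.vector("string")()
for fileName in fileNames:
    fileNames_v.push_back(fileName)

BaseDataFrame = ROOT.RDataFrame(eventsTreeName, fileNames_v)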

One other strange thing that I’ve seen is that the tests are not always reproducible. I just tried Test 6 again with the HDD and it ran quickly, as I would hope.

Feeling lucky, I tried to do the full 370 GB but it went back to the “disk sleep” slowdown. So I tried Test 6 on the HDD again (with 100 GB) and now it’s back to being slow.

My only guess for that behavior is some sort of caching of data but that seems far-fetched given the size of the files.

Multiple threads can easily choke the HDD, as the head needs to physically move back and forth: each thread requires a seek plus a read at a different position.

My only guess for that behavior is some sort of caching of data but that seems far-fetched given the size of the files.

On Linux you can check the utilization of the filesystem cache with free, and it’s also possible to manually empty the cache before running each of your tests, to make sure you are always running with a cold cache.
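
For example (these are plain Linux mechanisms, nothing ROOT-specific; the drop_caches write needs root):

import subprocess

subprocess.call(["free", "-h"])  # cache usage appears in the "buff/cache" column

# Flush dirty pages, then drop the page cache so the next run starts cold
# (equivalent to: sync; echo 3 > /proc/sys/vm/drop_caches, as root)
subprocess.call(["sync"])
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")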

Test 2: EnableImplicitMT(1)
HDD:

  • Read speed: Up to 300 kB/s

It makes no sense to me that read speed doubles with EnableImplicitMT(1) w.r.t. no multi-threading.

Frankly, the 24 minutes or so that it takes to go over 370 GB is not bad at all even if it’s sub-optimal - there will always be a bottleneck.

I make this a bigger issue for myself though by using EnableImplicitMT() in conjunction with python multiprocessing (as I posted about here). I imagine this just exacerbates the issue because now the disk is jumping between many different files. I didn’t think this would be such a big deal, but I ran four workers with four threads each overnight for about 20 different sets of files and woke up to the machine pretty much frozen (which is what brought me here).
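
Roughly, the pattern looks like this (tree/branch names and file lists below are placeholders; the real workers do more than fill one histogram):

import multiprocessing as mp
import ROOT

def processSet(fileNames):
    # Each worker enables its own 4-thread pool, so 4 workers x 4 threads
    # end up issuing reads against the same disk at once
    ROOT.ROOT.EnableImplicitMT(4)
    files_v = ROOT.std.vector("string")()
    for fileName in fileNames:
        files_v.push_back(fileName)
    df = ROOT.RDataFrame("Events", files_v)
    h = df.Histo1D("MET_pt")  # placeholder analysis
    h.GetValue()  # triggers the event loop

if __name__ == "__main__":
    sets = [["set1_a.root", "set1_b.root"], ["set2_a.root"]]  # placeholders
    pool = mp.Pool(4)
    pool.map(processSet, sets)
    pool.close()
    pool.join()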

It makes no sense to me that read speed doubles with EnableImplicitMT(1) w.r.t. no multi-threading.

I think this is just inconsistency from iotop. It’s probably the difference between a measurement interval landing at the edges of a transfer versus splitting it in the middle. And then there’s human error, because I’m just staring at the output watching for the highest number, and maybe I miss when a poll gets lucky and sees a 300 kB chunk in one second :slight_smile: (these studies were not terribly scientific).

A couple of tests can be done for measurements.

One is to see the performance of hadd (hadd -f /var/tmp/output.root [files_up_to_100GB]). If you need to merge more than 100 GB, the default max tree size that makes hadd split the output can be raised in a rootlogon.C (via TTree::SetMaxTreeSize).

Another is to see the performance of ‘just’ reading the chain:

TChain ch(eventsTreeName);
for (const auto &fileName : fileNames)
   ch.Add(fileName);   // add each file to the chain
for (Long64_t e = 0; e < ch.GetEntries(); ++e)
   ch.GetEntry(e);

And then you can also play with splitting the chain into multiple chunks and running those in parallel via multiple processes.
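
A rough sketch of that last idea in Python (tree name, files, and chunk count are placeholders):

import multiprocessing as mp
import ROOT

def readChunk(fileNames):
    # Plain TChain read loop over one chunk of files, no RDF involved
    ch = ROOT.TChain("Events")
    for fileName in fileNames:
        ch.Add(fileName)
    for e in range(ch.GetEntries()):
        ch.GetEntry(e)

if __name__ == "__main__":
    allFiles = ["nano_%d.root" % i for i in range(8)]  # placeholders
    nChunks = 4
    chunks = [allFiles[i::nChunks] for i in range(nChunks)]
    pool = mp.Pool(nChunks)
    pool.map(readChunk, chunks)
    pool.close()
    pool.join()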
