I recently needed to do PCAs of large data sets and turned to the ROOT TPrincipal class, which I have used before (though long ago). This worked out fine and we got some nice results. However, as some of the analyses were taking upwards of 5 minutes, I started looking into ways of speeding it up. The data sets are hyperspectral images, typically involving 200,000 or so 250-dimensional vectors in each analysis. The application runs on Windows and is built with Visual Studio.
Anyway, what I came up with is a class derived from TPrincipal that overrides a few functions. So far it covers only those relevant to our purposes, and it possibly breaks some other functionality, but it runs 10 to 25 times faster, so our 5-minute analysis is down to 12 seconds in some cases.
So my question is: is anyone else interested in this? If so, I’ll tidy it up and submit a code contribution.
I should mention that this isn’t anything overly clever or complicated. I know there are some “Fast PCA” strategies that work quite differently (and which I didn’t really understand when I read about them), but this is just a standard PCA optimised for speed. The speed-up also doesn’t come from parallel processing. I have separately done a parallel version that gets the 5 minutes down to 4 seconds on a quad CPU, but it uses Windows synchronisation classes, so it is tied to Windows as it stands.