TPrincipal Problem, test included

I am trying to use the TPrincipal class on my data. I basically have a TH2D histogram for which I want to determine the principal components. After getting nonsensical answers, I tested TPrincipal with some input data whose output is known. The code I used for this test is attached. I have tested this with ROOT v5_24_00.

Run the macro with

.L PCATest.C
PCATest()

The attached file has the “answers” at the top. What I have noticed is that the covariance matrix TPrincipal calculates is not the same as the one I included in the file (which I calculated by hand in a spreadsheet).

So my question is: Am I missing something about how TPrincipal works? Is there a bug?

Update

I realized that I was calling TPrincipal without any options (just a name), so it normalizes the covariance matrix (the default behavior).

If I instead construct it as TPrincipal *pcatest = new TPrincipal(numVariables,"D"); so that the covariance matrix is not normalized, I get different results.

I still don’t get the same covariance matrix as the one I calculated, and the PCA’s eigenvalues are not as expected, but the eigenvectors are “correct”.
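
For reference, this is roughly what my macro does (a minimal sketch, not the exact code in the attachment; `numPoints`, `x` and `y` are just placeholders):

```cpp
#include "TPrincipal.h"

// Sketch of the TPrincipal usage in question; the data arrays are placeholders.
void PCASketch(Int_t numPoints, const Double_t *x, const Double_t *y)
{
   const Int_t numVariables = 2;

   // "D": store the data rows; no "N", so I expected an un-normalised covariance.
   TPrincipal *pcatest = new TPrincipal(numVariables, "D");

   Double_t row[2];
   for (Int_t i = 0; i < numPoints; ++i) {
      row[0] = x[i];
      row[1] = y[i];
      pcatest->AddRow(row);
   }

   pcatest->MakePrincipals();
   pcatest->Print("MSEV"); // means, sigmas, eigenvalues, eigenvectors
}
```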
PCATest.C (2.4 KB)

I expect that Christian Holm Christensen (author of this class) will answer your question.

Rene

Hi,

The covariance matrix stored in TPrincipal is normalised by the trace of the ‘raw’ covariance matrix. That is,

   Covar_ij' = Covar_ij / Tr(Covar) = Covar_ij / sum_{k=0}^{M-1} Covar_kk

where M is the dimension of the problem, and

   Covar_ij = 1/N sum_{k=0}^{N-1} (x_{ki} - m_i) * (x_{kj} - m_j)

where x_{ki} is the i’th variable of the k’th observation, m_i is the mean of the i’th variable, and N is the number of observations.

This normalisation will happen irrespective of whether the ‘N’ option is passed to the TPrincipal constructor.
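
To make the comparison concrete, here is a minimal, self-contained sketch (the data values are made up, and this is not the attached script) that builds the trace-normalised covariance matrix by hand and prints it next to the one TPrincipal stores:

```cpp
#include "TPrincipal.h"
#include "TMatrixD.h"

void CovarCheck()
{
   const Int_t M = 2, N = 5;
   Double_t data[N][M] = {{1, 2}, {2, 1}, {3, 4}, {4, 3}, {5, 5}}; // made-up data

   TPrincipal pca(M, "D");
   for (Int_t k = 0; k < N; ++k) pca.AddRow(data[k]);
   pca.MakePrincipals();

   // Raw covariance: Covar_ij = 1/N sum_k (x_ki - m_i)(x_kj - m_j)
   Double_t mean[M] = {0, 0};
   for (Int_t k = 0; k < N; ++k)
      for (Int_t i = 0; i < M; ++i) mean[i] += data[k][i] / N;

   TMatrixD covar(M, M);
   for (Int_t k = 0; k < N; ++k)
      for (Int_t i = 0; i < M; ++i)
         for (Int_t j = 0; j < M; ++j)
            covar(i, j) += (data[k][i] - mean[i]) * (data[k][j] - mean[j]) / N;

   // Normalise by the trace, as described above
   Double_t trace = 0;
   for (Int_t i = 0; i < M; ++i) trace += covar(i, i);
   covar *= 1.0 / trace;

   covar.Print();                      // hand-made, trace-normalised covariance
   pca.GetCovarianceMatrix()->Print(); // covariance matrix stored by TPrincipal
}
```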

I have attached a modified version of the test script that shows this.

Note that the absolute values of the eigenvalues and the lengths of the eigenvectors are not important. What matters for PCA is the relative size of the eigenvalues and that the principal eigenvectors span the appropriate space.

Yours,

Christian
PCATest.C (6.49 KB)

Christian,

Thanks for the reply. From what I understand of your post, and your modified PCATest.C code, TPrincipal will calculate the eigenvectors correctly given a two-dimensional dataset. That is really all I need, as the eigenvalues are not needed.

(I only need the eigenvectors to convert the original basis (X) to the new orthonormal basis (P), and, as you know, your X2P() and P2X() functions only use the eigenvectors.)
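
For what it’s worth, this is the kind of conversion I mean (a sketch; `pcatest` is assumed to be a TPrincipal that has already been filled and decomposed, and the point values are made up):

```cpp
// 'pcatest' is assumed to have had AddRow() and MakePrincipals() called already.
Double_t x[2] = {1.2, 3.4}; // a point in the original variables (made-up values)
Double_t p[2];              // the same point in the principal basis
Double_t xBack[2];          // back-transformed point

pcatest->X2P(x, p);          // uses the stored means and eigenvectors, not the eigenvalues
pcatest->P2X(p, xBack, 2);   // back-transform, keeping both principal components
```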

Thanks for clarifying what TPrincipal does.

Shawn

Hi,

I am trying to start using this method and followed this discussion.
I downloaded your macros and tried to understand what they were doing, but I still don’t get why the “Real eigen values” are so different from the “PCA eigen values”.

I understand that the important thing is the relative size of the eigenvalues, but in the real case it seems that the important variable is 1, while for the PCA method it is 0. Is this correct, or am I missing something?

Also on:
root.cern.ch/root/html/TPrincipal.html

there is a link to some more documentation, but it is not working; it appears in this sentence of the text:

A short outline of the method of Principal Components is given in subsection 1.3.

Thanks in advance,
Hermes

The TPrincipal class always normalizes the variables; this explains the difference between what you call the “Real eigen values” and the eigenvalues computed by the class.
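
One quick way to see the effect of the normalisation (a sketch; `pca` stands for an already-filled TPrincipal, as in the scripts above): because the stored covariance matrix is normalised by its trace, the eigenvalues reported by the class sum to 1, so only their relative sizes can be compared with the un-normalised ones.

```cpp
// 'pca' is assumed to be a TPrincipal after AddRow() and MakePrincipals().
// With a trace-normalised covariance matrix, the eigenvalues should sum to ~1.
const TVectorD *vals = pca.GetEigenValues();
Double_t sum = 0;
for (Int_t i = 0; i < vals->GetNrows(); ++i)
   sum += (*vals)[i];
std::cout << "Sum of TPrincipal eigenvalues: " << sum << std::endl;
```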

I fixed the documentation problem that you reported (thanks).

Rene