How is PCA variable transformation applied to LDA?

Hello,

I was reading the descriptions of variable decorrelation (Sec. 4.1.2) and principal component decomposition (Sec. 4.1.3) in the TMVA Users' Guide. In the principal component decomposition section, a distinction is made between transformations computed from the signal sample (U = S) and from the background sample (U = B).

My expectation was that, when no signal- or background-specific transformation is requested, the transformation would be computed from a sample containing both signal and background together; the input variables would then be transformed, and the analysis would proceed with the transformed variables. If principal component decomposition (principal component analysis, PCA) is instead applied to separate samples of signal and background, there would be one PCA basis for the signal sample and another for the background sample. How, then, are the transformed variables from the signal basis related to those from the background basis in a linear discriminant analysis (LDA)? My understanding of LDA is that signal and background are distinguished using the same set of variables.

Hi,

I did not double-check the documentation or read the code, but the PCA should be computed on the combination of signal and background.

In application, the classifier has no way, a priori, of telling signal from background, so it makes little sense to treat them differently.

Thus TMVA should use a single basis when the transformation is applied to the whole dataset.

Cheers,
Kim

Okay, so if I specify VarTransform=P in TMVA::BookMethod(), PCA is computed on a sample containing both signal and background.

I could use some clarification on what happens when I specify VarTransform=P_Signal. I assume this constructs a PCA basis from the signal sample only, and that the measurements from the background sample are then projected onto this signal basis. In other words, the input variables from both the signal and background samples are projected onto the signal basis, and TMVA trains on these projections. Is this correct?
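To make sure I'm describing the same thing, here is a minimal pure-Python sketch of the interpretation I have in mind (toy two-variable data and the analytic 2×2 eigendecomposition are my own for illustration; this is not TMVA's actual code): the basis is fitted on the signal sample only, and the same projection is then applied to both classes.

```python
import math

def pca_basis(sample):
    """Mean and eigenvectors of the 2x2 covariance of a 2-variable sample."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    cxx = sum((p[0] - mx) ** 2 for p in sample) / n
    cyy = sum((p[1] - my) ** 2 for p in sample) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / n
    # Analytic eigenvectors of the symmetric matrix [[cxx, cxy], [cxy, cyy]]
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))
    return (mx, my), (e1, e2)

def project(sample, mean, basis):
    """Project events onto the PCA axes, centred on the fitted mean."""
    (mx, my), (e1, e2) = mean, basis
    return [((p[0] - mx) * e1[0] + (p[1] - my) * e1[1],
             (p[0] - mx) * e2[0] + (p[1] - my) * e2[1]) for p in sample]

# Toy samples (assumed, for illustration only)
signal     = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.2)]
background = [(0.5, 0.4), (1.5, 1.7), (2.5, 2.4)]

# VarTransform=P_Signal reading: fit the basis on signal only ...
mean, basis = pca_basis(signal)
# ... then apply the SAME transformation to both classes
signal_t     = project(signal, mean, basis)
background_t = project(background, mean, basis)
```

(For VarTransform=P, I would instead fit `pca_basis` on `signal + background` and still project both classes with the one resulting basis.)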

If this understanding is correct, I assume it applies to the other transformations as well. For example, when specifying VarTransform=G, the Gaussianization transformation uses CDFs constructed from histograms of the input variables, filled from both signal and background (how is the number of bins chosen?). If instead I specify VarTransform=G_Signal, the input-variable histograms used to construct the CDFs would contain only the signal sample. Would the background sample then be Gaussianized with respect to these signal-only CDFs? I presume training is done with the Gaussianized variables, so both signal and background would have to undergo the Gaussianization transformation.
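Concretely, the G_Signal reading I have in mind would look like this (a pure-Python sketch, not TMVA's implementation: a rank-based empirical CDF stands in for TMVA's binned histogram CDFs, and the toy data and the clamp away from 0 and 1 are my own assumptions):

```python
from statistics import NormalDist
from bisect import bisect_right

def make_cdf(sample):
    """Empirical CDF from one sample (stand-in for TMVA's binned CDF)."""
    s = sorted(sample)
    n = len(s)
    def cdf(x):
        # Rank-based estimate, clamped away from 0 and 1 so inv_cdf stays finite
        p = (bisect_right(s, x) + 0.5) / (n + 1)
        return min(max(p, 1e-6), 1 - 1e-6)
    return cdf

def gaussianize(sample, cdf):
    """Map each value through the CDF, then the inverse standard-normal CDF."""
    inv = NormalDist().inv_cdf
    return [inv(cdf(x)) for x in sample]

# Toy one-variable samples (assumed, for illustration only)
signal     = [0.2, 0.7, 1.1, 1.5, 2.3, 3.0]
background = [1.0, 2.0, 4.0]

# VarTransform=G_Signal reading: the CDF is built from the signal sample only ...
cdf = make_cdf(signal)
# ... and the SAME CDF Gaussianizes both classes before training
signal_g     = gaussianize(signal, cdf)
background_g = gaussianize(background, cdf)
```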

In the Gaussianization example above, I could see another interpretation of VarTransform=G. For each input variable, a pair of CDFs is created: one references a histogram filled from signal, the other a histogram filled from background. In training, the variables transformed with the signal CDFs would then be trained against the variables transformed with the background CDFs. But this isn't correct, is it? It seems that transformations specified with either _S or _B should be treated in the same manner for any choice of transformation.

Hi,

My understanding is that your first interpretation (VarTransform=X_Signal computes the transformation from the signal sample only and applies it to both signal and background) is correct. I checked this against Sections 4.1.4–4.1.5 of the TMVA Users' Guide, so I'm reasonably sure of the interpretation. :slight_smile:

Regarding the number of bins in the Gaussianization transform: the number of bins is determined automatically, between 1 and 2000, based on the event distribution; TMVA tries to ensure a reasonable number of events per bin. (The exact algorithm was not entirely clear to me from a quick reading.)

To elaborate on the process: there are C + 1 CDFs (C being the number of classes), each with one histogram per input variable, i.e. (C + 1) × V histograms in total. This is independent of whether signal-only, background-only, or signal-plus-background data is used.

Cheers,
Kim

Thank you for clarifying, Kim!