Scores and Sample Statistics

From Eigenvector Research Documentation Wiki
Revision as of 13:15, 6 December 2011 by Jeremy

Decomposition Methods

The decomposition methods generally use only the X-block, and the following statistics are available:

Scores on PC/Comp/LV: Scores give the amount that each PC or component ("Latent Variable" or LV, generically) contributes to each sample. In models like Purity, MCR, and PARAFAC, this is theoretically proportional to chemical concentration or some other quantitative property (depending on the physics of the measurements being analyzed). This is the T term in the equation X = TP' + E.
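As a minimal sketch of the decomposition X = TP' + E, a PCA model can be built from mean-centered data via the SVD; the names X, T, P, and E follow the equation in the text, and the data here is synthetic, for illustration only:

```python
import numpy as np

# Sketch: PCA decomposition X = T P' + E via SVD on mean-centered data.
# Synthetic data; variable names follow the equation in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 samples x 5 variables
Xc = X - X.mean(axis=0)               # mean-center each variable

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                 # number of PCs retained
T = U[:, :k] * s[:k]                  # scores: one row per sample
P = Vt[:k].T                          # loadings: one column per PC
E = Xc - T @ P.T                      # residuals left by the k-PC model

print(T.shape, P.shape, E.shape)      # (20, 2) (5, 2) (20, 5)
```

Each row of T holds one sample's scores; with all five components retained, E would be zero to machine precision.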
Q Residuals: Sum of squared residuals (aka Q residuals) is a scalar value for each sample which describes how much of the signal in that sample is left unexplained by the model. The higher this value, the more likely the sample contains some other systematic response which the model failed to describe/capture. Q is the sum across variables of the squared elements of the E term from the equation X = TP' + E:

Q_i = Σ_j e_ij^2

where i is the index for samples, j is the index for variables, and e_ij is the [i,j] element of the E matrix.
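The per-sample Q value is simply a row-wise sum of squares over the residual matrix. A minimal sketch, reusing the same synthetic SVD-based PCA model (names follow the text's notation):

```python
import numpy as np

# Sketch: per-sample Q residuals, Q_i = sum_j e_ij^2, from a PCA model
# built with SVD. Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
T = U[:, :k] * s[:k]                  # scores
P = Vt[:k].T                          # loadings
E = Xc - T @ P.T                      # unexplained part of each sample

Q = (E ** 2).sum(axis=1)              # one Q value per sample
print(Q.shape)                        # (20,)
```

Samples with unusually large Q contain variation the k retained components cannot reproduce.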

Hotelling T^2: Hotelling T-squared is a scalar value for each sample which describes the sum of squared scores, corrected for the variance captured by each component (PC, LV, etc.). It gives the distance from the sample to the multivariate center of the model. The larger this value, the further the sample is from the center and, if the sample is part of the calibration set, the more influence it had on the model's fit. Hotelling T-squared can be considered the counterpart to Q residuals: taken together, these two statistics describe how much of a sample's variance the model captured (T^2) and how much was left over (Q).
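A minimal sketch of the T^2 calculation described above: square each sample's scores, divide each by the variance its component captures, and sum. The synthetic data and SVD-based model are for illustration only.

```python
import numpy as np

# Sketch: Hotelling T-squared per sample -- squared scores divided by
# the variance captured by each component, then summed over components.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
T = U[:, :k] * s[:k]                           # scores

score_var = (s[:k] ** 2) / (Xc.shape[0] - 1)   # variance captured per PC
T2 = ((T ** 2) / score_var).sum(axis=1)        # one T^2 value per sample
print(T2.shape)                                # (20,)
```

Dividing by the per-component variance is what makes T^2 a variance-corrected distance rather than a raw sum of squared scores.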
KNN Score Distance (k=3): Gives the average distance to the k nearest neighbors (in most cases, k=3) in score space for each sample. This value indicates how well sampled the given region of score space was in the original model. If a sample is fairly unique, it will sit alone in a region of the scores plot and its KNN Score Distance will be high. A high KNN Score Distance for a test or prediction sample may indicate that the sample is not sufficiently similar to the calibration set to trust its predictions, particularly with mildly or highly non-linear responses. For more information, see the description in knnscoredistance.
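The statistic above can be sketched with a brute-force pairwise distance calculation: for each sample, average the Euclidean distances to its 3 nearest neighbors in score space, excluding the sample itself. Synthetic scores stand in for a real model's T matrix.

```python
import numpy as np

# Sketch: KNN Score Distance with k=3 -- average Euclidean distance to
# each sample's 3 nearest neighbors in score space (self excluded).
rng = np.random.default_rng(0)
T = rng.normal(size=(20, 2))                    # scores: 20 samples, 2 PCs

diff = T[:, None, :] - T[None, :, :]            # pairwise score differences
dist = np.sqrt((diff ** 2).sum(axis=-1))        # 20 x 20 distance matrix
np.fill_diagonal(dist, np.inf)                  # a sample is not its own neighbor

k = 3
knn_dist = np.sort(dist, axis=1)[:, :k].mean(axis=1)
print(knn_dist.shape)                           # (20,)
```

Samples in densely populated regions of the scores plot get small values; isolated samples get large ones.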


Property                     PCA   Purity   MCR   PARAFAC
Scores on PC / Comp           X      X       X       X
Q Residuals                   X      X       X       X
Hotelling T^2                 X                      X
KNN Score Distance (k=3)      X      X       X       X