Advanced Preprocessing: Introduction
Data preprocessing is routinely employed in multivariate analysis, but it is often unclear why and when to preprocess the data, and in what order. The large number of available preprocessing methods adds to the confusion. In short, the objective of data preprocessing is to separate the signal of interest from clutter, where clutter is defined as all signal that is not of interest (e.g., signal attributable to interferences and noise). The appropriate preprocessing method therefore depends on the data analysis objective and on how the signal and clutter manifest in the data. The topic can become complicated quickly, so the intention here is to provide only a brief introduction to why preprocessing is performed and in what order. Simple preprocessing methods are used as examples to introduce these concepts. A more thorough discussion of the objective, theory and math of each preprocessing procedure is not included here.
Preprocessing is typically performed prior to data analysis methods such as principal components analysis (PCA) or partial least squares regression (PLS). Recall that PCA maximizes the capture of sum-of-squares with factors, or principal components (PCs), within a single block of data, and PLS is a slightly more complicated method that finds linear relationships between two blocks of data. This introduction will use PCA in the examples. Two of the simplest preprocessing methods are mean-centering and autoscaling, and these will be described in a bit more detail. First, however, the data analysis objective with no preprocessing is discussed.
Imagine that an $M \times N$ data matrix $\mathbf{X}$ is available and the objective is to perform exploratory analysis of this data using PCA. Recall that samples (or objects) correspond to the rows of $\mathbf{X}$ and variables correspond to the columns. If no preprocessing is applied to $\mathbf{X}$ prior to the PCA decomposition, then the PCA loadings will capture the most sum-of-squares in $\mathbf{X}$ about the origin (i.e., the model is a force fit about zero). In this case, the first principal component (PC) will point in the direction that captures the most sum-of-squares about zero (variance about zero).
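The force fit about zero can be illustrated with a small numerical sketch. It assumes NumPy and a made-up two-variable data set whose offset and noise levels are chosen purely for illustration; PCA is computed here as a singular value decomposition of the raw matrix, which is one common way to obtain the loadings and not necessarily the routine used elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: M = 200 samples, N = 2 variables, located far from the origin.
M, N = 200, 2
offset = np.array([10.0, 5.0])
X = offset + rng.normal(scale=[1.0, 0.3], size=(M, N))

# PCA with no preprocessing: SVD of the raw matrix (a force fit about zero).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]  # loadings of the first principal component

# Fraction of the total sum-of-squares about zero captured by each PC.
ss_captured = s**2 / np.sum(s**2)

# Unit vector pointing from the origin toward the data mean.
mean_dir = X.mean(axis=0)
mean_dir = mean_dir / np.linalg.norm(mean_dir)

# |cosine| between PC 1 and the mean direction; a value near 1 means
# PC 1 is mostly describing where the data sit, not how they vary.
alignment = abs(pc1 @ mean_dir)

print("sum-of-squares captured:", ss_captured)
print("alignment of PC 1 with the mean direction:", alignment)
```

With data this far from the origin, the first PC is dominated by the offset, which is one reason mean-centering is so often applied before PCA.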
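Next, define the $N \times 1$ mean of the $M \times N$ data matrix $\mathbf{X}$ as $\bar{\mathbf{x}}$. The mean is calculated down the rows of $\mathbf{X}$ so that for the $n$th column of $\mathbf{X}$ (i.e., the $n$th element of the vector $\bar{\mathbf{x}}$) the mean is a scalar $\bar{x}_n$ and is given by
$$\bar{x}_n = \frac{1}{M}\sum_{m=1}^{M} x_{m,n} \qquad (1)$$
where $x_{m,n}$ is the element in row $m$ and column $n$ of $\mathbf{X}$.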
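The mean-centered data $\mathbf{X}_{mc}$ is then calculated by subtracting the column mean from the corresponding column so that
$$\mathbf{X}_{mc} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^{T} \qquad (2)$$
where $\bar{\mathbf{x}}$ is the $N \times 1$ vector of column means and $\mathbf{1}$ is an $M \times 1$ vector of ones.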
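As a minimal sketch, Equations (1) and (2) can be written in a few lines of NumPy. The 4x3 matrix below is made up for illustration, and the variable names (`x_bar`, `X_mc`) are chosen here and are not taken from any particular software package.

```python
import numpy as np

# Hypothetical data matrix: M = 4 samples (rows), N = 3 variables (columns).
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])
M, N = X.shape

# Equation (1): mean of each column, calculated down the rows.
x_bar = X.sum(axis=0) / M  # shape (N,)

# Equation (2): subtract the outer product of an M x 1 vector of ones
# and the transposed mean vector, so each column loses its own mean.
ones = np.ones((M, 1))
X_mc = X - ones @ x_bar.reshape(1, N)

# NumPy broadcasting gives the same result without building the M x N mean matrix.
X_mc_broadcast = X - x_bar

print("column means:", x_bar)
print("column means after centering:", X_mc.mean(axis=0))
```

After centering, every column of `X_mc` has a mean of zero, so a subsequent PCA describes variation about the data mean instead of about the origin.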