# Advanced Preprocessing: Introduction

Data preprocessing is almost always employed in multivariate analysis, but because so many preprocessing methods are available, the topic can be confusing.

In general, the objective of data preprocessing is to separate the signal of interest from clutter, where clutter is defined as all signal that is not of interest (e.g., signal attributable to interferences and noise). This means that the preprocessing method appropriate for a given data set depends on the data analysis objective and on how the signal and clutter manifest in the data.

Many papers have been written on the use of various preprocessing methods. The intention here is to provide only a general introduction to why preprocessing is performed, how the math is implemented, and in what order different methods may be used. Simple preprocessing methods are used as examples to introduce these concepts.

### Preprocessing in Context of a Model

Preprocessing is typically performed prior to data analysis methods such as principal components analysis (PCA) or partial least squares regression (PLS). Recall that PCA maximizes the capture of sum-of-squares with factors or principal components (PCs) within a single block of data, and PLS is a slightly more complicated method that finds linear relationships between two blocks of data. This introduction will use PCA in the examples. Two of the simplest examples of preprocessing are mean-centering and autoscaling, and these two methods are described in more detail below. First, however, the data analysis objective with no preprocessing is discussed.

### Concepts of Calibration and Application of Preprocessing

Imagine that an MxN data matrix ${\displaystyle \mathbf {X} }$ is available and the objective is to perform exploratory analysis of this data using PCA. Recall that samples (or objects) correspond to the rows of ${\displaystyle \mathbf {X} }$ and variables correspond to the columns. If no preprocessing is applied prior to the PCA decomposition, then the PCA loadings will capture the most sum-of-squares in ${\displaystyle \mathbf {X} }$ centered about the origin (i.e., the model is a force fit about zero). In this case, the first principal component (PC) will point in the direction that captures the most sum-of-squares about zero (variance about zero).

Next, define the Nx1 mean of data matrix ${\displaystyle \mathbf {X} }$ as ${\displaystyle \mathbf {\bar {x}} }$. The mean is calculated down the rows of ${\displaystyle \mathbf {X} }$ so that for the nth column of ${\displaystyle \mathbf {X} }$ (i.e., the nth element of the vector ${\displaystyle \mathbf {\bar {x}} }$) the mean is a scalar and is given by

 ${\displaystyle {\bar {x}}_{n}={\frac {1}{M}}\sum _{m=1}^{M}{x_{m,n}}}$ (1)

The mean centered data ${\displaystyle \mathbf {X} _{mncn}}$ is then calculated by subtracting the column mean from the corresponding column so that

 ${\displaystyle x_{m,n,mncn}=x_{m,n}-{\bar {x}}_{n}}$ for ${\displaystyle n=1,...,N;m=1,...,M}$

 ${\displaystyle \mathbf {x} _{n,mncn}=\mathbf {x} _{n}-\mathbf {1} {\bar {x}}_{n}}$ for ${\displaystyle n=1,...,N}$

 ${\displaystyle \mathbf {X} _{mncn}=\mathbf {X} -\mathbf {1} {\bar {\mathbf {x} }}^{T}}$ (2)

where ${\displaystyle \mathbf {1} }$ is an Mx1 vector of ones (typically it is assumed that ${\displaystyle \mathbf {1} }$ is of appropriate size) and T is the transpose operator. (The three notations in Equation 2 provide identical results, but the simplicity of the last form shows why the linear algebra notation is often preferred.)

The first step in mean-centering, represented by Equation 1, is to calculate the mean of each column of ${\displaystyle \mathbf {X} }$. This procedure can be considered “calibration” of the mean-centering preprocessing, and it consists of estimating the mean from the “calibration” data ${\displaystyle \mathbf {X} }$. The second step, represented by Equation 2, subtracts the mean from the data. This procedure can be considered “applying” the centering to the data ${\displaystyle \mathbf {X} }$.

The first PC of a PCA model of ${\displaystyle \mathbf {X} _{mncn}}$ will then capture the most sum-of-squares about the mean ${\displaystyle {\bar {\mathbf {x} }}}$ (variance about the mean or simply ‘variance’). The mean is now a part of the overall PCA model “calibrated” on the “calibration” data ${\displaystyle \mathbf {X} }$, and the mean-centering operation has changed what sum-of-squares is captured by the first PC. In other words, the preprocessing has changed the data to get the PCA model to focus on a different type of variance. As a result, the PCA model must be interpreted differently during the exploratory analysis.
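The two steps of mean-centering can be sketched in a few lines of NumPy. This is only an illustration: the data values and the variable names (`X`, `x_bar`, `X_mncn`) are made up for the example and are not taken from any particular software package.

```python
import numpy as np

# Hypothetical calibration data: M = 4 samples (rows), N = 3 variables (columns).
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 300.0],
              [3.0, 30.0, 200.0],
              [6.0, 20.0, 400.0]])

# "Calibration" (Equation 1): estimate the mean of each column.
x_bar = X.mean(axis=0)                      # length-N vector

# "Application" (Equation 2), written literally as X - 1 * x_bar^T.
ones = np.ones((X.shape[0], 1))             # Mx1 vector of ones
X_mncn = X - ones @ x_bar[np.newaxis, :]

# NumPy broadcasting gives the same result without forming the vector of ones.
assert np.allclose(X_mncn, X - x_bar)

print(x_bar)                 # column means of the calibration data
print(X_mncn.mean(axis=0))   # column means after centering (numerically zero)
```

Keeping `x_bar` as a separate object, rather than recomputing it whenever data are centered, mirrors the distinction between calibrating and applying the preprocessing.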

Next, assume that a new M2xN data matrix ${\displaystyle \mathbf {X} _{2}}$ is available, where M2 ≥ 1. To apply the PCA model calibrated above to the new data, the new data set must first be centered to the mean of the calibration data. The preprocessing is “applied” to the new data ${\displaystyle \mathbf {X} _{2}}$ using a procedure analogous to Equation 2 as follows:

 ${\displaystyle \mathbf {x} _{2,n,mncn}=\mathbf {x} _{2,n}-\mathbf {1} {\bar {x}}_{n}}$ (3)
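A short NumPy sketch of Equation 3 follows, using made-up numbers and a single new sample (M2 = 1). The point it illustrates is that the new data are centered with the mean estimated from the calibration data, not with their own mean.

```python
import numpy as np

# Hypothetical calibration data and the mean estimated from it (Equation 1).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
x_bar = X.mean(axis=0)

# Hypothetical new data with a single sample (M2 = 1).
X2 = np.array([[4.0, 25.0]])

# Equation 3: center the new data with the calibration mean.
X2_mncn = X2 - x_bar

# Centering the new data with its own mean would be a different operation;
# for a single sample it would always return zeros and discard the information.
X2_self_centered = X2 - X2.mean(axis=0)

print(X2_mncn)
print(X2_self_centered)
```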

### Calibration and Application of Autoscaling

Autoscaling of the data is treated in a manner very similar to mean-centering but the preprocessing includes an additional step. During calibration, Equation 1 is first used to estimate the mean ${\displaystyle \mathbf {\bar {x}} }$ of the calibration data ${\displaystyle \mathbf {X} }$. Next, the standard deviation of each column is calculated using:

 ${\displaystyle s_{n}=\left[{\frac {1}{M-1}}\sum _{m=1}^{M}{\left(x_{m,n}-{\bar {x}}_{n}\right)^{2}}\right]^{1/2}}$ (4)

Equations 1 and 4 correspond to “calibration” of the autoscaling preprocessing procedure. After calibration, the mean-centered columns are divided by the corresponding standard deviation as follows:

 ${\displaystyle \mathbf {x} _{n,auto}={\frac {\mathbf {x} _{n,mncn}}{s_{n}}}={\frac {\mathbf {x} _{n}-\mathbf {1} {\bar {x}}_{n}}{s_{n}}}}$ (5)

Autoscaling includes both mean-centering and division by the standard deviation, and Equation 5 corresponds to “applying” the preprocessing to the calibration data. Equation 6 shows how the preprocessing is applied to new data ${\displaystyle \mathbf {X} _{2}}$.

 ${\displaystyle \mathbf {x} _{2,n,auto}={\frac {\mathbf {x} _{2,n,mncn}}{s_{n}}}={\frac {\mathbf {x} _{2,n}-\mathbf {1} {\bar {x}}_{n}}{s_{n}}}}$ (6)

In summary, the autoscaling preprocessing parameters ${\displaystyle {\bar {x}}_{n}}$ and ${\displaystyle s_{n}}$ for n=1,...,N were estimated from the calibration data ${\displaystyle \mathbf {X} }$, and the application step used these parameters to center and scale both ${\displaystyle \mathbf {X} }$ and new data ${\displaystyle \mathbf {X} _{2}}$. It should be clear that the calibration data must be sufficiently representative of what is expected in the future if the estimated preprocessing parameters are to adequately represent the mean and standard deviation of new data. Also, variables (columns) with large standard deviation are now down-weighted relative to variables with small standard deviation. This changes the relative sum-of-squares for the preprocessed data and the first PC will now capture the largest sum-of-squares relative to the mean of the weighted matrix ${\displaystyle \mathbf {X} _{auto}}$.
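The calibration and application steps of autoscaling can be sketched as two small functions. The function names `calibrate_autoscale` and `apply_autoscale` and the data values are hypothetical, chosen only for this illustration. `ddof=1` is used so that the standard deviation matches the 1/(M-1) normalization of Equation 4.

```python
import numpy as np

def calibrate_autoscale(X):
    """Estimate column means (Equation 1) and standard deviations (Equation 4)."""
    x_bar = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)   # ddof=1 gives the 1/(M-1) normalization
    return x_bar, s

def apply_autoscale(X, x_bar, s):
    """Center and scale data with previously estimated parameters (Equations 5 and 6)."""
    # Note: a column with zero standard deviation would cause a division by zero.
    return (X - x_bar) / s

# Hypothetical calibration data (M = 4, N = 2) and one new sample (M2 = 1).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0],
              [6.0, 400.0]])
X2 = np.array([[5.0, 250.0]])

x_bar, s = calibrate_autoscale(X)
X_auto = apply_autoscale(X, x_bar, s)     # Equation 5
X2_auto = apply_autoscale(X2, x_bar, s)   # Equation 6

print(X_auto.std(axis=0, ddof=1))   # each calibration column now has unit standard deviation
print(X2_auto)
```

After autoscaling, every calibration column has zero mean and unit standard deviation, so each variable contributes the same sum-of-squares to the PCA decomposition regardless of its original units.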

### Other Preprocessing Methods

Although outside the scope of the present introduction, it should be noted that some preprocessing methods do not operate down the rows but instead operate across the columns. For such methods, parameters such as the mean and standard deviation might not be estimated from the calibration data. However, these methods most often include settings or parameters that dictate how they operate, and it is important that all the data are treated similarly. As a result, the preprocessing settings are a part of the model just like the estimated means and standard deviations.

In either case, it should be clear that estimated preprocessing parameters and settings for the preprocessing are all a part of the model established during the “calibration” step, and that these parameters and settings are stored as a part of the model. Subsequently, during the model application step, the stored preprocessing parameters are applied to new data. The two-step procedure of model “calibration” and model “application” includes preprocessing as well as data modeling such as PCA.
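One way to picture preprocessing parameters being stored as a part of the model is a small object that holds both the autoscaling parameters and the PCA loadings. The class below is a hypothetical sketch, not the interface of any existing package. It assumes PCA is computed from the singular value decomposition of the preprocessed data, and it uses randomly generated data.

```python
import numpy as np

class AutoscaledPCAModel:
    """Hypothetical container: autoscaling parameters plus PCA loadings."""

    def __init__(self, n_components):
        self.n_components = n_components

    def calibrate(self, X):
        # Preprocessing calibration: estimate and store mean and standard deviation.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0, ddof=1)
        X_auto = (X - self.mean_) / self.std_
        # PCA calibration: loadings are the leading right singular vectors.
        _, _, Vt = np.linalg.svd(X_auto, full_matrices=False)
        self.loadings_ = Vt[: self.n_components].T    # N x K
        return self

    def apply(self, X_new):
        # Application: reuse the stored parameters; nothing is re-estimated.
        X_auto = (np.atleast_2d(X_new) - self.mean_) / self.std_
        return X_auto @ self.loadings_                # scores, M2 x K

# Made-up calibration data: 20 samples, 3 variables on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 50.0, 500.0], scale=[1.0, 10.0, 100.0], size=(20, 3))

model = AutoscaledPCAModel(n_components=2).calibrate(X)
scores_cal = model.apply(X)            # scores for the calibration data
scores_new = model.apply(X[0] + 0.1)   # scores for a single new sample
```

Because `apply` only reads the stored `mean_`, `std_`, and `loadings_`, calibration data and new data are guaranteed to be treated identically.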

It should also be clear that preprocessing can change the focus of the data modeling procedure. For example, PCA always captures the most sum-of-squares in the first PC. However, the different preprocessing methods examined above changed what sum-of-squares was the focus of the PCA decomposition. It is in this way that preprocessing can be used to tune what variance is captured by the PCA or PLS model.
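The change of focus can be seen in a small constructed example. The data below are made up so that the column means are far from the origin while the variation about the mean lies along a different direction. The helper `first_pc` is a hypothetical name, and it assumes PCA loadings are taken from the singular value decomposition.

```python
import numpy as np

def first_pc(X):
    """Return the first PCA loading vector of X (sign is arbitrary)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]

# Made-up data: both columns have mean 10, and the variation about the mean
# lies along the direction (1, -1).
t = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.column_stack([10.0 + t, 10.0 - t])

# No preprocessing: the first PC captures sum-of-squares about zero,
# so it points toward the mean, along (1, 1)/sqrt(2).
p_raw = first_pc(X)

# Mean-centering: the first PC captures sum-of-squares about the mean,
# so it points along (1, -1)/sqrt(2).
p_mncn = first_pc(X - X.mean(axis=0))

print(np.abs(p_raw))
print(np.abs(p_mncn))
```

The same decomposition applied to the same measurements gives orthogonal first loadings, which is why a PCA model must be interpreted in light of the preprocessing that preceded it.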

### Conclusions and Other Reading

This brief introduction described how preprocessing is calibrated (based on calibration data) and applied (to both the calibration and new test data). A more detailed discussion of mean-centering and autoscaling for PCA can be found in Wise, B.M. and Gallagher, N.B., "The Process Chemometrics Approach to Chemical Process Monitoring and Fault Detection," J. Proc. Cont. 6(6), 329-348 (1996).

A thorough discussion of the objectives, theory, and equations associated with specific preprocessing methods can be found using the list of preprocessing methods on the Preprocessing Methods page.