Glsw: Difference between revisions
imported>Jeremy |
imported>Neal |
Revision as of 09:28, 6 March 2012
Purpose
Calculate or apply Generalized Least Squares weighting (GLSW), External Parameter Orthogonalization (EPO), and Extended Mixture Model (EMM) filters. See also GLSW_Settings_GUI.
Synopsis
- modl = glsw(x,a); %GLS on matrix
- modl = glsw(x1,x2,a); %GLS between two data sets
- modl = glsw(x,y,a); %GLS on matrix in groups based on y
- modl = glsw(modl,a); %Update model to use a new value
- xt = glsw(newx,modl,options); %apply correction
- xt = glsw(newx,modl,a); %apply correction
Description
This filter uses a Generalized Least Squares (GLS) based weighting strategy to down-weight features identified from the singular value decomposition of a clutter data matrix. Clutter is context dependent and the cases are described in detail below.
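If the singular value decomposition (SVD) of the input matrix x is $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^{T}$ then the deweighting matrix is estimated with the following pseudo-inverse:
- $\mathbf{W} = \mathbf{U}\,\mathrm{diag}\left(\sqrt{1/(\mathrm{diag}(\mathbf{S})/a^{2}+1)}\right)\mathbf{V}^{T} = \mathbf{U}\mathbf{S}_{inv}\mathbf{V}^{T}$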
where $\mathbf{S}_{inv}$ corresponds to a regularized inverse of the singular values. The adjustable parameter a is a regularization parameter used to scale the singular values prior to calculating their inverse. As a gets larger, the extent of deweighting decreases (because $\mathbf{S}_{inv}$ approaches 1). As a gets smaller (e.g., 0.1 decreasing to 0.001), the extent of deweighting increases (because $\mathbf{S}_{inv}$ approaches 0) and the deweighting includes increasing amounts of the directions represented by smaller singular values. A good initial guess for a is $1 \times 10^{-2}$, but the best value will vary depending on the covariance structure of X and the specific application. It is recommended that a number of different values be investigated using an external cross-validation metric for performance evaluation.
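The effect of a on the weights can be checked numerically. The following is a minimal Python sketch of the scalar weighting expression only, not the toolbox's MATLAB implementation; the function name `s_inv` and the example singular values are hypothetical and chosen purely for illustration.

```python
import math

def s_inv(s, a):
    """Weight applied to one singular value s for regularization parameter a > 0.

    Implements sqrt(1 / (s / a^2 + 1)) as written in the description above.
    """
    return math.sqrt(1.0 / (s / a**2 + 1.0))

# Hypothetical singular values: one large, one moderate, one small.
singular_values = [10.0, 1.0, 0.01]

# Sweep a from large to small and print the resulting weights.
for a in (1.0, 0.1, 0.01, 0.001):
    weights = [round(s_inv(s, a), 4) for s in singular_values]
    print(f"a = {a:<6} weights = {weights}")
```

For large a every weight stays near 1 (little deweighting); as a shrinks, first the large and then the smaller singular values are pushed toward 0.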
For more information see H. Martens, M. Høy, B.M. Wise, R. Bro and P.B. Brockhoff, "Pre-whitening of data by covariance-weighted pre-processing," J. Chemom., 17(3), 153-165, 2003.
This function will also perform EPO (External Parameter Orthogonalization) and EMM (Extended Mixture Modeling) filtering. EPO is GLSW with a filter built from a specific number of singular vectors rather than the weighting scheme described above; EMM is EPO orthogonalizing to all available singular vectors. To perform EPO, a negative integer is supplied in place of (a), where (-a) specifies the number of singular vectors to include in the filter. This is GLSW with a square-wave function for the deweighting, i.e., the first (-a) singular values of $\mathbf{S}_{inv}$ are set to zero and the remaining singular values are set to 1. To perform EMM, negative infinity (-inf) is supplied in place of (a).
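The square-wave weighting used for EPO and EMM can be sketched as follows. This is an illustrative Python sketch under the reading given above, not the toolbox code; the helper name `square_wave_weights` and its argument `n_values` (the number of available singular values) are assumptions introduced here.

```python
import math

def square_wave_weights(n_values, a):
    """Weights for EPO (a is a negative integer) or EMM (a is -inf).

    The first -a weights are 0 (those directions are removed) and the
    rest are 1 (left untouched). For a = -inf every weight is 0.
    """
    if a >= 0:
        raise ValueError("EPO/EMM expect a negative value of a")
    if math.isinf(a):
        k = n_values                 # EMM: all available directions
    else:
        k = min(int(-a), n_values)   # EPO: the first -a directions
    return [0.0] * k + [1.0] * (n_values - k)

print(square_wave_weights(5, -2))          # EPO with two singular vectors
print(square_wave_weights(5, -math.inf))   # EMM
```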
Finally, an alternative method to use GLSW is in quantitative analysis where a continuous y-variable is used to develop pseudo-groupings of samples in X by comparing the differences in the corresponding y values. This is referred to as the "gradient method" because it utilizes a gradient of the sorted X- and y-blocks to calculate a covariance matrix. For more information on this method, see the chapter discussing Preprocessing in the PLS_Toolbox Manual.
For calibration of the GLSW model modl, inputs can be provided by one of four methods:
1) modl = glsw(x,a)
- x = a clutter data or covariance matrix containing features to be downweighted, and
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
- Note: If x is a dataset with classes, differences within each class are used for down-weighting (i.e., intra-class variance is considered clutter). This reduces intra-class variation but ignores the inter-class variation. Only classes with class numbers >0 are included in the clutter calculation (see DataSet object for more information).
2) modl = glsw(x1,x2,a)
- x1 = an M by N data matrix and
- x2 = an M by N data matrix.
- The clutter is defined as x = x1-x2, the row-by-row differences between x1 and x2. The input data represent two or more measured populations which should otherwise be the same (e.g., the same samples measured on two different analyzers or using different solvents).
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
3) modl = glsw(x,y,a)
- x = an M by N data matrix,
- y = column vector of integers with M rows specifying sample groups in x within which differences should be downweighted.
- Note: This method is identical to method (1) when classes of the X-block are used to identify groups. The only difference is that the groups are identified from the separate input y instead of the dataset classes. If y is empty, this defaults to method (1) without class information where x is then defined as the clutter data matrix.
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
4) modl = glsw(x,y,a)
- x = an M by N data matrix,
- y = column vector with M rows specifying a y-block continuous variable. In this input, the "gradient method" is used to identify similar samples and downweight differences between them. See also the gradientthreshold option below.
- a = scalar regularization parameter that governs downweighting {default = 1e-2}.
The input a can be replaced with an options structure (see Options below).
When applying a GLSW model the inputs are newx, the X-block to be deweighted, and modl, a GLSW model structure.
Outputs are modl, a GLSW model structure, and xt, the deweighted X-block.
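How the clutter matrix is formed in methods (2) and (3) can be illustrated with a small Python sketch. It is not the toolbox implementation: the function names are hypothetical, data are plain lists of rows, and treating "differences within each group" as subtraction of the group mean is an assumption (it corresponds to the meancenter option being 'yes'). Excluding labels that are not greater than 0 mirrors the note under method (1) and is assumed to carry over to y.

```python
from collections import defaultdict

def difference_clutter(x1, x2):
    """Method 2: row-by-row differences x1 - x2 (both M by N)."""
    if len(x1) != len(x2) or any(len(r1) != len(r2) for r1, r2 in zip(x1, x2)):
        raise ValueError("x1 and x2 must have the same shape")
    return [[v1 - v2 for v1, v2 in zip(r1, r2)] for r1, r2 in zip(x1, x2)]

def within_group_clutter(x, y):
    """Methods 1 and 3 (assumed form): subtract each group's mean from its rows."""
    groups = defaultdict(list)
    for row, label in zip(x, y):
        if label > 0:  # assumption: only class numbers > 0 contribute
            groups[label].append(row)
    clutter = []
    for rows in groups.values():
        # Column means of this group, then the centered rows.
        means = [sum(col) / len(rows) for col in zip(*rows)]
        clutter.extend([v - m for v, m in zip(row, means)] for row in rows)
    return clutter

print(difference_clutter([[1.0, 2.0], [3.0, 5.0]], [[0.5, 2.0], [1.0, 1.0]]))
print(within_group_clutter([[1.0, 2.0], [3.0, 4.0], [7.0, 7.0]], [1, 1, 0]))
```

In both cases the resulting rows contain only the variation that should be downweighted; the SVD-based weighting described earlier is then derived from that clutter.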
Options
An options structure can be used in place of (a) for any call or as the third input in an apply call. This structure consists of any of the fields:
- a: [ 0.02 ] scalar parameter limiting downweighting {default = 1e-2},
- meancenter: [ 'no' | {'yes'} ] For single x-block modes only: governs the calculation of a mean of each group of data before calculating the covariance. If set to no, the filter will include the offset of each group. This is equivalent to saying the offset in the data is part of the clutter which should be removed.
- applymean: [ 'no' | {'yes'} ] governs the use of the mean difference calculated between two instruments (difference between two instruments mode). When applying a GLS filter to data collected on the x1 instrument, the mean should NOT be applied. Data collected on the SECOND instrument should have the mean applied.
- gradientthreshold: [ .25 ] "continuous variable" threshold fraction above which the column gradient method will be used with a continuous y. Usually, when (y) is supplied, it is assumed to be the identification of discrete groups of samples. However, when calibrating, the number of samples in each "group" is calculated and the fraction of samples in "singleton" groups (i.e., in their own group) is determined.
- fraction = (# Samples in Singleton Groups) / Total Samples
- If this fraction is above the value specified by this option, (y) is considered a continuous variable (such as a concentration or other property to predict). In these cases, the "sample similarity" (a.k.a. "column gradient") method of calculating the covariance matrix will be used. The sample similarity method determines the down-weighting required based mostly on samples which are the most similar (on the specified y-scale). Set to >=1 to disable and to 0 (zero) to always use.
- maxpcs: [ 50 ] maximum number of components (factors) to allow in the GLSW model. Typically, the number of factors included in a model will be the smallest of this number, the number of variables, or the number of samples. Having a limit set here is useful when deriving a GLSW model from a large number of samples and variables. Often, a GLSW model effectively uses fewer than 20 components. Thus, this option can be used to keep the GLSW model smaller in size. It may, however, decrease its effectiveness if critical factors are not included in the model.
- classset: [ 1 ] indicates which class set in x to use when no y-block is provided.
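The singleton-fraction test described under gradientthreshold can be sketched in a few lines of Python. This is an illustration of the stated formula only; the function names are hypothetical, and the use of a strict "greater than" comparison at the boundary is an assumption (the toolbox may treat a threshold of exactly 0 specially so that the gradient method is always used).

```python
from collections import Counter

def singleton_fraction(y):
    """Fraction of samples whose y value occurs exactly once."""
    counts = Counter(y)
    singletons = sum(1 for value in y if counts[value] == 1)
    return singletons / len(y)

def use_gradient_method(y, gradientthreshold=0.25):
    """True when y would be treated as continuous (assumed strict comparison)."""
    return singleton_fraction(y) > gradientthreshold

group_labels = [1, 1, 2, 2, 3]          # mostly repeated values: discrete groups
concentrations = [0.1, 0.5, 0.9, 1.3]   # every value unique: continuous y

print(singleton_fraction(group_labels), use_gradient_method(group_labels))
print(singleton_fraction(concentrations), use_gradient_method(concentrations))
```

A threshold of 1 or more can never be exceeded, which matches the statement that such a setting disables the gradient method.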