Declutter Settings Window: Difference between revisions

From Eigenvector Research Documentation Wiki
imported>Jeremy

Revision as of 13:39, 12 August 2009

GLSW / EPO Settings GUI How To

The GLSW / EPO Settings GUI allows modification of the settings for the Generalized Least Squares Weighting (GLSW) and External Parameter Orthogonalization (EPO) filters.

GLSW and EPO are "covariance filters" which identify patterns in the variables of the data which should be down-weighted or removed. Covariance filters are an effective way to remove interfering signal (known as "clutter") from data prior to building a model. The algorithmic details of the GLSW and EPO filters can be found on the glsw page. This page describes the different options controlled by the settings GUI.

Clutter Source

GLSW and EPO require identification of "clutter" signal which you want removed from your data. There are four options to identify the clutter source:

  • x-block classes: Used when the data contains classes which define sets of "similar" samples. The filter will down-weight features which make the data diverse within each class. The first class set defined for the x-block rows (samples) will be used to group samples. Class zero will be ignored, but all other classes will be combined and the differences within each class will be used to create the filter.
  • y-block gradient: Used in regression or classification models when the y-block contains information on which samples are related to each other. When the y-block is "discrete" (a small number of unique values), this filter behaves the same as the "x-block classes" method described above. When the y-block is "continuous" (a range of different values are present with small and large variations between the different samples), the y-variable is used to develop pseudo-groupings of samples in X by comparing the differences in the corresponding y values. This is referred to as the "gradient method" because it utilizes a gradient of the sorted X and y blocks to calculate a covariance matrix. For more information on this method, see the chapter discussing Preprocessing in the PLS_Toolbox Manual.
  • automatic: Automatically chooses between the x-block classes and y-block gradient modes depending on the information available. If a y-block is present (e.g. in regression models), the y-block gradient method is used. Otherwise, x-block classes are used (if present). If neither is present, no filtering is done.
  • external data: Allows use of a user-supplied set of data which should be considered "clutter". All features within this data will be considered when building the filter. The "Load" button allows loading of data from the workspace or a file (to import data, first use the workspace browser to import the data into the main workspace, then load it from there). Note that the number of variables in the loaded data must match the number of variables in the data to which the filter will be applied. If the loaded data is a DataSet object, the "Edit" button can be used to review and modify the data.
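
The selection rule described for the automatic option can be sketched in a few lines. This is only an illustration in Python, not PLS_Toolbox code; the function name `choose_clutter_source` and its two boolean arguments are hypothetical.

```python
def choose_clutter_source(has_y_block, has_x_classes):
    """Mimic the 'automatic' rule described above (hypothetical helper)."""
    if has_y_block:
        # A y-block takes priority, e.g. in regression models.
        return "y-block gradient"
    if has_x_classes:
        # No y-block, but the x-block samples carry class information.
        return "x-block classes"
    # Neither source of clutter information is available: no filtering.
    return None


for has_y, has_classes in [(True, True), (False, True), (False, False)]:
    print(has_y, has_classes, choose_clutter_source(has_y, has_classes))
```

Note that a y-block wins even when x-block classes are also present, which is why the y-block check comes first.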

Algorithm

Two algorithms are available:

  • GLSW - Generalized Least Squares Weighting: this algorithm performs a "soft" orthogonalization to the clutter. Essentially, a PCA model is created from the clutter and the variance found is down-weighted. The value for alpha defines the extent to which the clutter directions are down-weighted. As alpha gets larger, the extent of down-weighting decreases. As alpha gets smaller (e.g. 0.1 to 0.001), the extent of down-weighting increases. A good initial guess for alpha is 0.02, but good results will depend on the covariance structure of the clutter and the specific application. It is recommended that a number of different values be investigated using some external cross-validated metric for performance evaluation.
  • EPO - External Parameter Orthogonalization: this algorithm performs a "hard" orthogonalization to the clutter. A PCA model is calculated for the clutter and the given number of principal components (PCs) is extracted. The filter then orthogonalizes (removes) all the variance which matches these PCs. If the selected number of PCs is large, more variance will be removed and the filter may remove variance which is not clutter, but part of the signal of interest. Note that EPO is similar to performing GLSW with a very small alpha value, but not exactly the same, because EPO uses only the specified number of PCs whereas GLSW uses a weighted set of many PCs.
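
The difference between "soft" and "hard" orthogonalization can be illustrated by comparing how strongly each clutter direction is scaled. The sketch below is in Python and makes an assumption not stated on this page: that GLSW scales a clutter direction with eigenvalue s² by 1/sqrt(s²/alpha + 1). Check the glsw page for the exact definition. The function names and the eigenvalues are made up for illustration.

```python
import math


def glsw_weights(eigenvalues, alpha):
    """Scale factor per clutter direction, ASSUMING 1/sqrt(s^2/alpha + 1).

    eigenvalues: eigenvalues (s^2) of the clutter covariance, largest first.
    """
    return [1.0 / math.sqrt(ev / alpha + 1.0) for ev in eigenvalues]


def epo_weights(eigenvalues, n_pcs):
    """EPO-style factors: 0 for the first n_pcs directions, 1 for the rest."""
    return [0.0 if i < n_pcs else 1.0 for i in range(len(eigenvalues))]


# Made-up clutter eigenvalues, sorted from largest to smallest.
clutter_eigenvalues = [4.0, 1.0, 0.01]

for alpha in (10.0, 0.02, 0.001):
    weights = glsw_weights(clutter_eigenvalues, alpha)
    print("GLSW alpha =", alpha, [round(w, 3) for w in weights])

print("EPO with 2 PCs:", epo_weights(clutter_eigenvalues, 2))
```

Under this assumption, a large alpha leaves every factor close to 1 (little down-weighting), while a small alpha pushes the factors for high-variance clutter directions toward 0. EPO instead removes the chosen directions completely and leaves all others untouched.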

In addition, the user can select whether or not to Mean Center Groups. This checkbox is generally left checked, which indicates that the variance within each clutter group should be removed and the mean of each group ignored. If unchecked, the filter will include the offset of each group. This is equivalent to saying that the offset in the data is part of the clutter which should be removed.
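
The effect of the Mean Center Groups checkbox can be sketched as follows. This is one plausible reading written in plain Python, not the PLS_Toolbox implementation: the helper `clutter_rows` and its arguments are hypothetical, and the rule that class zero is skipped is taken from the x-block classes description.

```python
def clutter_rows(samples, classes, mean_center_groups=True):
    """Collect the rows that would describe the clutter (hypothetical helper).

    samples: list of rows (lists of floats); classes: one int label per row.
    Rows labelled 0 are ignored.
    """
    groups = {}
    for row, label in zip(samples, classes):
        if label != 0:
            groups.setdefault(label, []).append(row)

    out = []
    for label in sorted(groups):
        rows = groups[label]
        if mean_center_groups:
            # Subtract the group mean so only within-group variation remains.
            n = len(rows)
            means = [sum(col) / n for col in zip(*rows)]
            rows = [[v - m for v, m in zip(r, means)] for r in rows]
        # If not centered, each group's offset stays in the clutter.
        out.extend(rows)
    return out


X = [[1.0, 2.0], [3.0, 2.0], [10.0, 5.0], [12.0, 9.0], [100.0, 100.0]]
labels = [1, 1, 2, 2, 0]

print(clutter_rows(X, labels, mean_center_groups=True))
print(clutter_rows(X, labels, mean_center_groups=False))
```

With centering on, the large difference between the two group means does not enter the clutter. With centering off, it does, so the filter would also treat that offset as something to remove.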