Selectvars

From Eigenvector Research Documentation Wiki

Purpose

SELECTVARS selects variables that are predictive.

Synopsis

results = selectvars(X,Y,maxlv,options)

Description

Performs model-based variable selection using PLS, iteratively reducing X by comparing RMSECV values, analyzing the scores, and removing the variables with the lowest influence on prediction.

Inputs are the X and Y data (X,Y), the maximum number of latent variables to be used (maxlv), and the options structure for selectvars (options).

The automated variable selection tries both VIP and selectivity ratio (SR) and presents only the selection that gives the best RMSECV. The same approach is used for both metrics. In the first run, the R percent of variables with the lowest VIP (or SR) values are eliminated. If the model improves, the elimination is repeated, and this continues until the model no longer improves.
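The elimination loop described above can be pictured with a short Python sketch. It is an illustration only, not the toolbox code: the callables `importance` (standing in for VIP or SR values) and `cv_error` (standing in for RMSECV), the helper names, and the toy data are all hypothetical. The sketch stops as soon as the error gets worse, which is an assumption about how ties are handled.

```python
import math

def eliminate(n_vars, importance, cv_error, fraction=0.1, max_iter=20):
    """Drop the lowest-importance fraction of variables until the error worsens."""
    selected = list(range(n_vars))
    best_err = cv_error(selected)
    for _ in range(max_iter):
        n_drop = max(1, math.floor(fraction * len(selected)))
        if n_drop >= len(selected):
            break  # nothing would be left to model
        scores = importance(selected)  # one score per selected variable
        order = sorted(range(len(selected)), key=lambda i: scores[i])
        dropped = set(order[:n_drop])  # positions of the lowest scores
        candidate = [v for i, v in enumerate(selected) if i not in dropped]
        err = cv_error(candidate)
        if err > best_err:
            break  # model got worse: keep the previous selection
        selected, best_err = candidate, err
    return selected, best_err

# Toy problem (hypothetical): variables 0-2 are informative, 3-5 are noise.
SCORES = {0: 5.0, 1: 4.0, 2: 3.0, 3: 0.3, 4: 0.2, 5: 0.1}

def toy_importance(selected):
    return [SCORES[v] for v in selected]

def toy_error(selected):
    noise = sum(1 for v in selected if v >= 3)
    missing = sum(1 for v in (0, 1, 2) if v not in selected)
    return noise + 10 * missing

if __name__ == "__main__":
    print(eliminate(6, toy_importance, toy_error, fraction=0.34))
```

In the toy problem the loop removes the three noise variables over two iterations and stops when dropping an informative variable would raise the error.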

For some types of data it is best to remove a large fraction of the variables in each run; for others, a smaller fraction works better. To find an appropriate fraction, the removal is run with a number of different fractions. The values are given in the option .fractionstotest; the defaults are [2 5 8 10 15 20 25 30 35 40 45]/100, that is, from 2% to 45%.
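As a rough picture of this sweep, the Python sketch below runs one trial per metric and per fraction and keeps the trial with the lowest RMSECV. It is a sketch under stated assumptions: `run_trial` is a hypothetical callable that performs one complete elimination run and returns the selected indices and the resulting RMSECV, and `toy_trial` is invented purely to make the example runnable.

```python
# Default fractions listed for the .fractionstotest option.
FRACTIONS = [f / 100 for f in (2, 5, 8, 10, 15, 20, 25, 30, 35, 40, 45)]

def best_trial(run_trial, fractions=FRACTIONS, methods=("vip", "sratios")):
    """Run one trial per (method, fraction) pair and keep the lowest RMSECV."""
    best = None
    for method in methods:
        for fraction in fractions:
            selected, rmsecv = run_trial(method, fraction)
            if best is None or rmsecv < best["rmsecv"]:
                best = {"method": method, "fraction": fraction,
                        "selected": selected, "rmsecv": rmsecv}
    return best

def toy_trial(method, fraction):
    """Hypothetical stand-in: error is lowest at 15% removal with 'sratios'."""
    penalty = 0.01 if method == "vip" else 0.0
    return [0, 1, 2], abs(fraction - 0.15) + penalty

if __name__ == "__main__":
    print(best_trial(toy_trial))
```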

The iterative improvement is carried out for each of these fractions, and only the result with the best RMSECV is used. To avoid overfitting, the setting relativeimprovementtocontinue can require that the model improve by a certain fraction before more variables are removed. The default is zero, so the algorithm continues as long as the results do not get worse.
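The continuation rule can be written as a one-line comparison. The Python sketch below is one plausible reading of the setting, not the toolbox code: the function name `should_continue` is hypothetical, and treating an improvement exactly equal to the threshold as sufficient is an assumption.

```python
def should_continue(previous_rmsecv, current_rmsecv, relative_improvement=0.0):
    """Return True if RMSECV dropped by at least the required relative amount."""
    # With relative_improvement = 0 this reduces to "current is not worse".
    required = previous_rmsecv * (1.0 - relative_improvement)
    return current_rmsecv <= required

if __name__ == "__main__":
    # A 3% improvement is not enough when 5% is required.
    print(should_continue(1.00, 0.97, relative_improvement=0.05))
    # With the default of 0, an unchanged RMSECV still continues.
    print(should_continue(1.00, 1.00))
```

A nonzero setting makes the search stop earlier, so variables are removed only when doing so gives a clear gain in RMSECV.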

Each trial also stops after a maximum number of iterations, set by the option maxiter. The default is 20.


Inputs

  • X = X-block may be either a matrix or a dataset object.
  • Y = Y-block may be either a matrix or a dataset object.
  • maxlv = the maximum number of latent variables to be used in the PLS models.
  • options = options structure for selectvars. (optional).

Outputs

The output is a results structure with the following fields:

  • use: The final selected indices which gave the best model.
  • fit: The RMSECV for the selected indices.
  • lvs: The number of latent variables which gave the best fit.
  • intervals: A cell array containing the indices used in each interval.
  • rmsecv: The rmsecv in the last selection cycle for all intervals.
  • numlv: The number of latent variables used in the models which gave the RMSECV values returned in rmsecv.
  • figh: Figure handle of the plot that is produced if options.plots = 'final'.

Options

  • options = options structure containing the fields:
  • display: [ {'off'} | 'on'] Governs screen display.
  • plots: [{'final'}|'off'], governs level of plotting.
  • method: [{'auto'}|'vip'|'sratios'], defines the metric used for variable selection in the regression models:
      • auto mode: the better result of vip and sratios is chosen automatically. The best fraction to remove is also chosen automatically (hence, fractiontoremove is not used).
      • vip mode: uses the Variable Importance in Projection algorithm.
      • sratios mode: uses Selectivity Ratios.
  • fractiontoremove: (default = 0.1) Determines the fraction size to remove with each iteration.
  • relativeimprovementtocontinue: (default = 0) Relative improvement in RMSECV that each iteration must achieve for the variable selection to continue. For example, with 0.05 a 5% improvement is required. With 0, the search for a better model continues as long as the current RMSECV is not worse than the prior one.
  • cvsplit: [method, splits] (default = ['vet' 6]) determines crossval method [{'vet'}|'loo'|'con'|'rnd'] and the number of splits.
  • cvopts: options structure for CROSSVAL function.
  • plsopts: options structure for PLS function.
  • preprocessing: {[] []} preprocessing structures for x and y blocks used in PLS and crossval.
  • maxiter: (default = 20) the maximum number of iterations before terminating the iteration loop.
  • waitbar: [{'on'} | 'off'] Governs the use of a waitbar to show progress.

See Also

gaselctr, genalg, ipls, rpls, sratio, vip, crossval, Interval PLS (IPLS) for Variable Selection