Ipls

From Eigenvector Research Documentation Wiki
Revision as of 06:38, 9 March 2022 by Rasmus (talk | contribs) (→‎Description: Nørgaard spelled wrong)

Purpose

IPLS Interval PLS and forward/reverse MLR variable selection.

Synopsis

results = ipls(X,Y,int_width,maxlv,options)
results = ipls(X,Y,int_width,maxlv,numintervals,options)
[use,fit,lvs,intervals,intcv,intlv] = ipls(X,Y,int_width,maxlv,options)

Description

Performs forward or reverse selection of variable windows ("intervals") based on the RMSECV obtained for each individual window. The interval which provides the lowest RMSECV is selected. Multiple windows can also be selected iteratively by changing the options.numintervals option. The "algorithm" option allows this function to behave as an IPLS or IPCR algorithm or as a forward/reverse MLR variable selection algorithm. The default is PLS, but options.algorithm = 'mlr' switches to MLR mode. See the other options below, and Interval PLS for Variable Selection for a description of the algorithm and its use.
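The idea of scoring each window by its own RMSECV and keeping the best one can be sketched in plain MATLAB. This is a conceptual illustration only, not the ipls source: ordinary least squares stands in for PLS, a simple venetian-blinds split stands in for the toolbox cross-validation, and all data and variable names are made up for the example. It relies on implicit expansion (MATLAB R2016b or later).

```matlab
% Conceptual sketch of forward interval selection (NOT the ipls source).
% Assumptions: least squares replaces PLS; a venetian-blinds split
% replaces the toolbox cross-validation; data are synthetic.
rng(1);
n = 30;  nvars = 20;  int_width = 5;  nsplits = 5;
X = randn(n,nvars);
y = X(:,6:10)*[1;2;3;2;1] + 0.05*randn(n,1);   % signal lives in variables 6-10

% Non-overlapping intervals: one row of variable indices per interval.
nint      = floor(nvars/int_width);
intervals = reshape(1:nint*int_width, int_width, nint)';

fold   = mod((0:n-1)', nsplits) + 1;           % venetian-blinds fold labels
rmsecv = zeros(nint,1);
for k = 1:nint
    cols  = intervals(k,:);
    press = 0;                                 % accumulated squared CV error
    for f = 1:nsplits
        tst = (fold == f);
        trn = ~tst;
        mx  = mean(X(trn,cols),1);             % center with training data only
        my  = mean(y(trn));
        b   = (X(trn,cols) - mx) \ (y(trn) - my);
        yhat  = (X(tst,cols) - mx)*b + my;
        press = press + sum((y(tst) - yhat).^2);
    end
    rmsecv(k) = sqrt(press/n);
end

[fit, best] = min(rmsecv);                     % lowest RMSECV wins
use = intervals(best,:);                       % selected variable indices
```

Selecting more than one interval would repeat the loop with the already chosen variables kept in every candidate model, which is the role described for options.numintervals.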

Inputs are (X,Y) the X and Y data, (int_width) the interval (i.e., window) width in variables, and (maxlv) the maximum number of latent variables to use in any model (maxlv has no effect if options.algorithm = 'mlr'). Note that excluding a variable in X will prevent it from being used in any model.
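A minimal usage sketch of the first calling form follows. The synthetic data, the window width of 10, and the limit of 5 latent variables are illustrative assumptions, and PLS_Toolbox must be on the MATLAB path. Retrieving defaults with ipls('options') follows the usual PLS_Toolbox convention and is assumed to apply to this function.

```matlab
% Synthetic example data (assumed for illustration only).
rng(0);                                        % reproducible random numbers
X = randn(40,100);                             % 40 samples, 100 variables
Y = X(:,31:40)*ones(10,1) + 0.1*randn(40,1);   % y depends on variables 31-40

int_width = 10;                                % window width in variables
maxlv     = 5;                                 % maximum number of latent variables

% Default options (assumed toolbox convention), with output silenced.
options         = ipls('options');
options.display = 'off';
options.plots   = 'none';

results = ipls(X,Y,int_width,maxlv,options);
disp(results.use)                              % indices of the selected variables
```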

If options.plots is 'final', a plot is given of the minimum RMSECV versus window center. Windows which were used are indicated in blue, windows which were excluded are indicated in red. The number of latent variables (LVs) used to assess each interval (the model size that gives the indicated RMSECV) is shown at the bottom of each interval's bar, inside the axes. The best RMSECV that can be obtained using all intervals is shown as a dashed red line (all-interval RMSECV). The number of LVs used in this model is shown on the right of the axes. If this number of LVs (all-interval model) is different from the number used for the best model of the selected interval(s) (selected-interval model) then a dashed magenta line will indicate the RMSECV obtained when using all intervals but at the selected-interval model size. The mean sample is superimposed on the plot for reference.

For more information on this method see the following paper:

L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, and S.B. Engelsen, “Interval Partial Least-Squares Regression (iPLS): A Comparative Chemometric Study with an Example from Near-Infrared Spectroscopy” Applied Spectroscopy, 54 (3), 413-419, 2000

Inputs

  • X = the X-block.
  • Y = the Y-block.
  • int_width = the interval (window) width in variables.
  • maxlv = the maximum number of latent variables to use in any model.

NOTE: Excluding a variable in X will prevent it from being used in any model.
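How int_width and the spacing between windows (options.stepsize) tile the variables can be pictured with a short plain-MATLAB sketch. This is a hypothetical illustration: it assumes windows start at variable 1 and keeps only full-width windows, and the exact edge handling inside ipls may differ.

```matlab
% Hypothetical illustration of window layout; the edge handling inside
% ipls is not specified here and may differ.
nvars     = 20;     % number of X variables (assumed)
int_width = 5;      % window width in variables
stepsize  = [];     % empty -> default, non-overlapping windows

if isempty(stepsize)
    stepsize = int_width;                      % default: windows do not overlap
end

starts  = 1:stepsize:(nvars - int_width + 1);  % first variable of each full window
windows = zeros(numel(starts), int_width);
for k = 1:numel(starts)
    windows(k,:) = starts(k):(starts(k) + int_width - 1);
end
disp(windows)                                  % one row of indices per window
```

Setting stepsize to a value smaller than int_width in this sketch produces overlapping windows.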

Outputs

When a single output is requested, the output is a structure with the following fields:

  • use: the final selected indices which gave the best model.
  • fit: the RMSECV for the selected indices.
  • lvs: the number of latent variables which gives the best fit.
  • intervals: a matrix containing the indices used for each interval.
  • intcv: the RMSECV in the last selection cycle for all intervals (these values were used to select the last interval).
  • intlv: the number of latent variables used in the model which gave the RMSECV values returned in intcv.
  • figh: figure handle of the plot if options.plots = 'final'.

Optionally, when multiple outputs are requested, these variables are returned as separate outputs (not in structure format) in the order shown above.
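The multiple-output form can be used as sketched below, for example to reduce X to the selected variables before building a final model. The data are synthetic and assumed for illustration, PLS_Toolbox must be on the path, and ipls('options') is the assumed toolbox convention for retrieving default options.

```matlab
% Synthetic data (assumed); requires PLS_Toolbox on the MATLAB path.
rng(0);
X = randn(40,100);
Y = X(:,31:40)*ones(10,1) + 0.1*randn(40,1);

options         = ipls('options');   % assumed convention for default options
options.display = 'off';
options.plots   = 'none';

% Same quantities as the structure fields, in the documented order.
[use,fit,lvs,intervals,intcv,intlv] = ipls(X,Y,10,5,options);

Xsel = X(:,use);                     % keep only the selected variables
disp(size(Xsel))
```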

Options

  • options = options structure containing the fields:
  • display: [ 'off' | {'on'} ], governs level of display to command window,
  • plots: [ 'none' | {'final'} ], governs level of plotting,
  • mode: [{'forward'} | 'reverse' ] Defines action to be performed with each interval.
      • 'forward' mode: the RMSECV calculated for each interval represents how well the y-block can be predicted using ONLY the variables included in the interval.
      • 'reverse' mode: the RMSECV calculated for each interval represents how well the y-block can be predicted when the given interval of variables is removed from the range of included X variables.
  • algorithm: [{'pls'} | 'pcr' | 'mlr' ] Defines regression algorithm to use. Selection is done for the specific algorithm. Note that when MLR is used, input (int_width) is most often = 1 (single variable per window). iPLSDA (discriminant analysis) mode can be invoked by using algorithm='pls' and passing a logical y-block (see class2logical).
  • numintervals: { [1] } Number of intervals to select or remove. If numintervals is Inf, intervals are iteratively selected and added/removed until no improvement in RMSECV is observed. NOTE: this can also be set by passing a scalar value before, or in place of, the options structure. When passed this way, any value passed in the options structure will be ignored.
  • mustuse: [ ] A vector of variable indices which MUST be used in all models. These variables will always be included in any model, whether or not they are included in the current interval.
  • stepsize: [ ] Distance between interval centers. An empty matrix gives the default spacing in which intervals do not overlap (stepsize = int_width).
  • preprocessing: defines preprocessing and can be one of the following:
      • (a) One of the following strings:
          • 'none'  : no preprocessing {default}
          • 'meancenter' : mean centering
          • 'autoscale'  : autoscaling
      • (b) A single preprocessing structure defined using the function preprocess. The same preprocessing structure will be used on both the X and Y blocks.
      • (c) A cell containing two preprocessing structures {pre pre}, one for the X block and one for the Y block.
  • cvi: {'vet' [ ] 1} Three element cell indicating the cross-validation leave-out settings to use {method splits iterations}. For valid modes, see the "cvi" input to crossval. If splits (the second element in the cell) is empty, the square root of the number of samples will be used. cvi can also be a vector (non-cell) of indices indicating leave-out groupings (see crossval for more info).
  • plottype: [ 'bar' | {'patch'}] Governs type of plot to make. Bar plots may not handle non-linear axis scales well, but allow for backwards compatibility.
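Several of the options above can be combined as in the following sketch. The data and the specific option values are assumptions chosen for illustration, PLS_Toolbox must be on the path, and ipls('options') is again the assumed convention for obtaining default options.

```matlab
% Synthetic data (assumed); requires PLS_Toolbox on the MATLAB path.
rng(0);
X = randn(40,100);
Y = X(:,31:40)*ones(10,1) + 0.1*randn(40,1);

options               = ipls('options');  % assumed convention for defaults
options.display       = 'off';
options.plots         = 'none';
options.mode          = 'reverse';        % score each interval by removing it
options.numintervals  = Inf;              % continue until RMSECV stops improving
options.preprocessing = 'meancenter';     % same preprocessing for X and Y
options.cvi           = {'vet' 5 1};      % {method splits iterations}
options.mustuse       = [1 2];            % variables kept in every model
options.stepsize      = 5;                % centers 5 apart: windows of width 10 overlap

results = ipls(X,Y,10,5,options);

% Alternative: pass the number of intervals as a scalar before the options
% structure; per the note above it overrides options.numintervals.
results2 = ipls(X,Y,10,5,2,options);
```

For the discriminant (iPLSDA) use described under algorithm, the same call would be made with algorithm = 'pls' and a logical y-block, for instance one built by comparing a class vector against a class label.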

See Also

selectvars, gaselctr, genalg, sratio, vip, Interval PLS (IPLS) for Variable Selection, rpls, Sample and Variable Selection, Variable Selection