Difference between revisions of "Auto"

From Eigenvector Research Documentation Wiki
Latest revision as of 10:47, 5 December 2019

Purpose

Autoscales a matrix to mean zero and unit variance.

Synopsis

[ax,mx,stdx,msg] = auto(x,options)
[ax,mx,stdx,msg] = auto(x,offset)

Description

[ax,mx,stdx] = auto(x); autoscales a matrix (x) and returns the resulting matrix (ax) with mean-zero, unit-variance columns, a vector of means (mx), and a vector of standard deviations (stdx) used in the scaling. Output (msg) returns any warning messages. If missing data (NaNs) are found, the available data are autoscaled provided the fraction missing does not exceed the thresholds given by the matrix_threshold and column_threshold options. (mx) and (stdx) can be used to scale new data (see SCALE). Optional input (offset) is a scalar offset added to the standard deviations to avoid division by zero. Optional input (options) is described below.
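The arithmetic described above can be illustrated with a minimal Python sketch. It is not the PLS_Toolbox implementation: the function names `autoscale` and `apply_scaling` are hypothetical, the sample standard deviation (n - 1 denominator) is an assumption, and missing data are not handled.

```python
import statistics

def autoscale(rows):
    """Column-wise autoscaling of a list of equal-length rows.

    Returns (scaled_rows, means, stds), loosely mirroring (ax, mx, stdx).
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Assumption: sample standard deviation with an (n - 1) denominator.
    stds = [statistics.stdev(col) for col in columns]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)]
              for row in rows]
    return scaled, means, stds

def apply_scaling(rows, means, stds):
    """Scale new rows with previously computed means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]

# Three samples (rows) and two variables (columns).
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
ax, mx, stdx = autoscale(x)

# New data are scaled with the calibration statistics, not their own.
new_scaled = apply_scaling([[2.0, 30.0]], mx, stdx)
```

Each column of `ax` then has mean zero and unit sample variance, and a new sample that sits exactly at the calibration means scales to zeros.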

Options

options = a structure array with the following fields:

  • offset: scalar offset added to the standard deviation before scaling {default = 0}. This can be used to avoid divide-by-zero errors.
  • display: [ {'off'}| 'on' ] governs level of display to the command window.
  • matrix_threshold: fraction of missing data allowed based on entire matrix (x) {default = 0.15}.
  • column_threshold: fraction of missing data allowed based on a single column {default = 0.25}.
  • algorithm: [ {'standard'} | 'robust' ] scaling algorithm. 'robust' uses MADC for scaling and the median instead of the mean, and should be used with robust techniques. The MADC function is a scale estimator given by the Median Absolute Deviation (with finite sample correction) and is part of the LIBRA package included in PLS_Toolbox/Solo. It is defined as
   madc(x) = b_n * 1.4826 * med(|x_i - med(x)|)

with b_n a small-sample correction factor (b_n = n/(n-0.8) for n > 9) that makes the MAD unbiased at the normal distribution.

  • stdthreshold: [ 0 ] scalar or vector of standard deviation threshold values. If a standard deviation is below its corresponding threshold value, the threshold value is used in lieu of the actual value. Note that the actual standard deviation is always returned, whether or not it exceeds the threshold. A scalar value is used as a threshold for all variables.
  • badreplacement: [ 0 ] value to use in place of standard deviation values of 0 (zero). Typical values and their effects:
      0 = Any value in the given variable is set to zero. The variable is effectively excluded (but still expected by the model). This is also the behavior when badreplacement = inf.
      1 = Values different from the mean of the given variable are flagged in Q residuals with no reweighting.
      Values >0 and <inf give the variable a different weighting in the Q residuals (values >1 down-weight the bad variables for Q residual calculations; values <1 up-weight them).
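The MADC definition and the stdthreshold rule listed above can be sketched in Python as follows. This is an illustration only, not the LIBRA or PLS_Toolbox code: the names `madc`, `robust_scale_column` and `effective_std` are hypothetical, and because the page gives b_n only for n > 9, the sketch refuses smaller samples rather than guess a correction factor.

```python
import statistics

def madc(values):
    """Median absolute deviation with the finite-sample correction b_n."""
    n = len(values)
    if n <= 9:
        # The text defines b_n only for n > 9, so smaller samples are rejected.
        raise ValueError("this sketch only covers samples with n > 9")
    b_n = n / (n - 0.8)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return b_n * 1.4826 * mad

def robust_scale_column(values):
    """Center by the median and scale by MADC (the 'robust' idea, sketched)."""
    med = statistics.median(values)
    scale = madc(values)
    return [(v - med) / scale for v in values], med, scale

def effective_std(std, threshold=0.0):
    """Apply the stdthreshold rule: use the threshold when std falls below it."""
    return threshold if std < threshold else std

# Ten well-behaved values plus one gross outlier.
column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 1000.0]
scaled, center, spread = robust_scale_column(column)
classical_std = statistics.stdev(column)
```

In this example the single outlier inflates the classical standard deviation by roughly two orders of magnitude, while the median and MADC are barely affected by it.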

If the input (offset) is a scalar, it is used as the offset value, with all other options set to their default values.

The optional input offset is added to the standard deviations before scaling and can be used to suppress low-level variables that would otherwise have standard deviations near zero.
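A small Python sketch of the effect of the offset, under the same assumptions as the description (the offset is added to the standard deviation before dividing). The function name `scale_with_offset` is hypothetical, and the sample standard deviation is an assumption.

```python
import statistics

def scale_with_offset(column, offset=0.0):
    """Mean-center a column and divide by (standard deviation + offset)."""
    mean = statistics.mean(column)
    # Assumption: sample standard deviation with an (n - 1) denominator.
    std = statistics.stdev(column)
    return [(v - mean) / (std + offset) for v in column]

# A low-level variable whose standard deviation is close to zero.
low_level = [0.001, 0.002, 0.003]
plain = scale_with_offset(low_level)            # blown up to unit variance
damped = scale_with_offset(low_level, 1.0)      # stays small

# A constant variable has zero standard deviation; an offset avoids
# dividing by zero.
constant = scale_with_offset([5.0, 5.0, 5.0], 0.1)
```

Without the offset, the low-level variable is inflated to the same variance as every other column; with it, the variable keeps a small magnitude, and the constant column scales to zeros instead of raising a division error.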

The default options can be retrieved using: options = auto('options');

See Also

gscale, gscaler, medcn, mncn, normaliz, npreprocess, regcon, rescale, scale, snv, madc