Simca

From Eigenvector Research Documentation Wiki
Revision as of 17:28, 26 August 2021 by Bob (talk | contribs)

Purpose

Create Soft Independent Modeling of Class Analogy (SIMCA) models for classification.

Synopsis

model = simca(x,ncomp,options) %creates SIMCA model from dataset x
model = simca(x,classid,labels) %models double x with class id
model = simca(x, modelCell, options) %creates SIMCA model from cell array of PCA submodels built from dataset x
pred = simca(x,model,options); %predictions on x with model
simca % Launches an Analysis window with SIMCA as the selected method.

Description

The function SIMCA develops a SIMCA model, which is a collection of PCA models, one for each class of data in the data set. The model is used for supervised pattern recognition.

When optional input ncomp is not supplied in the first syntax example, SIMCA operates in an interactive mode. In this mode, the user is prompted for basic preprocessing and number of components to keep in each model. Individual models are built for each class and the PCA model of each class is cross-validated (using leave-one-out if the number of samples in the class is <= 20 or contiguous blocks if more than 20 samples in a given class).
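The per-class choice of cross-validation method described above can be sketched in plain MATLAB. This is only an illustration of the stated rule, not the toolbox's internal code; the class identifiers and the variable names (classid, cvmethod) are made up for the example.

```matlab
% Made-up class identifiers: 12 samples in class 1, 25 samples in class 2
classid  = [ones(1,12) 2*ones(1,25)];
classes  = unique(classid);
cvmethod = cell(size(classes));

for k = 1:numel(classes)
  nk = sum(classid == classes(k));      % number of samples in this class
  if nk <= 20
    cvmethod{k} = 'leave-one-out';      % small class
  else
    cvmethod{k} = 'contiguous blocks';  % more than 20 samples
  end
  fprintf('class %d: %d samples, %s\n', classes(k), nk, cvmethod{k});
end
```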

For more automatic SIMCA model building, please see the pca or simcasub functions.

Inputs

  • x = M x N matrix of class "dataset" where class information is extracted from x.class{1,1} and labels from x.label{1,1}, or an M x N data matrix of class "double"
  • classid = M x 1 vector of class identifiers where each element is an integer identifying the class number of the corresponding sample.
  • model = when making predictions, input model is a SIMCA model structure.

Optional Inputs

  • ncomp = integer, number of PCs to use in each model. This is rarely known a priori. When ncomp=[] {default} the user is queried for the number of PCs for each class.
  • labels = a character array with M rows that is used to label samples on Q vs. T2 plots, otherwise the class identifiers are used.

  • options = a structure array discussed below.

Outputs

  • model = model structure array with the following fields:
    • modeltype: 'SIMCA',
    • datasource: structure array with information about input data,
    • date: date of creation,
    • time: time of creation,
    • info: additional model information,
    • description: cell array with text description of model,
    • submodel: structure array with each record containing the PCA model of each class (see PCA), and
    • detail: sub-structure with additional model details and results.
  • pred = a structure, similar to model, that contains the SIMCA predictions. Additional or differing fields in pred are:
    • rtsq: the reduced T2 (T2 divided by its 95% confidence limit), where each column corresponds to a class in the SIMCA model,
    • rq: the reduced Q (Q divided by its 95% confidence limit), where each column corresponds to a class in the SIMCA model,
    • nclass: the predicted class number (class to which the sample was closest when considering T2 and Q combined),
    • submodelpred: structure array with each record containing the PCA model predictions for each class (see PCA), and
    • classification: information about the classification of X-block samples (see description at Standard Model). For more information on class predictions, see Sample Classification Predictions.
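As a rough illustration of how the reduced statistics relate to the predicted class, the sketch below divides raw Q and T2 values by their confidence limits and picks the closest class using the combined distance. All numbers and variable names (q, tsq, qlim, tsqlim) are made up for the example; in practice rq, rtsq and nclass are read directly from pred.

```matlab
% Made-up statistics for 3 samples (rows) against 2 class submodels (columns)
q      = [0.8 4.0; 3.5 0.6; 0.9 1.2];    % Q residuals
tsq    = [2.0 9.0; 12.0 3.0; 5.0 6.0];   % Hotelling T2 values
qlim   = [1.0 2.0];                      % hypothetical 95% Q limit per class
tsqlim = [10.0 8.0];                     % hypothetical 95% T2 limit per class

% Reduced statistics: each column is divided by that class's limit
rq   = q   ./ repmat(qlim,   size(q,1),   1);
rtsq = tsq ./ repmat(tsqlim, size(tsq,1), 1);

% Combined distance, then the closest class for each sample
d = sqrt(rq.^2 + rtsq.^2);
[dmin, nclass] = min(d, [], 2);          % nclass is a column index here
disp(nclass')
```

Note that nclass in this sketch is the index of the closest submodel column, which matches the class number only when the classes are numbered 1, 2, ... in column order.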


Note: Calling simca with no inputs starts the graphical user interface (GUI) for this analysis method.

Cross-validation

Using SIMCA from the SIMCA Analysis window does not perform cross-validation at the SIMCA model level. The "Cross-Validation" entry in the "Analysis Flowchart" only selects the cross-validation method that is applied when building the PCA sub-models.

The specific cross-validation settings used for an individual PCA sub-model can be modified to the user's preference in the PCA Analysis window while that PCA model is being fitted, along with any other settings for that PCA model, such as pre-processing method or the number of PCs to use.

Options

options = a structure array with the following fields:

  • display: [ {'on'} | 'off' ], governs level of display,
  • plots: ['none' | {'final'} ], governs level of plotting,
  • staticplots: ['no' | {'yes'} ], produce old-style "static" plots,
  • rule: [{'combined'} | 'T2' | 'Q' | 'both'], governs how a sample's distance from sub-class is measured. 'Q' means reduced Q is used as distance measure. 'T2' means reduced T2 is used. 'both' means both T2 and Q are used (if either is outside the limit, the sample will be considered outside the class). 'combined' uses sqrt(Q^2 + T2^2), each reduced, as the distance measure,
  • preprocessing: { [ ] }, a preprocessing structure (see preprocess) that is used to preprocess data in each class.
  • classset: [ 1 ] indicates which class set in x to use.

Note: with display='off', plots='none', ncomp set to a positive integer, and preprocessing specified, SIMCA can be run without command line interaction.
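A minimal sketch of such a non-interactive call is shown below, using the arch demo data that appears elsewhere on this page. It assumes that fields left out of the options structure fall back to their defaults and that a single preprocessing structure is accepted for the preprocessing field; the choice of 3 components and of autoscaling is purely illustrative.

```matlab
load arch                        % demo data set shipped with the toolbox
archCal = arch(1:63,:);          % calibration samples with known classes

% Options that suppress prompts, text output and plots
options = struct;
options.display       = 'off';
options.plots         = 'none';
options.preprocessing = preprocess('default', 'autoscale');

ncomp = 3;                       % illustrative value, not a recommendation
model = simca(archCal, ncomp, options);   % build the SIMCA model
pred  = simca(archCal, model, options);   % apply it to the same samples

disp(pred.nclass)                % predicted class for each sample
```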

Building with cell array of PCA submodels

Individual PCA submodels, each built from one or more sample classes, are combined into a cell array to generate the SIMCA model. The following example shows that:

  • classes may be combined in a single submodel,
  • preprocessing may be independent for each submodel, and
  • the included variables may be independent for each submodel.

load arch
archCal = arch(1:63,:);
% build three separate PCA models
% one for classes K and BL
% one for class SH
% one for class AN
indsKandBL = ismember(archCal.class{1,1}, [1 2]);
indsSH     = ismember(archCal.class{1,1}, 3);
indsAN     = ismember(archCal.class{1,1}, 4);

% preprocessing may be completely independent for the PCA submodels
myPP{1} = preprocess('default', 'autoscale');
myPP{2} = preprocess('default', 'msc', 'autoscale');
myPP{3} = preprocess('default', 'snv', 'autoscale');

myPCAmdl   = evrimodel('pca');
myPCAmdl.x = archCal;

myPCAmdl.x.include{1}          = find(indsKandBL);
myPCAmdl.ncomp                 = 4;
myPCAmdl.options.preprocessing = myPP(1);
myPCAmdlKandBL                 = myPCAmdl.calibrate;

myPCAmdl.x.include{1}          = find(indsSH);
% the number of included variables may also be independent 
% for the PCA submodels
myPCAmdl.x.include{2}          = setdiff(1:10, 9);
myPCAmdl.ncomp                 = 3;
myPCAmdl.options.preprocessing = myPP(2);
myPCAmdlSH                     = myPCAmdl.calibrate;

myPCAmdl.x.include{1}          = find(indsAN);
myPCAmdl.x.include{2}          = setdiff(1:10, [2 9]);
myPCAmdl.ncomp                 = 2;
myPCAmdl.options.preprocessing = myPP(3);
myPCAmdlAN                     = myPCAmdl.calibrate;

mySIMCAmdl = simca(archCal, {myPCAmdlKandBL myPCAmdlSH myPCAmdlAN});
mySIMCAmdl.plotscores

See Also

analysis, cluster, crossval, discrimprob, knn, modelselector, pca, plsda, svmda, SIMCA_Model_Builder_GUI