Standard Model Structure

From Eigenvector Research Documentation Wiki

Introduction

Many higher-level PLS_Toolbox functions such as PCA and PLS output the results of analyses in a standard model format. This format, the Standard Model Structure, contains the key results of an analysis, as well as information needed to reproduce those results. The Standard Model Structure is typically used by PLS_Toolbox functions to review the model calibration or apply it to new data (for example, predicting on unknown data). However, the experienced user can access much of the information in the model from the command line to perform various tasks. This section describes the standard model structure in general and also shows some of these advanced applications.

As of Version 7.0 of PLS_Toolbox and Solo, models are accessed as EVRIModel Objects. In addition to providing the same content access described below, these objects provide easy-to-use methods and properties for building, manipulating, and reviewing models from the MATLAB command line, scripts, and functions.

Using Structures

As the name implies, Standard Model Structures are created with a MATLAB variable type called a “structure.” In a MATLAB structure, a single variable contains one or more named "fields" (consider it like a variable containing other variables). These fields are accessed by giving the base variable name (here called "model") followed by a period and the name of the field. For example:

model.date

returns the contents of the date field from the model.

In the case of Standard Model Structures, the fields which show up are standardized to help the user identify useful parts of any model. Some of the fields of the Standard Model Structure have self-explanatory contents, such as description strings; other fields contain data matrices with varying contents. Additionally, some Standard Model Structure fields contain MATLAB cell arrays. A cell array is a container for matrices or strings that can be indexed into. When cell arrays are used, they are denoted either by the word "cell" or by surrounding the size or content with curly brackets, { }. For more information about indexing into cell arrays, see the MATLAB documentation on cells; in general, they are indexed using curly brackets giving the indices of the cell to extract, e.g., mycell{2,3}

One special field in all Standard Model Structures is the detail field. This field contains a separate structure with lower-level details about the model. The contents of this field vary between model types and typically contain important details about the model. The contents are accessed by addressing first the top-level variable (e.g., “model”), then the detail field, and finally the sub-field to extract. For instance, to access the sum-of-squares captured information (ssq table) for a PCA model, you would use the following:

model.detail.ssq
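
As a minimal sketch of these access patterns, the following builds a small stand-in structure by hand and then reads a top-level field, the contents of a cell, and a sub-field of detail. The variable name mock and all of its values are hypothetical; this is not output from any PLS_Toolbox function.

```matlab
% Hypothetical stand-in for a model structure; values are illustrative only.
mock = struct();
mock.modeltype   = 'PCA';
mock.date        = '29-Mar-2004';
mock.description = {'line one'; 'line two'; 'line three'};   % 3x1 cell of strings
mock.detail      = struct('ssq', magic(4));                  % nested structure

d    = mock.date;               % top-level field
s2   = mock.description{2};     % curly braces extract the contents of a cell
ssq  = mock.detail.ssq;         % field of the nested detail structure
col2 = mock.detail.ssq(:,2);    % ordinary matrix indexing on that field

names = fieldnames(mock);       % list the fields that are present
```

Calling fieldnames on a real model is a quick way to see which of the fields described in this section exist for a given model type.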

Although the exact fields present in any model vary depending on the model type, the following example, a PCA model, shows many of the general model characteristics.

PCA Example

model = 
       modeltype: 'PCA'
      datasource: {[1x1 struct]}
            date: '29-Mar-2004'
            time: [2004 3 29 8 57 37.4400]
            info: 'Scores are in cell 1 of the loads field.'
           loads: {2x1 cell}
            pred: {[]}
            tsqs: {2x1 cell}
    ssqresiduals: {2x1 cell}
     description: {3x1 cell}
          detail: [1x1 struct]

model.datasource{1}=
       name: 'Wine'
     author: 'B.M. Wise'
       date: [2001 5 14 13 47 53.9500]
    moddate: [2001 6 6 10 27 23.5100]
       size: [10 5]

model.detail=
             data: {[]}
              res: {[]}
              ssq: [5x4 double]
            rmsec: []
           rmsecv: []
            means: {[]}
             stds: {[]}
           reslim: {[3.1403]}
           tsqlim: {[10.0327]}
           reseig: [3x1 double]
               cv: ''
            split: []
             iter: []
           includ: {2x1 cell}
            label: {2x1 cell}
        labelname: {2x1 cell}
        axisscale: {2x1 cell}
    axisscalename: {2x1 cell}
            title: {2x1 cell}
        titlename: {2x1 cell}
            class: {2x1 cell}
        classname: {2x1 cell}
    preprocessing: {[1x1 struct]}
          options: [1x1 struct]

Description

model

  • modeltype: Contains a keyword defining the type of model.
  • datasource: Contains information about the data used for the model, but does not contain original data. datasource basically contains a subset of the fields of the original DataSet Object. If more than one block of data is used for a model (e.g., PLS models require two blocks), datasource will be size 1xj, where j is the number of blocks used for the model (1x2 in the case of PLS).
  • date: Contains a string description of the date on which the model was calculated.
  • time: Contains a time-stamp vector including, in order, the year, month, day, hour, minute, and second at which the model was created. Useful for distinguishing between different models, as no two models can be created with the same time-stamp.
  • loads: Contains factors recovered for the data. In the case of PCA or PLS models, the rows are the "scores" and "loadings" of the data. For PARAFAC, it is the loadings of each mode. The size of the loads field will be dxj, where d is the number of modes (dimensions) and j is the number of blocks of data used by the model (Table 1). PCA models will be 2x1 (two modes, one block), PLS models will usually be 2x2 (two modes, two blocks), PARAFAC will often be dx1 where d is at least 3. Other model types follow this same pattern.


Table 1. Loads field contents.

Model Type    Field loads contains:
PCA, MCR      { X-block Scores
                X-block Loadings }
PLS, PCR      { X-block Scores      Y-block Scores
                X-block Loadings    Y-block Loadings }
PARAFAC       { X-block Loadings (mode 1)
                X-block Loadings (mode 2)
                X-block Loadings (mode 3)
                ... }


The contents of each cell will generally be an mxk array in which m is the size of the given block for the given mode (e.g., for loads{1,1}, m would be equal to the size of the first mode of the X-block) and k is the number of components or factors included in the model. This format may be different for different model types.

Examples of indexing into the loads field:

For a PCA model, a plot of the scores for all principal components can be made using:
plot(model.loads{1});
To plot the scores for only the first principal component, type:
plot(model.loads{1}(:,1));
Similarly, a plot of the loadings for the first principal component can be made using:
plot(model.loads{2}(:,1));
You can also use the model to export data so that they can be used in other programs. The following could be used to export an XY text-formatted file of the loadings of the first principal component:
xy=[ [1:size(model.loads{2},1)]'  model.loads{2}(:,1) ];
save xy.prn xy -ascii
If the axisscale field for this data contains a useful axis to plot against, the XY data can be created using
xy=[model.detail.axisscale{2}(model.detail.includ{2})' model.loads{2}(:,1)];
Similarly, the axisscale could be used in the plot commands given above.
The goal of PCA is to recreate a data set from its loadings and scores. When the proper number of principal components is used one can reproduce the original data set within the noise level. The reconstructed data is calculated as follows:
data_reconstructed = model.loads{1}*model.loads{2}';
It is also possible to reconstruct the data with fewer factors. The general command for n principal components is:
model.loads{1}(:,1:n) * model.loads{2}(:,1:n)';
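
The reconstruction commands above can be tried without a real model. The sketch below builds a hypothetical stand-in for the loads field using a plain singular value decomposition of mean-centered data (an assumption made for illustration; it is not how PLS_Toolbox builds its models) and then measures the reconstruction error for each number of components.

```matlab
% Hypothetical stand-in for a PCA model, built with a plain SVD.
X  = magic(10);  X = X(:,1:5);                    % 10 samples x 5 variables
Xc = X - repmat(mean(X,1), size(X,1), 1);         % mean-center (assumed preprocessing)
[U,S,V] = svd(Xc, 'econ');
mock.loads = {U*S; V};                            % {scores; loadings}, a 2x1 cell

err = zeros(1,5);
for n = 1:5
    % n-component reconstruction, same form as the command above
    Xhat   = mock.loads{1}(:,1:n) * mock.loads{2}(:,1:n)';
    err(n) = norm(Xc - Xhat, 'fro');              % never increases as n grows
end
```

Note that the product of scores and loadings reproduces the data as it was presented to the decomposition (here, mean-centered), not the raw data.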
  • tsqs: Contains Hotelling's T² values. The T² values are a distance measure of the scores or loadings with respect to their respective origins. The tsqs field is a cell equal in size and similar in definition to the loads field. The field will be size dxj, where d is the number of modes (dimensions) and j is the number of blocks of data used by the model. Row 1 contains the mode 1 T² values (samples), row 2 contains the T² values for mode 2 (variables), etc., while column 1 contains T² values for the X-block and column 2 contains T² values for the Y-block (if any was used for the given model type).
A relatively large value in tsqs{1,1} indicates a sample that has relatively large influence and loading on the PCA model. Similarly, large values in tsqs{2,1} indicate variables with significant influence and loading on the model. These are often used with statistical limits given in the detail field and discussed below.
When a model is applied to new data, the output prediction structure will contain the T² values for the predictions rather than the calibration data.
  • ssqresiduals: Contains the sum-of-squares (SSQ) differences between the original data and the reconstructed data. The field will be size dxj, where d is the number of modes (dimensions) and j is the number of blocks of data used by the model. The first row of the cell contains sum-of-squares differences for each sample (indicating samples which have large out-of-model residuals), while the second row contains sum-of-squares differences for each variable (indicating variables which have large out-of-model residuals). Again, each block is represented by a column, so the X-block sample sum-of-squares residuals can be extracted using:
model.ssqresiduals{1,1}
When a model is applied to new data, the output prediction structure will contain the SSQ residuals for the predictions rather than the calibration data.
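
To make the two sample statistics concrete, the sketch below computes them from scratch for a hypothetical two-component PCA decomposition. It uses the textbook definitions (T² as the sum of squared scores divided by the score variances, and the SSQ residual as the row or column sum of squared residuals); this is an assumption for illustration, and the scaling conventions used inside PLS_Toolbox may differ.

```matlab
% Hypothetical stand-in data; not PLS_Toolbox output.
X  = magic(10);  X = X(:,1:5);
Xc = X - repmat(mean(X,1), size(X,1), 1);         % mean-centered data
[U,S,V] = svd(Xc, 'econ');
n  = 2;                                           % number of components kept
T  = U(:,1:n) * S(1:n,1:n);                       % scores
P  = V(:,1:n);                                    % loadings
lambda = diag(S(1:n,1:n)).^2 / (size(X,1) - 1);   % variance of each score column

E   = Xc - T*P';                                  % out-of-model residuals
qs  = sum(E.^2, 2);                               % SSQ residual per sample (rows)
qv  = sum(E.^2, 1)';                              % SSQ residual per variable (columns)
tsq = sum((T.^2) ./ repmat(lambda', size(T,1), 1), 2);   % textbook T^2 per sample
```

In this sketch, qs and qv play the roles of ssqresiduals{1,1} and ssqresiduals{2,1}, and tsq plays the role of tsqs{1,1}.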
  • classification: Present for classification models (PLSDA, SVMDA, KNN, SIMCA), this field allows the results of classifier methods to be compared more easily. The field is a struct which contains information about the classification of each X-block sample. These are calibration samples in the case of a model and test samples in the case of a pred structure. Note that this classification is based on Y-block predictions. If cross-validation has been applied to the model, then there is another field, model.detail.cvclassification, which holds similar classification information but is based on cross-validation predictions. The classification struct has the following fields:
      • probability = Probability by class (all probabilities). It is an array, nsamples x nclasses, which gives the probability of each sample belonging to each class. The columns are in the order given by the classnums/classids fields. It is calculated as follows:
          SVMDA: probability is provided by the underlying LIBSVM as described in the “Class prediction probabilities” section of SVMDA.
          PLSDA: probability derived from predicting 0/1 for not-in-class/in-class.
          KNN: probability equals the fraction of nearest neighbors which have this class.
          SIMCA: probability for each class obtained from the sample’s Q and T² from the class’s sub-model.
        These calculations are described in more detail at Sample_Classification_Predictions#Class_Probability_Calculation
      • mostprobable = Most probable class (largest probability ONLY). It is a vector of length nsamples with values equal to the most probable class for each sample.
      • inclass = Strict in class (probability > strictthreshold in only ONE class). The probability threshold is set by the option strictthreshold for classifier methods, with default = 0.5. inclass is a vector of length nsamples with values equal to the most probable class for each sample, provided this probability is > strictthreshold. If there is no class with probability > strictthreshold, or there is more than one class with probability > strictthreshold for a sample, then the sample’s inclass value is zero.
      • inclasses = Strict multi-class (probability > strictthreshold in one or more classes). It is an array, nsamples x nclasses, which has value 1 for classes where a sample has probability > strictthreshold, and zero otherwise.
      • classnums = The unique, non-zero classes in the calibration data sorted in ascending order. These are the classes of each column in .probability and .inclasses.
      • classids = The class IDs in the calibration data corresponding to classnums.
Example: for a model predicting 4 classes on 5 samples, model.classification has the following fields:
probability =
0.93  0.36  0.14  0.27
0.87  0.46  0.32  0.18
0.19  0.93  0.30  0.39
0.27  0.21  0.84  0.39
0.20  0.27  0.92  0.82
mostprobable = [ 1 1 2 3 3]
inclass = [ 1 1 2 3 0]
inclasses =
1  0  0  0
1  0  0  0
0  1  0  0
0  0  1  0
0  0  1  1
classnums = [1 2 3 4]
classids = {'Class 1' 'Class 2' 'Class 3' 'Class 4'}
For additional information on how to use these fields, see Sample Classification Predictions.
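
The relationship between probability and the three derived fields can be sketched in a few lines. The code below starts from the probability matrix in the example above and applies the definitions as stated (largest probability, and a strict threshold of 0.5). It is an illustration of those definitions only and is not the toolbox's own implementation; the helper variable onlyone is hypothetical.

```matlab
% Probability matrix copied from the example above (5 samples x 4 classes).
probability = [0.93 0.36 0.14 0.27
               0.87 0.46 0.32 0.18
               0.19 0.93 0.30 0.39
               0.27 0.21 0.84 0.39
               0.20 0.27 0.92 0.82];
classnums       = [1 2 3 4];
strictthreshold = 0.5;                               % default stated above

[~, idx]     = max(probability, [], 2);              % column of the largest probability
mostprobable = classnums(idx');                      % 1 x nsamples

inclasses = double(probability > strictthreshold);   % nsamples x nclasses, 0/1
onlyone   = (sum(inclasses, 2)' == 1);               % exactly one class above threshold
inclass   = mostprobable .* onlyone;                 % zero when none or several qualify
```

The last sample shows why inclass and mostprobable can differ: two classes exceed the threshold, so mostprobable is 3 while inclass is 0.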

model.detail

  • history: A cell array containing an entry log of datetime and changes made to the model.
  • data: Contains the data used to calculate the model. In order to save memory, many methods do not save the primary (X) block in the first cell. See the options for a given analysis method to learn more about storing the X-block data in the model (see, for example, the blockdetails option for PCA and PLS). Y-block data (i.e., model.detail.data{2} ) is usually stored, when available.
  • res: Contains the residuals calculated as the difference between the reconstructed data and the original data, res = y_pred - y_obs. As with the detail.data field, the X-block residuals are not usually stored, due to excessive size. As such, this field may be empty. If using a two-block method such as PLS, the second cell of model.detail.res will usually contain the prediction residuals.
  • rmsec and rmsecv: Contain any root-mean-square error of calibration and cross-validation (respectively) that was calculated for the model. These are vectors equal in length to the maximum number of factors requested for a given cross-validation calculation. A value of NaN will be used for any value which was not calculated.
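
As a small illustration of working with these vectors, the sketch below takes a hypothetical rmsecv vector in which some entries were not calculated and finds the factor count with the lowest error. The values are made up, and picking the minimum is only one simple heuristic, not a recommendation from the toolbox.

```matlab
% Hypothetical RMSECV vector; NaN marks factor counts that were not calculated.
rmsecv = [NaN 0.82 0.41 0.37 0.39 NaN];

calculated   = ~isnan(rmsecv);      % logical mask of entries that were calculated
[best, nfac] = min(rmsecv);         % min skips NaN entries by default
```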
  • ssq: Contains the eigenvalues or other variance-captured information for a model. For PCA models, the first column is the PC number, the second column contains the eigenvalues, the third column contains the eigenvalues scaled to a sum of 100, and the last column contains the cumulative scaled eigenvalues. As such, the eigenvalues can be extracted using:
model.detail.ssq(:,2)
For PLS models, the first column is the Latent Variable number, followed by the variance captured for the X-block, the cumulative variance captured for the X-block, the variance captured for the Y-block, and the cumulative variance captured for the Y-block.
Users should be advised that other model types use this field in different ways. For more information contact Eigenvector Research at helpdesk@eigenvector.com.
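
The PCA column layout described above can be sketched with made-up numbers. The eigenvalues below are hypothetical, and the table is assembled by hand rather than taken from a model; the variable names mockssq and npc90 are likewise only for illustration.

```matlab
% Hypothetical eigenvalues for a 4-component example; not from a real model.
ev  = [30; 15; 4; 1];
pct = 100 * ev / sum(ev);                         % scaled to a sum of 100
mockssq = [(1:numel(ev))'  ev  pct  cumsum(pct)]; % PC number, eigenvalue, %, cumulative %

eigenvalues = mockssq(:,2);                       % same indexing as model.detail.ssq(:,2)
npc90 = find(mockssq(:,4) >= 90, 1, 'first');     % fewest PCs capturing at least 90%
```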
  • reslim: Contains the 95% confidence limit for the block 1 sample Q residuals; i.e., for the values in model.ssqresiduals{1}. For PCA, this value is calculated using:
residuallimit(model,.95)
For some multiway model types this field may contain a structure with additional information. For help on this topic, contact Eigenvector Research at helpdesk@eigenvector.com.
  • tsqlim: A cell containing the 95% confidence limit for the block 1 sample T² values in model.tsqs. For PCA, this value is calculated using:
[n_samples,n_pcs]=size(model.loads{1});
tsqlim(n_samples,n_pcs,.95);
For some multiway model types, this field may contain a structure with additional information. For help on this topic, contact Eigenvector Research at helpdesk@eigenvector.com.
  • reseig: Contains the residual eigenvalues for PCA models only.
  • cv, split, iter: Contain information about the settings used when cross-validating the model, including:
               cv: the encoded cross-validation method 
            split: the total number of splits (test groups)
             iter: the number of iterations (for random mode only)
  • includ, label, labelname, axisscale, axisscalename, title, titlename, class, classname: Contain information from DataSet Object fields. If the original data used to calculate the model were stored in a DataSet Object, certain fields from the DataSet Object are extracted and stored in the model. Each of these fields is extracted for each mode and each block such that model.detail.label{1,2} returns the mode 1 labels for block 2 (Y-block), and model.detail.class{2,1} returns the mode 2 classes for block 1 (X-block). For more information on these fields, see the DataSet Object section at the beginning of this chapter.
Note that although DataSet Objects use the second index of many of these fields to store multiple "sets," the model structures indicate the block number with the second index, and the set number with the third index.
  • preprocessing: Contains a cell of preprocessing structures. The preprocessing structures are used with the function preprocess to process the data for the model. There is one structure in the cell for each block required by the model and each structure may have zero, one or multiple records, with each record representing one preprocessing method applied to the data. Thus, the X-block preprocessing is stored in model.detail.preprocessing{1}.
  • options: Contains the options given as input to the modeling function.
  • cvclassification: If cross-validation has been applied to the model then this field is populated. It holds similar classification information to model.classification but is based on cross-validation predictions instead of normal predictions.
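
The mode-by-block indexing of the DataSet-derived fields can be sketched with a hand-built stand-in. The example below assumes, as the XY export command earlier on this page implies, that the loadings have one row per included variable while the axis scale covers all variables. The structure mock and its values are hypothetical.

```matlab
% Hypothetical stand-in: 6 variables, of which variables 2 and 5 were excluded.
mock.loads            = {[]; (1:4)' * 0.1};       % loadings for the 4 included variables
mock.detail.includ    = {[]; [1 3 4 6]};          % row 2 = mode 2 (variables), column 1 = X-block
mock.detail.axisscale = {[]; 1000:10:1050};       % mode 2 axis scale for all 6 variables

% Axis values for the included variables only, as a column vector.
ax = mock.detail.axisscale{2,1}(mock.detail.includ{2,1})';
xy = [ax  mock.loads{2,1}(:,1)];                  % 4x2 XY table, ready to plot or export
```

The first index selects the mode and the second selects the block, so {2,1} refers to the variables of the X-block.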

Summary

Similar to a DataSet Object, the Standard Model Structure organizes results of an analysis into a logical hierarchy. From this structure, a user can both explore results and generate predictions. As you become more familiar with its organization, you will see how convenient it is to access and manipulate model information using the Standard Model Structure.