https://www.wiki.eigenvector.com/api.php?action=feedcontributions&user=Lyle&feedformat=atomEigenvector Research Documentation Wiki - User contributions [en]2020-02-23T19:17:02ZUser contributionsMediaWiki 1.32.0https://www.wiki.eigenvector.com/index.php?title=ModelBuilding_Biplot&diff=11027ModelBuilding Biplot2020-02-13T20:54:21Z<p>Lyle: Created page with "__TOC__ Table of Contents | Previous ==Scores and Loadings Biplots for a Calibration Model== For most analysis methods,..."</p>
<hr />
<div>__TOC__<br />
<br />
[[TableOfContents|Table of Contents]] | [[ModelBuilding_PlottingLoads|Previous]]<br />
<br />
==Scores and Loadings Biplots for a Calibration Model==<br />
<br />
For most analysis methods, the Analysis window toolbar contains a Scores and loadings biplots button [[Image:Biplot_button.png|35x34px]]. The Scores and loadings biplot plots the sample scores and the variable loadings on the same axes, which helps identify relationships between the samples and the variables. This plot works best with a small number of variables.<br />
<br />
::The figure below shows the biplot for the Arch demo dataset for PC 1 vs PC 2 after removing the unknown samples and building a PCA model with 4 PCs.<br />
<br />
::[[Image:Biplot_arch_all_2.png|600x400px]]<br />
<br />
The blue triangles show the variable loadings and the red diamonds show the sample scores. <br />
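<br />
The idea behind a biplot can be sketched outside the application. The following is a minimal illustration in plain Python, not PLS_Toolbox code: it assumes a data set with only two variables so that the principal components can be found in closed form, and the function name <code>pca_2d</code> and the example data are hypothetical.<br />

```python
import math

def pca_2d(rows):
    """Mean-center a two-variable data set and return (scores, loadings).

    loadings[j][k] is the weight of variable j on principal component k;
    scores[i][k] is the projection of sample i on principal component k.
    """
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(2)]
    centered = [[r[0] - means[0], r[1] - means[1]] for r in rows]
    # Entries of the symmetric 2x2 covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Angle of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2 * b, a - c)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    loadings = [[pc1[0], pc2[0]], [pc1[1], pc2[1]]]
    scores = [[x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1]]
              for x, y in centered]
    return scores, loadings

# Hypothetical data: four samples (rows) measured on two variables (columns).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
scores, loadings = pca_2d(data)

# A biplot draws both sets of points on the same PC 1 vs PC 2 axes.
biplot_points = ([("sample", s[0], s[1]) for s in scores]
                 + [("variable", l[0], l[1]) for l in loadings])
```

Each entry of <code>biplot_points</code> is one marker on the plot: one per sample (its scores) and one per variable (its loadings). Plotting software usually rescales one of the two sets so that both are visible on the same axes; that step is omitted here.<br />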
<br />
'''Note:''' For information about the Plot Controls window and Plot window, see [[PlotControlsWindow_Layout|Plot Controls Window]].</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=File:Biplot_arch_all_2.png&diff=11026File:Biplot arch all 2.png2020-02-13T20:09:05Z<p>Lyle: </p>
<hr />
<div></div>Lylehttps://www.wiki.eigenvector.com/index.php?title=File:Biplot_arch_All.png&diff=11025File:Biplot arch All.png2020-02-13T19:00:29Z<p>Lyle: </p>
<hr />
<div></div>Lylehttps://www.wiki.eigenvector.com/index.php?title=File:Biplot_Arch_PlotControls.png&diff=11024File:Biplot Arch PlotControls.png2020-02-13T18:55:53Z<p>Lyle: </p>
<hr />
<div></div>Lylehttps://www.wiki.eigenvector.com/index.php?title=File:Biplot_Arch.png&diff=11023File:Biplot Arch.png2020-02-13T18:52:58Z<p>Lyle: </p>
<hr />
<div></div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Model_Building:_Calibration_Phase&diff=11022Model Building: Calibration Phase2020-02-13T16:58:56Z<p>Lyle: </p>
<hr />
<div>__TOC__<br />
<br />
[[TableOfContents|Table of Contents]] | [[ModelBuilding_AnalysisPhasesOverview|Previous]] | [[ModelBuilding_PlottingEigenValues|Next]]<br />
<br />
==Building the Model in the Calibration Phase==<br />
<br />
Regardless of the analysis method, building a model in the Calibration phase consists of a series of the same general steps, with the second and third steps being iterative, until you are satisfied with your model. These steps are:<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|1.<br />
<br />
|Loading the calibration data and building the initial model. See [[ModelBuilding_CalibrationPhase#Loading the calibration data and building the initial model|Loading the calibration data and building the initial model]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|2.<br />
<br />
|Changing the number of components or factors that are to be retained in the model and recalculating the model. See [[ModelBuilding_CalibrationPhase#Changing the number of components|Changing the number of components]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|3.<br />
<br />
|Examining the model and refining the model by excluding certain samples and/or variables to enhance the model performance. See [[ModelBuilding_CalibrationPhase#Examining and refining the model|Examining and refining the model]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|4.<br />
<br />
|After you are satisfied with the model, you can then do one of the following:<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Save the model and use it at a later date. <br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Load validation and test data and apply the model immediately. <br />
<br />
|}<br />
<br />
'''Note:''' Decomposition and Clustering analysis methods require only x block data for model building in the Calibration phase. Regression analysis methods require both x block data and y block data. Classification analysis methods require x block data with classes in either X or Y. For simplicity and brevity, this section describes model building during the Calibration phase using default preprocessing methods for a simple PCA model; however, all of the general information in this section is applicable for all analysis methods. <br />
<br />
'''Note:''' Although this section describes model building using default preprocessing methods, remember, for most analyses, it is critical to select the appropriate preprocessing methods for the data that is being analyzed. To review detailed information about preprocessing, see [[ModelBuilding_PreProcessingMethods|Preprocessing Methods]].<br />
<br />
'''Note:''' To review a detailed description of the Calibration phase, see [[ModelBuilding_AnalysisPhasesOverview| "Analysis Phases."]]<br />
<br />
===Loading the calibration data and building the initial model===<br />
<br />
You have a variety of options for opening an Analysis window and loading data. Because these methods have been discussed in detail in other areas of the documentation, they are not repeated here. Instead, a brief summary is provided with a cross-reference to the detailed information. Simply choose the method that best fits your working needs.<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* To open an Analysis window:<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* In the Workspace Browser, click the shortcut icon for the specific analysis that you are carrying out.<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* In the Workspace Browser, click Other Analysis to open an Analysis window, and on the Analysis menu, select the specific analysis method that you are carrying out.<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* In the Workspace Browser, drag a data icon to a shortcut icon to open the Analysis window and load the data in a single step.<br />
<br />
|}<br />
<br />
:'''Note:''' For information about working with icons in the Workspace Browser, see [[WorkspaceBrowser_DataIcons|Icons in the Workspace Browser]].<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* To load data into an open Analysis window:<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Click File on the Analysis window main menu to open a menu with options for loading and importing calibration data.<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Click the appropriate calibration control to open the Import dialog box and select a file type to import.<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Right-click the appropriate calibration control to open a context menu with options for loading and importing data.<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Right-click on an entry for a cached item in the Model Cache pane to open a context menu that contains options for loading the selected cached item into the Analysis window.<br />
<br />
|}<br />
<br />
:'''Note:''' For information about the data manipulation options on the context menu, see [[WorkspaceBrowser_DataIcons|Icons in the Workspace Browser]] or [[WorkspaceBrowser_ImportingData|Importing Data into the Workspace Browser]]. For information about loading items from the Model Cache pane, see [[AnalysisWindow_ModelCachepane|Analysis window Model Cache pane]].<br />
<br />
Also, remember that after you load data into a calibration control, you can place your mouse pointer on the control to view information about the loaded data as well as instructions for working with the control. In the figure below, data has been loaded into the X calibration control for a PCA analysis.<br />
<br />
:''Example of loaded data in the X calibration control for a PCA analysis''<br />
<br />
::[[Image:PCA_analysis_xblock_data_loaded_Cal.png|269x129px]]<br />
<br />
After you have opened the Analysis window and loaded the calibration data, you then calculate the initial model. To calculate the initial calibration model, you can do one of the following:<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* On the Analysis window toolbar, click the Calculate/Apply model icon [[Image:Calculate_Apply_Model_icon.png|25x22px]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Click the Model control.<br />
<br />
|}<br />
<br />
:''Clicking the Model control in the Analysis window''<br />
<br />
::<br />
::<br />
<br />
::[[Image:Clicking_to_calculate_model.png|343x78px]]<br />
::<br />
<br />
After the initial model is calculated, you can place your mouse pointer on the Model control to view general information about the model. To view detailed information about the model, right-click on the Model control and on the context menu that opens, select Show Model Details.<br />
<br />
:''Showing model details in the Analysis window''<br />
<br />
::[[Image:Information_initial_model.png|350x138px]]<br />
<br />
===Changing the number of components===<br />
<br />
For analysis methods which use factors or principal components, you can choose a different number of components or factors to retain in the model and then recalculate the model. To choose a different number of components or factors:<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|1.<br />
<br />
|Click on the appropriate row in the Control panel.<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|2.<br />
<br />
|Recalculate the model by doing one of the following:<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* On the Analysis window toolbar, click the Calculate/Apply model icon [[Image:Calculate_Apply_Model_icon.png|25x22px]].<br />
<br />
|}<br />
<br />
{| style="margin-left:18pt" <br />
<br />
|- valign="top" <br />
<br />
|<br />
* Click the Model control.<br />
<br />
|}<br />
<br />
'''Note:''' By default, the maximum number of principal components or factors that you can retain in a model is 20. You can change this value in the Analysis options settings for the Edit menu. For example, the figure below shows an initial model calculated for a PCA analysis with the suggested value for the number of components to retain set to three.<br />
<br />
:''Initial model calculated for a PCA analysis with number of suggested components = 3''<br />
<br />
::[[Image:Control_pane_PCA.png|359x353px]]<br />
::<br />
<br />
After you select a different number of components or factors to retain, the Model control is marked with an Exclamation icon indicating that you must recalculate the model.<br />
<br />
:''Model marked for recalculation''<br />
<br />
::[[Image:ModelBuilding_CalibrationPhase.23.1.07.jpg|580x254px]]<br />
<br />
===Examining and refining the model===<br />
<br />
After the model is calculated, the Control pane displays the percent variance captured and other statistical information for the model. For certain analyses, the application provides a suggested number of components or factors to retain for the model based on internal tests. For example, the figure below shows an initial model calculated for a PCA analysis with the suggested value for the number of components to retain set to three.<br />
<br />
:''Initial model calculated for a PCA analysis with number of suggested components = 3''<br />
<br />
::[[Image:Control_pane_PCA.png|359x353px]]<br />
::<br />
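<br />
The percent variance captured that the Control pane reports can be related to the eigenvalues of the model. The following plain Python sketch is an illustration only: the eigenvalues are made up, the function names are hypothetical, and the simple cumulative-threshold rule is an assumption used for demonstration. It is not the internal test that the application uses to suggest a number of components.<br />

```python
def percent_variance_captured(eigenvalues):
    """Return (percent per component, cumulative percent)."""
    total = sum(eigenvalues)
    percent = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative

def components_for(cumulative, target_percent):
    """Smallest number of components whose cumulative percent reaches the target."""
    for count, value in enumerate(cumulative, start=1):
        if value >= target_percent:
            return count
    return len(cumulative)

# Hypothetical eigenvalues for a four-component model.
percent, cumulative = percent_variance_captured([5.0, 3.0, 1.5, 0.5])
n_keep = components_for(cumulative, 90.0)
```

With these made-up eigenvalues, the first three components together capture 95% of the variance, so a 90% target is reached with three components.<br />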
<br />
The Analysis window toolbar is updated dynamically with other toolbar buttons based on the selected analysis method. All of these toolbar buttons create plots and other visual aids that assist you in examining and refining the model by excluding certain samples and/or variables to enhance the model performance. Common toolbar buttons include the following:<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* The Plot Eigenvalues button [[Image:Plot_Eigenvalues_icon.png|19x20px]]. See [[ModelBuilding_PlottingEigenValues|Plotting Eigenvalues for a Calibration Model]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* The Plot scores and sample statistics button [[Image:Plot_scores_sample_statistics_icon.png|16x19px]]. See [[ModelBuilding_PlottingScores|Plotting Scores and Statistical Values for a Calibration Model]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* The Plot loads and variable statistics button [[Image:Plot_loads_variable_statistics_icon.png|21x20px]]. See [[ModelBuilding_PlottingLoads|Plotting Loads and Variable Statistics for a Calibration Model]].<br />
<br />
|}<br />
<br />
{| <br />
<br />
|- valign="top" <br />
<br />
|<br />
* The Scores and loadings biplots button [[Image:Biplot_button.png|35x34px]]. See [[ModelBuilding_Biplot|Scores and Loadings Biplots for a Calibration Model]].<br />
<br />
|}<br />
<br />
'''Note:''' All other Analysis window toolbar buttons are specific to an analysis method and therefore, are not discussed in this guide.<br />
<br />
::</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Knn&diff=11019Knn2020-02-06T22:13:28Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
K-nearest neighbor classifier.<br />
<br />
===Synopsis===<br />
<br />
:pclass = knn(xref,xtest,k,options); %make prediction without model<br />
:pclass = knn(xref,xtest,options); %use default k<br />
:model = knn(xref,k,options) %create model<br />
:modelp = knn(xtest,model,k,options) %apply model to xtest
:modelp = knn(xtest,model,options) %apply model to xtest; predictions (equivalent to pclass) in modelp.classification.mostprobable.<br />
:[pclass,closest,votes] = knn(xref,xtest,k,options); %make prediction without model<br />
:[pclass,closest,votes] = knn(xref,xtest,options); %use default k<br />
:[pclass,closest,votes] = knn(xref,k,options); %self-prediction without model<br />
: knn % Launches an Analysis window with KNN as the selected method.<br />
<br />
Please note that the recommended way to build and apply a K-nearest neighbor model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Performs kNN classification where the "k" closest samples in a reference set vote on the class of an unknown sample based on distance to the reference samples. If no majority is found, the unknown is assigned the class of the closest sample (see input options for other no-majority behaviors).<br />
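<br />
The voting scheme described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not the PLS_Toolbox implementation: Euclidean distance, the definition of "majority" as more than half of the k votes, and the function name <code>knn_classify</code> are all assumptions made for the example.<br />

```python
import math
from collections import Counter

def knn_classify(xref, ref_classes, sample, k=3, nomajority="closest"):
    """Return (pclass, closest, votes) for a single unknown sample.

    closest holds the indices of the k nearest reference samples and
    votes holds their classes, nearest neighbor first.
    """
    distances = [math.dist(row, sample) for row in xref]
    closest = sorted(range(len(xref)), key=lambda i: distances[i])[:k]
    votes = [ref_classes[i] for i in closest]
    top_class, top_count = Counter(votes).most_common(1)[0]
    if top_count > len(votes) / 2:      # assumed definition of a majority
        return top_class, closest, votes
    if nomajority == "closest":         # fall back to the nearest neighbor
        return votes[0], closest, votes
    if nomajority == "error":
        raise ValueError("no majority among the k nearest neighbors")
    return nomajority, closest, votes   # any other value is returned as the class

# Hypothetical reference set with three classes.
xref = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 0.0)]
ref_classes = [1, 1, 2, 2, 3]

clear_vote = knn_classify(xref, ref_classes, (0.2, 0.4), k=3)
tie_closest = knn_classify(xref, ref_classes, (2.4, 2.8), k=2)
tie_zero = knn_classify(xref, ref_classes, (2.4, 2.8), k=2, nomajority=0)
```

In the second and third calls the two nearest neighbors belong to different classes, so no majority exists and the result depends on the no-majority behavior that is selected.<br />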
<br />
====Inputs====<br />
<br />
* '''xref''' = a DataSet object of reference data,<br />
<br />
* '''xtest''' = a DataSet object or Double containing the unknown test data.<br />
<br />
====Optional Inputs====<br />
<br />
* '''''model''' '' = an optional standard KNN model structure which can be passed instead of xref (note order of inputs: (xtest,model) ) to apply model to test data.<br />
<br />
* '''k''' = number of nearest neighbors that vote on the class of each sample {default = rank of X-block}.<br />
<br />
====Outputs====<br />
<br />
* '''pclass''' = the voted closest class, if a majority of nearest neighbors were of the same class, or the class of the closest sample, if no majority was found (Only returned if xtest is supplied).<br />
<br />
* '''closest''' = matrix of samples (rows) by closest neighbor index (columns). Will always have k columns indicating which samples were the closest to the given sample (row).<br />
* '''votes''' = matrix of samples (rows) by class numbers voted for (columns). Will always have k columns indicating which classes were voted for by each nearest neighbor corresponding to closest matrix.<br />
<br />
* '''model''' = if no test data (xtest) is supplied, a standard model structure is returned which can be used with test data in the future to perform a prediction. Note that information about the classification of X-block samples is available in the '''classification''' field, described at [[Standard_Model_Structure#model|Standard Model]]. <br />
<br />
For more information on class predictions, see [[Sample Classification Predictions]].<br />
<br />
===Options===<br />
<br />
'''options''' = structure array with the following fields :<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to screen.<br />
<br />
* '''waitbar''' : [ 'off' | 'on' |{'auto'}] governs display of a waitbar when classifying. 'on' always shows a waitbar, 'off' never shows a waitbar, 'auto' shows a waitbar only when the data is particularly large.<br />
<br />
* '''preprocessing''': { [ ] } A cell containing a preprocessing structure or keyword (see PREPROCESS). Use {'autoscale'} to perform autoscaling on reference and test data.<br />
<br />
* '''classset''' : [ 1 ] indicates which class set in xref to use.<br />
<br />
* '''nomajority''': [ 'error' | {'closest'} | class_number ] Behavior when no majority is found in the votes. 'closest' = return class of closest sample. 'error' = give error message. class_number (i.e. any numerical value) = return this value for no-majority votes (e.g. use 0 to return zero for all no-majority votes)<br />
<br />
* '''strictthreshold''': Probability threshold value to use in strict class assignment, see [[Sample_Classification_Predictions#Class_Pred_Strict]]. Default = 0.5.<br />
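<br />
Because kNN is distance based, the scaling of the variables matters, which is why the '''preprocessing''' option above mentions autoscaling. The following plain Python sketch shows what autoscaling reference and test data could look like. It assumes that the mean and the sample standard deviation (n-1 denominator) are estimated from the reference data and then reused for the test data; the function names are hypothetical and this is not the PREPROCESS implementation.<br />

```python
import statistics

def fit_autoscale(xref):
    """Estimate per-variable mean and standard deviation from reference data."""
    columns = list(zip(*xref))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # n-1 denominator
    return means, stds

def apply_autoscale(rows, means, stds):
    """Center and scale rows with previously estimated statistics."""
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]

# Hypothetical data: the second variable has a much larger range than the first.
xref = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
xtest = [[2.5, 150.0]]

means, stds = fit_autoscale(xref)
xref_scaled = apply_autoscale(xref, means, stds)
xtest_scaled = apply_autoscale(xtest, means, stds)
```

After scaling, both variables contribute comparably to the distance calculation. A variable with zero standard deviation would cause a division by zero in this sketch and would need to be excluded first.<br />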
<br />
===See Also===<br />
<br />
[[analysis]], [[cluster]], [[dbscan]], [[knnscoredistance]], [[modelselector]], [[plsda]], [[simca]], [[svmda]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Xgbda&diff=11018Xgbda2020-02-06T22:12:25Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Gradient Boosted Tree Ensemble for classification (Discriminant Analysis) using XGBoost.<br />
<br />
===Synopsis===<br />
<br />
: model = xgbda(x,options); %identifies model using classes in x<br />
: model = xgbda(x,y,options); %identifies model using y for classes<br />
: pred = xgbda(x,model,options); %makes predictions with a new X-block<br />
: valid = xgbda(x,y,model,options); %performs a "test" call with a new X-block with known y-classes <br />
<br />
Please note that the recommended way to build and apply a Gradient Boosted Tree Ensemble for classification using XGBoost model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
XGB performs calibration and application of gradient boosted decision tree models for classification. These are non-linear models which predict the probability of a test sample belonging to each of the modeled classes, hence they predict the class of a test sample.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset".<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset". If omitted in a calibration call, the x-block must be a dataset object with classes in the first mode (samples). y can always be omitted in a prediction call (when a model is passed) If y is omitted in a prediction call, x will be checked for classes. If found, these classes will be assumed to be the ones corresponding to the model.<br />
* '''model''' = previously generated model (when applying model to new data)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the xgboost model (see [[Standard Model Structure]]). Feature scores are contained in model.detail.xgb.featurescores.<br />
* '''pred''' = structure array with predictions<br />
* '''valid''' = structure array with predictions<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''waitbar''': [ 'off' | {'on'} ] governs display of waitbar during optimization and predictions.<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'xgboost' ] algorithm to use. xgboost is default and currently only option.<br />
* '''classset''' : [ 1 ] indicates which class set in x to use when no y-block is provided.<br />
* '''xgbtype''' : [ 'xgbr' | {'xgbc'} ] Type of XGB to apply. Default is 'xgbc' for classification, and 'xgbr' for regression. <br />
* '''compression''' : [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the XGB model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the xgbtype). Compression can make the XGB more stable and less prone to overfitting.<br />
* '''compressncomp''' : [ 1 ] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''' : [ 'no' |{'yes'}] Use Mahalanobis Distance corrected scores from compression model.<br />
* '''cvi''' : { { 'rnd' 5 } } Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for the error estimate on the final selected parameter values. Alternatively, can be a vector with the same number of elements as x has rows with integer values indicating CV subsets (see crossval).<br />
* '''eta''' : Value(s) to use for XGBoost 'eta' parameter. Eta controls the learning rate of the gradient boosting. Values in range (0,1]. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [0.1, 0.3, 0.5].<br />
* '''max_depth''' : Value(s) to use for XGBoost 'max_depth' parameter. Specifies the maximum depth allowed for the decision trees. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 6 values [1 2 3 4 5 6].<br />
* '''num_round''' : Value(s) to use for XGBoost 'num_round' parameter. Specifies how many rounds of tree creation to perform. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [100 300 500].<br />
<br />
* '''strictthreshold''' : [0.5] Probability threshold for assigning a sample to a class. Affects model.classification.inclass.<br />
* '''predictionrule''' : [ {'mostprobable'} | 'strict' ] governs which classification prediction statistics appear first in the confusion matrix and confusion table summaries.<br />
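<br />
The difference between the 'mostprobable' and 'strict' prediction rules can be illustrated with per-class probabilities. The plain Python sketch below rests on an assumption about the strict rule: a sample is assigned to a class only if exactly one class reaches the probability threshold, and is otherwise left unassigned (labeled 0 here). The function names and the probabilities are hypothetical; see [[Sample Classification Predictions]] for the actual definitions.<br />

```python
def most_probable(probabilities):
    """Class with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)

def strict(probabilities, threshold=0.5, unassigned=0):
    """Class label if exactly one class reaches the threshold, else unassigned."""
    above = [label for label, p in probabilities.items() if p >= threshold]
    return above[0] if len(above) == 1 else unassigned

# Hypothetical predicted probabilities for two samples and three classes.
confident = {1: 0.70, 2: 0.20, 3: 0.10}
ambiguous = {1: 0.45, 2: 0.40, 3: 0.15}

results = {
    "confident": (most_probable(confident), strict(confident)),
    "ambiguous": (most_probable(ambiguous), strict(ambiguous)),
}
```

For the ambiguous sample, the most probable rule still returns class 1, while the strict rule under the assumption above leaves the sample unassigned because no class reaches 0.5.<br />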
<br />
===Algorithm===<br />
Xgbda is implemented using the [https://xgboost.readthedocs.io XGBoost] package. User-specified values are used for XGBoost parameters (see ''options'' above). See [https://xgboost.readthedocs.io/en/latest/parameter.html XGBoost Parameters] for further details of these options. <br />
<br />
The default XGBDA parameters eta, max_depth and num_round have value ranges rather than single values. This xgbda function uses a search over the grid of appropriate parameters using cross-validation to select the optimal XGBoost parameter values and builds an XGBDA model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however.<br />
<br />
===Choosing the best XGBDA parameters===<br />
The recommended technique is to repeatedly test XGBDA using different parameter values and select the parameter combination which gives the best results. XGBDA searches over ranges of parameters eta, max_depth, and num_round, by default. The actual values tested can be specified by the user by setting the associated parameter option value. Each test builds an XGBDA model on the calibration data using cross-validation to produce a mis-classification rate result for that test. These tests are compared over all tested parameter combinations to find which combination gives the best cross-validation prediction (smallest mis-classification). The XGBDA model is then built using the optimal parameter setting.<br />
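<br />
The search described above amounts to an exhaustive grid search. The following plain Python sketch shows the structure of such a search over the default value ranges listed in the options. It is not the xgbda code: <code>toy_cv_error</code> is a hypothetical stand-in for building a model with cross-validation and returning its misclassification rate.<br />

```python
from itertools import product

def grid_search(param_grid, cv_error):
    """Evaluate every parameter combination and keep the one with the lowest error."""
    names = list(param_grid)
    best_params, best_error = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = cv_error(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

# Default value ranges taken from the options listed above.
default_grid = {
    "eta": [0.1, 0.3, 0.5],
    "max_depth": [1, 2, 3, 4, 5, 6],
    "num_round": [100, 300, 500],
}

def toy_cv_error(params):
    # Hypothetical error surface with its minimum at eta=0.3, max_depth=4,
    # num_round=300. A real search would cross-validate a model here.
    return (abs(params["eta"] - 0.3)
            + abs(params["max_depth"] - 4) / 10
            + abs(params["num_round"] - 300) / 1000)

best_params, best_error = grid_search(default_grid, toy_cv_error)
n_combinations = len(list(product(*default_grid.values())))
```

With the default ranges the grid contains 3 x 6 x 3 = 54 combinations, each of which requires a cross-validated model, which is why passing single values shortens the calculation.<br />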
<br />
====XGBDA parameter search summary plot====<br />
When XGBDA is run in the Analysis window it is possible to view the results of the XGBDA parameter search by clicking on the "Variance Captured" plot icon in the toolbar. If at least two XGB parameters were initialized with parameter ranges, for example eta and max_depth, then a figure appears showing the performance of the model plotted against eta and max_depth (Fig. 1). The measure of performance used is the misclassification rate, defined as the number of incorrectly classified samples divided by the number of classified samples, based on the cross-validation (CV) predictions for the calibration data. The lowest value of misclassification rate is marked on the plot by an "X" and this indicates the values of the XGBDA eta and max_depth parameters which yield the best performing model. The actual XGBDA model is built using these parameter values. If all three parameters, eta, max_depth, and num_round, have ranges of values then you can view the classification performance over the other variables' ranges by clicking on the blue horizontal arrow toolbar icon above the plot. In Analysis XGBDA the optimal parameters are also reported in the model summary window which is shown when you mouse-over the model icon, once the model is built. If you are using the command line XGBDA function to build a model then the optimal XGBDA parameters are shown in model.detail.xgb.cvscan.best. <br />
<gallery caption="Fig. 1. Parameter search summary" widths="450px" heights="300px" perrow="1"><br />
File:Xgbda_survey.png|Misclassification as a function of XGB parameters.<br />
</gallery><br />
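<br />
The misclassification rate defined above is straightforward to compute from true and predicted classes. The plain Python sketch below is an illustration with made-up class labels. Treating unassigned predictions as "not classified" and leaving them out of the denominator is an assumption about how "number of classified samples" should be read.<br />

```python
def misclassification_rate(true_classes, predicted_classes, unassigned=None):
    """Incorrectly classified samples divided by the number of classified samples."""
    pairs = [(t, p) for t, p in zip(true_classes, predicted_classes)
             if p != unassigned]
    if not pairs:
        raise ValueError("no classified samples")
    wrong = sum(1 for t, p in pairs if t != p)
    return wrong / len(pairs)

# Hypothetical cross-validation predictions for five calibration samples.
true_classes = [1, 1, 2, 2, 3]
cv_predictions = [1, 2, 2, 2, 3]
rate = misclassification_rate(true_classes, cv_predictions)

# Same data, but the second sample was left unassigned (labeled 0).
rate_with_unassigned = misclassification_rate(
    true_classes, [1, 0, 2, 2, 3], unassigned=0)
```

In the first case one of five classified samples is wrong, so the rate is 0.2. In the second case the unassigned sample is excluded and the remaining four are all correct.<br />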
<br />
===Variable Importance Plot===<br />
The ease of interpreting single decision trees is lost when a sequence of boosted trees is used, as in XGBoost. One commonly used diagnostic quantity for interpreting boosted trees is the "feature importance", or "variable importance" in PLS_Toolbox terminology. This is a measure of each variable's importance to the tree ensemble construction. It is calculated for each variable by summing up the "gain" on each node where that variable was used for splitting, over all trees in the sequence. Here, "gain" refers to the reduction in the loss function being optimized. The important variables are shown in the XGBDA Analysis window when the model is built, ranked by their importance (Fig. 2). <br />
<gallery caption="Fig. 2. Variable importance plot" widths="450px" heights="300px" perrow="1"><br />
File:Xgbda_varimp.png|XGBDA variable importance. Right-click in the plot area to copy the indices of the important variables. Clicking on the "Plot" button opens a version of the plot which can be zoomed or panned.<br />
</gallery><br />
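<br />
The gain-based importance described above is a sum over split nodes. The following plain Python sketch assumes a simplified, hypothetical representation in which each tree is a list of (variable index, gain) pairs, one per split node. It does not read an actual XGBoost model.<br />

```python
from collections import defaultdict

def variable_importance(trees):
    """Sum the gain of every split per variable and rank variables by the total."""
    importance = defaultdict(float)
    for tree in trees:
        for variable, gain in tree:
            importance[variable] += gain
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)

# Hypothetical ensemble of three trees; each tuple is (variable index, gain).
trees = [
    [(0, 4.0), (2, 1.5)],
    [(0, 2.0), (1, 0.5)],
    [(2, 1.0)],
]
ranking = variable_importance(trees)
```

Here variable 0 is used for splitting in two trees with a total gain of 6.0, so it ranks first, followed by variable 2 (2.5) and variable 1 (0.5).<br />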
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[knn]], [[lwr]], [[pls]], [[plsda]], [[xgb]], [[xgbengine]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Xgb&diff=11017Xgb2020-02-06T22:11:34Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Gradient Boosted Tree Ensemble for regression using XGBoost.<br />
<br />
===Synopsis===<br />
<br />
:model = xgb(x,y,options); %identifies model (calibration step)<br />
:pred = xgb(x,model,options); %makes predictions with a new X-block<br />
:valid = xgb(x,y,model,options); %performs a "test" call with a new X-block and known y-values<br />
<br />
Please note that the recommended way to build and apply a Gradient Boosted Tree Ensemble for regression using XGBoost model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
To choose between regression and classification, use the xgbtype option:<br />
:: regression : xgbtype = 'xgbr'<br />
:: classification : xgbtype = 'xgbc'<br />
It is recommended that classification be done through the xgbda function.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset",<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset",<br />
* '''model''' = previously generated model (when applying model to new data)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the xgboost model (see [[Standard Model Structure]]). Feature scores are contained in model.detail.xgb.featurescores.<br />
* '''pred''' = structure array with predictions<br />
* '''valid''' = structure array with predictions<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''waitbar''': [ 'off' | {'on'} ] governs display of waitbar during optimization and predictions.<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'xgboost' ] algorithm to use. xgboost is default and currently only option.<br />
* '''classset''' : [ 1 ] indicates which class set in x to use when no y-block is provided.<br />
* '''xgbtype''' : [ {'xgbr'} | 'xgbc' ] Type of XGB to apply. Default is 'xgbc' for classification, and 'xgbr' for regression. <br />
* '''compression''' : [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the XGB model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the xgbtype). Compression can make the XGB more stable and less prone to overfitting.<br />
* '''compressncomp''' : [ 1 ] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''' : [ 'no' |{'yes'}] Use Mahalanobis Distance corrected scores from compression model.<br />
* '''cvi''' : { { 'rnd' 5 } } Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for the error estimate on the final selected parameter values. Alternatively, can be a vector with the same number of elements as x has rows with integer values indicating CV subsets (see crossval).<br />
* '''eta''' : Value(s) to use for XGBoost 'eta' parameter. Eta controls the learning rate of the gradient boosting. Values in range (0,1]. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [0.1, 0.3, 0.5].<br />
* '''max_depth''' : Value(s) to use for XGBoost 'max_depth' parameter. Specifies the maximum depth allowed for the decision trees. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 6 values [1 2 3 4 5 6].<br />
* '''num_round''' : Value(s) to use for XGBoost 'num_round' parameter. Specifies how many rounds of tree creation to perform. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [100 300 500].<br />
<br />
===Algorithm===<br />
Xgb is implemented using the [https://xgboost.readthedocs.io XGBoost] package. User-specified values are used for XGBoost parameters (see ''options'' above). See [https://xgboost.readthedocs.io/en/latest/parameter.html XGBoost Parameters] for further details of these options. <br />
<br />
The default XGB parameters eta, max_depth and num_round have value ranges rather than single values. This xgb function uses a search over the grid of appropriate parameters using cross-validation to select the optimal XGBoost parameter values and builds an XGB model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however.<br />
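<br />
As a rough illustration of what the default grid search implies, the sketch below enumerates the default candidate values listed in ''options'' above and counts the combinations. It uses only base MATLAB and does not call xgb; the variable names are illustrative, and the fit count assumes the default 5 cross-validation splits with a single iteration.<br />

```matlab
% Default candidate values quoted in the options list above.
eta       = [0.1 0.3 0.5];
max_depth = 1:6;
num_round = [100 300 500];

% Enumerate every (eta, max_depth, num_round) combination.
[E, D, R] = ndgrid(eta, max_depth, num_round);
combos    = [E(:) D(:) R(:)];        % one row per candidate combination

ncombos = size(combos, 1);           % 3 * 6 * 3 combinations
nsplits = 5;                         % from the default cvi {'rnd' 5} (assumed 1 iteration)
nfits   = ncombos * nsplits;         % cross-validation fits before the final model

fprintf('%d combinations, about %d cross-validation fits\n', ncombos, nfits);
```

Passing a single value for one or more of the three parameters shrinks this grid accordingly, which is the simplest way to shorten the search.<br />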
<br />
===Choosing the best XGB parameters===<br />
The recommended technique is to repeatedly test XGB using different parameter values and select the parameter combination which gives the best results. By default, XGB searches over ranges of the parameters eta, max_depth, and num_round. The values tested can be specified by the user by setting the associated parameter option value. Each test builds an XGB model on the calibration data and uses cross-validation to produce a root mean square error of cross-validation (RMSECV) for that test. The RMSECV values are compared over all tested parameter combinations to find the combination which gives the best cross-validation prediction (smallest RMSECV). The XGB model is then built using the optimal parameter setting.<br />
<br />
====XGB parameter search summary plot====<br />
When XGB is run in the Analysis window it is possible to view the results of the XGB parameter search by clicking on the "Variance Captured" plot icon in the toolbar. If at least two XGB parameters were initialized with parameter ranges, for example eta and max_depth, then a figure appears showing the performance of the model plotted against eta and max_depth (Fig. 1). The measure of performance used is the root mean square error based on the cross-validation predictions for the calibration data (RMSECV). The lowest value of RMSECV is marked on the plot by an "X", which indicates the values of the XGB eta and max_depth parameters that yield the best performing model. The actual XGB model is built using these parameter values. If all three parameters, eta, max_depth, and num_round, have ranges of values then you can view the prediction performance over the other variables' ranges by clicking on the blue horizontal arrow toolbar icon above the plot. In Analysis XGB the optimal parameters are also reported in the model summary window, which is shown when you mouse-over the model icon once the model is built. If you are using the command line XGB function to build a model then the optimal XGB parameters are shown in model.detail.xgb.cvscan.best. <br />
<br />
<gallery caption="Fig. 1. Parameter search summary" widths="450px" heights="300px" perrow="1"><br />
File:Xgb_survey.png|RMSECV as a function of XGB parameters.<br />
</gallery><br />
<br />
===Variable Importance Plot===<br />
The ease of interpreting single decision trees is lost when a sequence of boosted trees is used, as in XGBoost. One commonly used diagnostic quantity for interpreting boosted trees is the "feature importance", or "variable importance" in PLS_Toolbox terminology. This is a measure of each variable's importance to the tree ensemble construction. It is calculated for each variable by summing up the "gain" on each node where that variable was used for splitting, over all trees in the sequence. Here "gain" refers to the reduction in the loss function being optimized.<br />
The important variables are shown in the XGB Analysis window when the model is built, ranked by their importance (Fig. 2). <br />
<gallery caption="Fig. 2. Variable importance plot" widths="450px" heights="300px" perrow="1"><br />
File:Xgbda_varimp.png|XGB variable importance. Right-click in the plot area to copy the indices of the important variables. Clicking on the "Plot" button opens a version of the plot which can be zoomed or panned.<br />
</gallery><br />
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[knn]], [[lwr]], [[pls]], [[plsda]], [[xgbda]], [[xgbengine]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Npls&diff=11016Npls2020-02-06T22:11:05Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Multilinear-PLS (N-PLS) for true multi-way regression.<br />
<br />
===Synopsis===<br />
<br />
:model = npls(x,y,ncomp,''options'')<br />
:pred = npls(x,ncomp,model,''options'')<br />
<br />
Please note that the recommended way to build and apply an N-PLS model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
NPLS fits a multilinear PLS1 or PLS2 regression model to x and y [R. Bro, J. Chemom., 1996, 10(1), 47-62]. The NPLS function also can be used for calibration and prediction.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block,<br />
<br />
* '''y''' = Y-block, and<br />
<br />
* '''ncomp''' = the number of factors to compute, or<br />
<br />
* '''model''' = in prediction mode, this is a structure containing an NPLS model.<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure (see: [[Standard Model Structure]]) with the following fields:<br />
<br />
* '''modeltype''': 'NPLS',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''reg''': cell array with regression coefficients,<br />
<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
<br />
* '''core''': cell array with the NPLS core,<br />
<br />
* '''pred''': cell array with model predictions for each input data block,<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
* '''''options''''' = options structure containing the fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
<br />
* '''outputregrescoef''': if this is set to 0, no regression coefficients associated directly with the X-block are calculated (relevant for large arrays), and<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is like 'standard' but the residual limits in the model structure are also left empty (model.detail.reslim.lim95, model.detail.reslim.lim99).<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[conload]], [[datahat]], [[explode]], [[gram]], [[modlrder]], [[mpca]], [[crossval]], [[outerm]], [[parafac]], [[parafac2]], [[pls]], [[tld]], [[unfoldm]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Lwr&diff=11015Lwr2020-02-06T22:10:17Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
LWR locally weighted regression for univariate Y.<br />
<br />
===Synopsis===<br />
<br />
:model = lwr(x,y,ncomp,''npts'',''options''); %identifies model (calibration step)<br />
:pred = lwr(x,model,''options''); %makes predictions with a new X-block<br />
:valid = lwr(x,y,model,''options''); %makes predictions with new X- & Y-block<br />
:lwr % Launches an Analysis window with LWR as the selected method.<br />
<br />
Please note that the recommended way to build and apply an LWR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]].<br />
<br />
===Description===<br />
<br />
LWR calculates a single locally weighted regression model using the given number of principal components <tt>ncomp</tt> to predict a dependent variable <tt>y</tt> from a set of independent variables <tt>x</tt>. <br />
<br />
LWR models are useful for performing predictions when the dependent variable, <tt>y</tt>, has a non-linear relationship with the measured independent variables, <tt>x</tt>. Because such responses can often be approximated by a linear function on a small (local) scale, LWR models work by choosing a subset of the calibration data (the "local" calibration samples) to create a "local" model for a given new sample. The local calibration samples are identified as the samples closest to a new sample in the score space of a PCA model (the "selector model"), using the Mahalanobis distance measure. Models are defined by the number of principal components used for the selector model (<tt>ncomp</tt>) and the number of points (samples) selected as local (<tt>npts</tt>). <br />
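<br />
The selection step described above can be sketched in base MATLAB as follows. This is a minimal illustration on synthetic data, not the code used inside lwr: the data, variable names and the SVD-based PCA are all assumptions made for the example. Because PCA scores are uncorrelated, the Mahalanobis distance in score space reduces to a Euclidean distance on scores scaled by their standard deviations.<br />

```matlab
rng(1);                              % reproducible synthetic data
X     = randn(50, 8);                % hypothetical calibration x-block (50 samples)
xnew  = randn(1, 8);                 % hypothetical new sample
ncomp = 3;                           % PCs in the selector model
npts  = 10;                          % number of local samples to keep

% PCA selector model on mean-centered data via SVD.
mu        = mean(X, 1);
Xc        = X - mu;                  % implicit expansion (MATLAB R2016b or later)
[~, ~, V] = svd(Xc, 'econ');
P         = V(:, 1:ncomp);           % loadings
T         = Xc * P;                  % calibration scores
tnew      = (xnew - mu) * P;         % scores of the new sample

% Mahalanobis distance in score space: scale each PC by its score std.
s = std(T, 0, 1);
d = sqrt(sum(((T - tnew) ./ s).^2, 2));

% Keep the npts calibration samples closest to the new sample.
[~, order] = sort(d, 'ascend');
localidx   = order(1:npts);
```

The rows <tt>X(localidx,:)</tt> are then the "local" calibration samples from which a local regression model would be built.<br />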
<br />
Once the samples are selected, one of three algorithms is used to calculate the local model:<br />
:* '''globalpcr''' = the scores from the PCA selector model (for the selected samples) are used to calculate a PCR model. This model is more stable when there are fewer samples being selected, but may not perform as well with high degrees of non-linearity.<br />
:* '''pcr''' / '''pls''' = the raw data of the selected samples are used to create a weighted PCR or PLS model. These models are more adaptable to highly varying non-linearity but may also be less stable when fewer samples are being selected. <br />
<br />
The LWR function can be used in 'prediction mode' to apply a previously built LWR model, <tt>model</tt>, to a new set of data in <tt>x</tt>, in order to generate y-values for these data. <br />
<br />
Furthermore, if matching x-block and y-block measurements are available for an external test set, then LWR can be used in 'validation mode' to predict the y-values of the test data from the model <tt>model</tt> and <tt>x</tt>, and allow comparison of these predicted y-values to the known y-values <tt>y</tt>.<br />
<br />
For more information on the basic LWR algorithm, see <tt>T. Naes, T. Isaksson, B. Kowalski, Anal Chem 62 (1990) 664-673.</tt><br />
For details on the use of y distance when selecting nearest points (option alpha), see <tt>Z. Wang, T. Isaksson, B. R. Kowalski, Anal Chem 66 (1994) 249–260.</tt><br />
<br />
Note: Calling lwr with no inputs starts the graphical user interface (GUI) for this analysis method. There is a<br />
[[Image:Movie.png|link=http://www.eigenvector.com/eigenguide.php?m=Nonlinear_methods_3]]<br />
[http://www.eigenvector.com/eigenguide.php?m=Nonlinear_methods_3 video using the LWR interface] on the Eigenvector Research web page.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset"<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset"<br />
* '''ncomp''' = the number of latent variables to be calculated (positive integer scalar)<br />
* '''npts''' = the number of points to use in local regression (positive integer scalar)<br />
* '''model''' = previously generated lwr model<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model (see [[Standard Model Structure]])<br />
* '''pred''' = a structure, similar to '''model''', that contains scores, predictions, etc. for the new data.<br />
* '''valid''' = a structure, similar to '''model''', that contains scores, predictions, additional y-block statistics, etc. for the new data.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''' [ 'none' | {'final'} ], governs level of plotting,<br />
* '''waitbar''': [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'globalpcr' | {'pcr'} | 'pls' ] LWR algorithm to use. Method of regression after samples are selected. 'globalpcr' performs PCR based on the PCs calculated from the entire calibration data set but a regression vector calculated from only the selected samples. 'pcr' and 'pls' calculate a local PCR or PLS model based only on the selected samples.<br />
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.<br />
* '''confidencelimit''': [ {0.95} ], confidence level for Q and T2 limits, a value of zero (0) disables calculation of confidence limits,<br />
* '''reglvs''': [] Used only when algorithm is 'pcr' or 'pls', this is the number of latent variables/principal components to use in the local regression model, if different from the number selected in the SSQ Table. The number of components selected in the SSQ table is used to generate the global PCA model which is used to select the local calibration samples. [] (Empty) implies LWRPRED should use the same number of latent variables in the local regression as were used in the global PCA model. NOTE: This option is NOT used when algorithm is 'globalpcr'.<br />
* '''iter''': [{5}] Iterations in determining local points. Used only when alpha > 0 (i.e. when using y-distance scaling).<br />
* '''alpha''': [ {0} ], has value in range [0-1]. Weighting of y-distances in selection of local points. 0 = do not consider y-distances {default}, 1 = consider ONLY y-distances. With any positive alpha, the algorithm will tend to select samples which are close in the PC space and which also have similar y-values. This is accomplished by repeating the prediction multiple times. In the first iteration, the selection of samples is done only on the PC space. Subsequent iterations take into account the comparison between the predicted y-value of the new sample and the measured y-values of the calibration samples.<br />
The default options can be retrieved using: options = lwr('options');.<br />
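<br />
A command-line session using the call forms from the Synopsis might look like the sketch below. It requires PLS_Toolbox, and the data, variable names and the choices of <tt>ncomp</tt> and <tt>npts</tt> are hypothetical; only the call signatures and option fields documented on this page are used.<br />

```matlab
% Hypothetical data: a noisy non-linear response of 5 x-variables.
rng(0);
xcal  = rand(60, 5);
ycal  = sin(2*xcal(:,1)) + xcal(:,2).^2 + 0.01*randn(60, 1);
xtest = rand(10, 5);

options           = lwr('options');  % default options structure
options.display   = 'off';           % no command window output
options.plots     = 'none';          % no plots
options.algorithm = 'globalpcr';     % described above as more stable with few local samples

ncomp = 3;    % PCs in the selector model (illustrative choice)
npts  = 20;   % local calibration samples per prediction (illustrative choice)

model = lwr(xcal, ycal, ncomp, npts, options);   % calibration step
pred  = lwr(xtest, model, options);              % predictions for new x-block
```

In practice <tt>ncomp</tt> and <tt>npts</tt> would be chosen by comparing cross-validated or test-set errors rather than fixed in advance.<br />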
<br />
===See Also===<br />
<br />
[[analysis]], [[ann]], [[lwrpred]], [[modelstruct]], [[pls]], [[pcr]], [[preprocess]], [[svm]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Cls&diff=11014Cls2020-02-06T22:09:24Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Classical Least Squares regression for multivariate Y.<br />
<br />
===Synopsis===<br />
<br />
: model = cls(x,options); %identifies model (calibration step)<br />
: model = cls(x,y,options); %identifies model (calibration step)<br />
: pred = cls(x,model,options); %makes predictions with a new X-block<br />
: valid = cls(x,y,model,options); %makes predictions with new X- & Y-block<br />
: cls % Launches the Analysis window with CLS as the selected method.<br />
<br />
Please note that the recommended way to build and apply a CLS model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
CLS identifies models of the form '''y = Xb + e'''.<br />
<br />
====Inputs====<br />
* '''x''' = X-block: predictor block (2-way array or DataSet Object).<br />
<br />
====Optional Inputs====<br />
<br />
* '''y''' = Y-block: predicted block (2-way array or DataSet Object). The number of columns of y indicates the number of components in the model (each row specifies the mixture present in the given sample). If y is omitted, x is assumed to be a set of pure component responses (e.g. spectra) defining the model itself.<br />
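<br />
The roles of <tt>x</tt> and <tt>y</tt> can be illustrated with a small base-MATLAB sketch. It assumes the usual classical least squares forward model, X = C*S, where the rows of C (the y-block) give the amount of each component and the rows of S are the pure component responses. This is an illustration of the idea on noise-free synthetic data, not the internal code of cls, and all names are hypothetical.<br />

```matlab
% Hypothetical pure-component responses: 2 components, 6 channels.
S = [1.0 0.8 0.3 0.1 0.0 0.0;
     0.0 0.1 0.4 0.9 1.0 0.5];

% Calibration mixtures: each row of C gives the amounts of the 2 components.
C = [1 0; 0 1; 0.5 0.5; 0.2 0.8; 0.7 0.3];
X = C * S;                       % noise-free responses, X = C*S

% Calibration: least-squares estimate of the pure responses from X and C.
Sest = C \ X;                    % 2 x 6

% Prediction: least-squares estimate of the amounts in new samples.
Cnew  = [0.4 0.6; 0.9 0.1];
Xnew  = Cnew * S;
Cpred = Xnew / Sest;             % solves Cpred * Sest = Xnew

disp(max(abs(Cpred(:) - Cnew(:))));   % recovery error, near machine precision here
```

When <tt>y</tt> is omitted, the calibration step is skipped and the supplied <tt>x</tt> plays the role of <tt>Sest</tt> directly.<br />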
<br />
====Outputs====<br />
* '''model''' = standard model structure containing the CLS model (See [[Standard Model Structure]]).<br />
* '''pred''' = structure array with predictions.<br />
* '''valid''' = structure array with predictions.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
<br />
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''preprocessing''': { [] [] } preprocessing structure (see PREPROCESS).<br />
* '''algorithm''': [ {'ls'} | 'nnls' | 'snnls' | 'cnnls' | 'stepwise' | 'stepwisennls' ] Specifies the regression algorithm.<br />
:Options are: <br />
: ls = a standard least-squares fit.<br />
: snnls = non-negative least squares on spectra (S) only.<br />
: cnnls = non-negative least squares on concentrations (C) only.<br />
: nnls = non-negative least squares fit on both C and S.<br />
: stepwise = stepwise least squares<br />
: stepwisennls = stepwise non-negative least squares<br />
<br />
* '''confidencelimit''': [{0.95}] Confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[pcr]], [[pls]], [[preprocess]], [[stepwise regrcls]], [[testrobustness]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Svmda&diff=11013Svmda2020-02-06T22:07:47Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
SVMDA Support Vector Machine (LIBSVM) for classification. Use SVM for support vector machine regression ([[Svm]]).<br />
<br />
===Synopsis===<br />
<br />
:model = svmda(x,options); %identifies model (calibration step) based on x-block classes<br />
:model = svmda(x,y,options); %identifies model (calibration step)<br />
:pred = svmda(x,model,options); %makes predictions with a new X-block<br />
:pred = svmda(x,y,model,options); %performs a "test" call with a new X-block and known <br />
:svmda % Launches an analysis window with svmda as the selected method. <br />
<br />
Please note that the recommended way to build and apply an SVMDA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]].<br />
<br />
===Description===<br />
<br />
SVMDA performs calibration and application of Support Vector Machine (SVM) classification models. (Please see the svm function for support vector machine regression problems). These are non-linear models which can be used for classification problems. The model consists of a number of support vectors (essentially samples selected from the calibration set) and non-linear model coefficients which define the non-linear mapping of variables in the input x-block. The model allows prediction of the classification based on either the classes field of the calibration x-block or a y-block which contains integer-valued classes. It is recommended that regression be done through the [[Svm|svm]] function.<br />
<br />
Svmda is implemented using the LIBSVM package, which provides both C-support vector classification (C-SVC) and nu-support vector classification (nu-SVC). Linear and Gaussian Radial Basis Function kernel types are supported by this function.<br />
<br />
Note: Calling svmda with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing integer values,<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'SVM',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved),<br />
** '''classification''': information about the classification of X-block samples (see description at [[Standard_Model_Structure#model|Standard Model]]). For more information on class predictions, see [[Sample Classification Predictions]].<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.svm.model: Matlab version of the libsvm svm_model (Java). Note that the number of support vectors used is given by model.detail.svm.model.l. It is useful to check this because it can indicate overfitting if most of the calibration samples are used as support vectors, or can indicate problems fitting a model if there are no support vectors (in which case all prediction values will equal a constant value, a weighted mean).<br />
*** model.detail.svm.cvscan: results of CV parameter scan<br />
*** model.detail.svm.svindices: Indices of X-block samples which are support vectors.<br />
<br />
* '''pred''' = a structure, similar to '''model''', for the new data.<br />
** '''pred''': The vector pred.pred{2} will contain the class predictions for each sample.<br />
<br />
For more information on class predictions, see [[Sample Classification Predictions]]<br />
<br />
====Options====<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''' [ 'none' | {'final'} ], governs level of plotting,<br />
* '''classset''' [ {1} ], indicates which class set in x to use when no y-block is provided,<br />
* '''preprocessing''': {[]} preprocessing structures for x block (see PREPROCESS). NOTE that y-block preprocessing is NOT used with SVMDA. Any y-preprocessing will be ignored.<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the SVM model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the svmtype). Compression can make the SVM more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.<br />
* '''algorithm''': [ 'libsvm' ] algorithm to use. libsvm is default and currently only option.<br />
* '''kerneltype''': [ 'linear' | {'rbf'} ], SVM kernel to use. 'rbf' is default.<br />
* '''svmtype''': [ {'c-svc'} | 'nu-svc' ] Type of SVM to apply. The default is 'c-svc' for classification.<br />
* '''probabilityestimates''': [0| {1} ], whether to train the SVC model for probability estimates, 0 or 1 (default 1).<br />
<br />
* '''cvtimelimit''': Set a time limit (seconds) on each individual cross-validation sub-calculation when searching over supplied SVM parameter ranges for optimal parameters. Only relevant if parameter ranges are used for SVM parameters such as cost, epsilon, gamma or nu. Default is 10 (seconds).<br />
* '''splits''': Number of subsets to divide data into when applying n-fold cross validation. Default is 5. This option is only used when the "cvi" option is empty.<br />
* '''cvi''': {{}} Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for the error estimate on the final selected parameter values. If empty (the default), then random cross-validation is done based on the "splits" option.<br />
* '''gamma''': Value(s) to use for LIBSVM kernel gamma parameter. Default is 15 values from 10^-6 to 10, spaced uniformly in log.<br />
* '''cost''': Value(s) to use for LIBSVM 'c' parameter. Default is 11 values from 10^-3 to 100, spaced uniformly in log.<br />
* '''nu''': Value(s) to use for LIBSVM 'n' parameter (nu of nu-SVC, and nu-SVR). Default is the set of values [0.2, 0.5, 0.8].<br />
* '''strictthreshold''': Probability threshold value to use in strict class assignment, see [[Sample_Classification_Predictions#Class_Pred_Strict]]. Default = 0.5.<br />
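<br />
The default search grids for ''gamma'' and ''cost'' described above can be reproduced with <tt>logspace</tt>, as sketched below in base MATLAB. That these vectors match the toolbox defaults exactly is an assumption based on the descriptions in the list; the coarser grids and all variable names are illustrative.<br />

```matlab
% Grids matching the descriptions in the options list above.
gamma_default = logspace(-6, 1, 15);   % 15 values from 1e-6 to 10
cost_default  = logspace(-3, 2, 11);   % 11 values from 1e-3 to 100

% Coarser grids for a quick preliminary search (illustrative choice).
gamma_coarse = logspace(-6, 1, 4);
cost_coarse  = logspace(-3, 2, 3);

nfull   = numel(gamma_default) * numel(cost_default);   % combinations in the full grid
ncoarse = numel(gamma_coarse)  * numel(cost_coarse);    % combinations in the coarse grid

fprintf('full grid: %d combinations, coarse grid: %d\n', nfull, ncoarse);

% With PLS_Toolbox available, vectors like these would be assigned to the
% documented option fields, e.g. options.gamma and options.cost.
```

Every combination is cross-validated, so reducing the grid from 165 to 12 combinations shortens the parameter search roughly in proportion.<br />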
<br />
===Algorithm===<br />
Svmda uses the LIBSVM implementation using the user-specified values for the LIBSVM parameters (see ''options'' above). See [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf] for further details of these options. <br />
<br />
The default SVMDA parameters cost, nu and gamma have value ranges rather than single values. This svm function uses a search over the grid of appropriate parameters using cross-validation to select the optimal SVM parameter values and builds an SVM model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however.<br />
<br />
====Model building performance====<br />
Building a single SVM model can sometimes be slow, especially if the calibration dataset is large. Using ranges for the SVM parameters to search for the optimal parameter combination increases the final model building time significantly. If cross-validation is used the calculation time increases again, possibly dramatically if the number of CV subsets is large. Some suggestions for faster SVM building include: <br />
:1) Turning CV off ("none") during preliminary analyses. This is MUCH faster and cross-validation is still performed using a default "Random Subsets" with 5 data splits and 1 iteration,<br />
:2) Using a coarse grid of SVM parameter values to search over for optimal values, <br />
:3) Choosing the CV method carefully, at least initially. For example, use "Random Subsets" with a small number of data splits (e.g. 5) and a small "Number of Iterations" (e.g. 1).<br />
:4) Using the compression option if the number of variables is large.<br />
<br />
====C-SVC and nu-SVC====<br />
There are two commonly used versions of SVM classification, 'C-SVC' and 'nu-SVC'.<br />
The original SVM formulations for Classification (SVC) used parameter C, [0, inf), to apply a penalty to the optimization for data points which were not correctly separated by the classifying hyperplane. An alternative version of SVM classification was later developed where the C penalty parameter was replaced by a nu parameter, [0,1], which applies a slightly different penalty. The main motivation for the nu version of SVM is that it has a more meaningful interpretation, because nu represents an upper bound on the fraction of training samples which are errors (misclassified) and a lower bound on the fraction of samples which are support vectors. Some users feel nu is more intuitive to use than C.<br />
C and nu are just different versions of the penalty parameter. The same optimization problem is solved in either case. Thus it should not matter which form of SVM you use, C versus nu for classification. PLS_Toolbox uses the C version by default since this was the original formulation and is the most commonly used form. For more details on 'nu' SVMs see [http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf]<br />
<br />
The user must provide parameters (or parameter ranges) for SVM classification as:<br />
:*'C-SVC':<br />
::'''C''', (using linear kernel), or<br />
::'''C''', '''gamma''' (using radial basis function kernel),<br />
<br />
:*'nu-SVC':<br />
::'''nu''', (using linear kernel), or<br />
::'''nu''', '''gamma''' (using radial basis function kernel),<br />
<br />
====Class prediction probabilities====<br />
LIBSVM calculates the probabilities of each sample belonging to each possible class if the "Probability Estimates" option is enabled (default setting) in the SVMDA analysis window (or if the ''probabilityestimates'' option is set equal to 1 (default value) in command line usage). The method is explained in [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf], section 8, "Probability Estimates". <br />
PLS_Toolbox provides these probability estimates in model.detail.predprobability or pred.detail.predprobability, which are nsample x nclasses arrays. The columns are the classes, in the order given by model.detail.svm.model.label (or pred.detail.svm.model.label), where the class values are those in the input X-block.class{1} or Y-block. These probabilities are used to find the most likely class for each sample, which is saved in pred.pred{2} and model.detail.predictedclass. This is a vector of length equal to the number of samples with values equal to class values (model.detail.class{1}).<br />
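<br />
The step from probability estimates to a most likely class can be sketched in base MATLAB as below. The probability array and label vector are hypothetical stand-ins for the predprobability and label fields named above, and the code illustrates the idea rather than reproducing the toolbox implementation.<br />

```matlab
% Hypothetical probability estimates: 4 samples (rows) by 3 classes (columns).
prob  = [0.7 0.2 0.1;
         0.1 0.3 0.6;
         0.2 0.5 0.3;
         0.4 0.4 0.2];
label = [3 1 2];                  % hypothetical class value for each column

[pmax, col] = max(prob, [], 2);   % highest probability and its column, per sample
predclass   = reshape(label(col), [], 1);   % map column index to class value

disp([predclass pmax]);           % most likely class and its probability
```

Note that the columns follow the label order rather than sorted class values, so the mapping through <tt>label</tt> is needed. In this sketch a tie (last row) resolves to the first column, which is simply how <tt>max</tt> behaves.<br />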
<br />
====SVMDA Parameters====<br />
<br />
* '''cost''': Cost [0 ->inf] represents the penalty associated with errors. An error is a sample which does not lie on the proper side of the margin for that sample's class. Increasing the cost value causes closer fitting to the calibration/training data and usually a narrower margin width. ''nu'' is not required if ''cost'' is specified.<br />
* '''gamma''': Kernel ''gamma'' parameter controls the shape of the separating hyperplane. Increasing gamma usually increases number of support vectors.<br />
* '''nu''': Nu (0 -> 1] is an alternative parameter for specifying the penalty associated with errors. It indicates a lower bound on the number of support vectors to use, given as a fraction of total calibration samples, and an upper bound on the fraction of training samples which are errors (misclassified). ''cost'' is not required if ''nu'' is specified. There is a constraint on the nu parameter, however, related to the number of training data points in each class. For every class pair, having n1 and n2 data points each, nu must be less than 2*min(n1, n2)/(n1+n2), i.e. nu must be less than the ratio of the smaller class count to the pair average class count. SVMDA automatically checks for this possibility in nu-svc.<br />
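<br />
The constraint on ''nu'' can be checked by hand with a short base-MATLAB sketch. The class counts below are hypothetical, and the candidate ''nu'' values are the defaults from the options list. SVMDA is described above as performing this check automatically, so the sketch is only meant to show how the bound arises.<br />

```matlab
counts = [100 30 20];            % hypothetical number of samples in each class
nuvals = [0.2 0.5 0.8];          % default nu candidates from the options list

pairs = nchoosek(1:numel(counts), 2);      % every class pair
n1    = counts(pairs(:, 1));
n2    = counts(pairs(:, 2));
bound = 2 * min(n1, n2) ./ (n1 + n2);      % per-pair upper bound on nu

numax    = min(bound);                     % tightest bound over all pairs
feasible = nuvals(nuvals < numax);         % candidates satisfying every pair

fprintf('nu must be below %.4f\n', numax);
```

The most unbalanced class pair sets the limit, so strongly unbalanced class sizes can rule out most of the default ''nu'' values.<br />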
<br />
===Examples of SVMDA models on simple two-class data===<br />
Users of SVMs will usually not pick the values for their SVM parameters cost/nu and gamma directly, because there is no simple way to know what values would provide a good model for their data. Instead, they should search over parameter ranges, testing SVM models to find which parameter combination works best for their data, as discussed below. However, it is still useful to understand how these parameters affect the way the SVM works on the data. For this reason we look here at the effects of cost/nu and gamma on a very simple dataset, an x-block of two variables where the data belong to just two classes, to allow visualization of the optimal separating boundary. In practice the user will usually work with multivariate x-block data having more than two variables and data belonging to multiple classes, so will only view the predicted classes versus actual classes and related skill measures, and some details such as the number of support vectors involved.<br />
<br />
The effects of the cost, gamma and nu parameters on SVMDA are examined by applying SVMDA to a simple two-variable (x1,x2) dataset where 100 samples belong to red class and 100 to blue class. This is equivalent to an X-block having dimensions 200x2. The data are distributed as three clusters, two red clusters with 50 points each which lie nearly on either side of a blue cluster which has 100 points. SVMDA attempts to draw a dividing line between these clusters separating the x1 vs x2 domain into red and blue regions. It uses these calibration data points to find the optimal separating decision boundary (hyperplane) with the widest separating margin. Any future test samples will be classified as red or blue according to which side of the separating boundary they occur on. The following images show SVMDA classification models trained on these data using an RBF kernel and varying values for the cost, gamma and nu parameters. Note that an SVMDA model with linear kernel cannot be a good model for this dataset since the red and blue points cannot be separated by a straight line, linear boundary.<br />
<br />
<gallery caption="Fig. 1. Two-class dataset" widths="400px" heights="300px" perrow="1"><br />
File:Two_class_data.png|Two-variable data with 100 red samples and 100 blue samples.<br />
</gallery><br />
<br />
The figures below show results for various SVMDA models built on the simple dataset. They are presented with the decision boundary shown as a black contour line, the margin edges shown by blue and red contours, data points which are support vectors marked by an enclosing circle, and data points which lie on the wrong side of the decision boundary (classification errors) marked with an 'x'. The decision boundary represents the zero contour of the decision function, blue and red margin edges represent the -1 and +1 contours of the decision function.<br />
<br />
====Effect of varying cost parameter for SVMDA using RBF kernel====<br />
Fig. 2a-d show the effect of increasing the cost parameter from 0.1 to 100 while gamma is kept fixed at 0.01. When the cost is small (Fig. 2a), the margin is wide since there is a small penalty associated with data points which are within the margin. Note that any point which lies within the margin or on the wrong side of the decision boundary is a support vector. Increasing the cost parameter leads to a narrowing of the margin width and fewer data points remaining within the margin, until cost = 100 (Fig. 2d) where the margin is narrow enough to avoid having any points remain inside it. Further increases in cost have no effect on the margin since no data points remain to be penalized. At the other extreme, when cost is reduced to 0.01 or smaller, the margin expands until it encloses all the data points, so all points are support vectors. This is undesirable since fewer support vectors make a more efficient model when predicting for new data points and reduce the chance of overfitting the data. In this simple case, the separating boundary keeps approximately the same smooth contour as in Fig. 2a for all these cost values, so overfitting is not an issue. If there were more overlap between the red and blue data points then a larger cost parameter would cause the separating boundary to deform more and the margin edges to become much more contorted as the model tries to exclude data points from the margin.<br />
<br />
<br />
<gallery caption="Fig. 2. Effect of varying ''cost'' parameter, with ''gamma'' = 0.01" widths="400px" heights="300px" perrow="2"><br />
File:C0p1g0p01.png|a) ''cost = 0.1''<br />
File:C1g0p01.png|b) ''cost = 1.0''<br />
File:C10g0p01.png|c) ''cost = 10''<br />
File:C100g0p01.png|d) ''cost = 100''<br />
</gallery><br />
<br />
====Effect of varying gamma parameter for SVMDA using RBF kernel====<br />
Fig. 3a-f show the effect of changing the gamma parameter while cost is held fixed at 1.0. These show that gamma has a major effect on how smooth or contorted the decision boundary will be, with smaller values of gamma creating a smoother decision boundary. Fig. 3a shows the decision boundary to be nearly linear, showing that the SVM with RBF kernel tends to the linear kernel solution for gamma values tending towards zero. At large gamma values, however, the decision boundary becomes more contorted and shows how the SVM can over-fit the calibration data. The SVM in Fig. 3f produces a decision boundary which would not be a very good predictor of the class of new test samples.<br />
<br />
<br />
<gallery caption="Fig. 3. Effect of varying ''gamma'' parameter, with ''cost'' = 1.0" widths="400px" heights="300px" perrow="2"><br />
File:C1g0p0001.png|a) ''gamma = 0.0001''<br />
File:C1g0p001.png|b) ''gamma = 0.001''<br />
File:C1g0p01.png|c) ''gamma = 0.01''<br />
File:C1g0p1.png|d) ''gamma = 0.1''<br />
File:C1g1.png|e) ''gamma = 1.0''<br />
File:C1g10.png|f) ''gamma = 10.0''<br />
</gallery><br />
<br />
<br />
In summary, these comparisons show that the gamma parameter controls how smooth the decision boundary will be, with larger gamma producing more complicated boundaries, while the cost parameter controls the width of the separating margin, with larger values of cost making the margin narrower. They both affect the location of the decision boundary.<br />
<br />
====Effect of varying nu parameter for SVMDA using RBF kernel====<br />
Fig. 4a-d show the effect of decreasing the nu parameter from 0.5 to 0.01 while gamma is kept fixed = 0.01. These figures show that decreasing nu has the same effect as was obtained by increasing the cost parameter, that is, it causes the margin width to decrease. It shows how nu is simply a different representation of the cost penalty parameter, and for any value of nu there is a corresponding value of cost which produces the same SVM. The reason for its use is that its value can be interpreted as a lower bound on the number of samples which are support vectors, and also as an upper bound on the number of misclassification errors.<br />
<br />
<br />
<gallery caption="Fig. 4. Effect of varying ''nu'' parameter, with ''gamma'' = 0.01" widths="400px" heights="300px" perrow="2"><br />
File:N0p5g0p01.png|a) ''nu= 0.5''<br />
File:N0p1g0p01.png|b) ''nu = 0.1''<br />
File:N0p02g0p01.png|c) ''nu = 0.02''<br />
File:N0p01g0p01.png|d) ''nu = 0.01''<br />
</gallery><br />
<br />
<br />
{| class="wikitable" border="1" style="text-align:center; width:40%;"<br />
|+ Table 1. Compare nu value to SV fraction<br />
! nu value!! SV fraction !! number of SVs<br />
|-<br />
| 0.5 || 0.505 || 101<br />
|-<br />
| 0.1 || 0.105 || 24<br />
|-<br />
| 0.02 || 0.045 || 9<br />
|-<br />
| 0.01 || 0.020 || 4<br />
|}<br />
Table 1 shows how the value of nu is a lower bound on the support vector fraction (number of SVs/200), and an upper bound on the fraction of training samples which are errors (misclassified) for the SVMs in Fig. 4. The upper bound on the fraction of misclassifications is easily satisfied here because the only misclassifications were three data points in Fig. 4a.<br />
<br />
===Choosing the best SVM parameters===<br />
The recommended technique is to repeatedly test SVMDA using different parameter values and select the parameter combination which gives the best results. For SVMDA using c-svc/nu-svc and an RBF kernel we select ranges of the c/nu and gamma parameters, choosing equi-spaced (or equi-spaced in log) parameters over the ranges. SVMDA using c-svc uses 9 values of c between 0.001 and 100, and 9 values of gamma between 10^-6 and 10 by default, then tests each of these 81 pair combinations. Each test builds a c-svc model on the calibration data using 5-fold cross-validation to produce a mis-classification rate result for that test. These tests are compared over all 81 tests to find which cost/gamma value combination gives the best cross-validation prediction (smallest mis-classification). A similar approach is used for nu-svc where values of nu and gamma are specified.<br />
The results for the best model when using the simple data in Fig. 1 are shown here in Fig. 5 for the c-svc and nu-svc cases. These models were selected by searching over the default parameter ranges for the optimal model. Note, the nu parameter range was extended to smaller values than the default nu range, to include 0.05 and 0.1.<br />
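<br />
The grid search described above can be sketched in plain MATLAB. The sketch assumes the nine default values of each parameter are spaced uniformly in log, and it replaces the real 5-fold cross-validation results with a made-up surface, so only the bookkeeping (building the 81 combinations and picking the smallest misclassification rate) is illustrated.<br />

```matlab
% Assumed log-spaced parameter ranges with 9 values each.
costvals  = logspace(-3, 2, 9);          % 0.001 ... 100
gammavals = logspace(-6, 1, 9);          % 1e-6  ... 10
[C, G] = ndgrid(costvals, gammavals);    % 9 x 9 grid = 81 combinations

% Placeholder for the cross-validated misclassification rate of each
% combination. In a real search every entry would come from a 5-fold
% cross-validation of a c-svc model; this smooth surface is invented.
cvmisclass = 0.05 + 0.02*abs(log10(C) - 1) + 0.03*abs(log10(G) + 2);

% Select the combination with the smallest misclassification rate.
[bestrate, idx] = min(cvmisclass(:));
bestcost  = C(idx);
bestgamma = G(idx);

fprintf('%d combinations tested, best cost = %g, best gamma = %g\n', ...
        numel(C), bestcost, bestgamma);
```

Using ndgrid keeps the cost and gamma values aligned with the entries of the result matrix, so a single linear index recovers both parameters.<br />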
<br />
<br />
<gallery caption="Fig. 5. Optimal SVMDA models for ''c-svc'' and ''nu-svc''" widths="400px" heights="300px" perrow="2"><br />
File:Csvc_opt.png|a) ''Optimal c-svc model. cost = 0.001, gamma = 0.03''<br />
File:Nusvc_opt.png|b) ''Optimal nu-svc model. nu = 0.05, gamma = 0.003''<br />
</gallery><br />
<br />
<br />
The c-svc case in Fig. 5a has a very small cost parameter and all data points are support vectors. The decision boundary looks appropriate but this is not a good solution because of the large support vector fraction. Using an SVM to predict the class of a new sample involves calculating a sum over as many terms as there are support vectors, so an SVM with fewer support vectors will be faster when predicting the class of a new sample. It would therefore be sensible to limit the lower end of the cost parameter range, perhaps to 0.1. It should also be noted that SVMDA can have problems when using a very small cost parameter (or nu very close to 1) while requesting ''probability estimates'', as this can result in bad model predictions for sample class. This problem does not arise when probability estimates are not requested. The next section discusses this problem in more detail. Note that all the models presented in Figs 1-5 were built with ''probability estimates'' disabled. Thus predictions are directly given by which side of the decision boundary the data points lie on.<br />
<br />
====SVM parameter search summary plot====<br />
When SVMDA is run in the Analysis window it is possible to view the results of the SVM parameter search by clicking on the "Variance Captured" plot icon in the toolbar. If there are two SVM parameters with ranges of values, such as cost and gamma, then a figure appears showing the performance of the model plotted against cost and gamma (Fig. 6). The measure of performance used is the misclassification rate, defined as the number of incorrectly classified samples divided by the number of classified samples, based on the cross-validation (CV) predictions for the calibration data. The lowest value of misclassification rate is marked on the plot by an "X" and this indicates the values of the SVM cost and gamma parameters which yield the best performing model. The actual SVMDA model is built using these parameter values. If you are using the command line SVMDA function to build a model then the optimal SVM parameters are shown in model.detail.svm.cvscan.best. If you are using the graphical Analysis SVMDA then the optimal parameters are reported in the summary window which is shown when you mouse-over the model icon, once the model is built.<br />
<br />
If the parameter search summary plot has the "X" marked on the edge of the plot (as in the example shown) then it is possible that re-running the analysis with additional values included for that parameter direction would lead to a more accurate optimal parameter set. For the example shown, this would suggest re-running the analysis with the Cost parameter range including values larger than 100. (However, it is unnecessary in this case since the misclassification error is already zero). Ideally the "X" should occur in the interior of the plot.<br />
<br />
<gallery caption="Fig. 6. Parameter search summary" widths="450px" heights="300px" perrow="1"><br />
File:Svmda_paramsearch.png|CV misclassification rate as a function of SVM parameters.<br />
</gallery><br />
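<br />
The two ideas in this section, the misclassification rate and the check for an optimum on the edge of the search grid, can be illustrated with a short plain MATLAB sketch. All class vectors, parameter values and rates below are made up for illustration and do not come from the figure.<br />

```matlab
% Hypothetical actual and cross-validation predicted classes for 10 samples.
actual    = [1 1 1 1 1 2 2 2 2 2];
predicted = [1 1 2 1 1 2 2 1 2 2];

% Misclassification rate = incorrectly classified / classified samples.
misclassrate = sum(predicted ~= actual) / numel(actual);

% Hypothetical table of CV misclassification rates
% (rows: cost values, columns: gamma values).
costvals  = [1 10 100];
gammavals = [0.001 0.01 0.1 1];
rates = [0.20 0.10 0.15 0.30;
         0.15 0.05 0.10 0.25;
         0.10 0.02 0.08 0.20];

% Locate the best grid point (where the "X" would be drawn).
[bestrate, idx] = min(rates(:));
[r, c] = ind2sub(size(rates), idx);

% The optimum sits on an edge when its row or column is the first or last.
costonedge  = (r == 1) || (r == numel(costvals));
gammaonedge = (c == 1) || (c == numel(gammavals));

fprintf('best cost = %g (on edge: %d), best gamma = %g (on edge: %d)\n', ...
        costvals(r), costonedge, gammavals(c), gammaonedge);
```

In this invented table the best cost is the largest value searched, which is the situation where extending the cost range would be worth considering.<br />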
<br />
===Possible poor prediction from the optimal SVM model===<br />
In support vector classification (SVC) the LIBSVM package allows classification predictions to be derived two different ways.<br />
<br />
1. The standard method is to calculate the decision function for the new sample and simply assign the class label according to the sign of the decision function (in the case of two-class data). This is equivalent to saying the sample's class is determined by which side of the decision boundary it occurs on.<br />
<br />
2. The second method to predict the class of a new sample was developed in order to also provide probabilities of the sample belonging to each possible class ([http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf], section 8, "Probability Estimates"). In this method the new sample is assigned to the class to which it has the highest probability of belonging.<br />
<br />
These two prediction methods produce nearly identical predicted class values but in certain cases there are noticeable differences. Test samples which lie very close to the decision boundary on the +1 side, for example, can be given a predicted class by the second method which identifies them incorrectly as the -1 class.<br />
This discrepancy between the two prediction methods becomes most noticeable when the SVM margin becomes very wide and encloses most data points (which are then support vectors). For the simple two-class data used here this is illustrated in Fig. 7 below by comparing the two prediction methods using a very small cost parameter (the same effect occurs for any gamma value, or with a large nu in nu-svc). Again the color indicates the actual class of data points and a superimposed ''x'' indicates that the predicted class is incorrect for that point. The decision boundary looks reasonable, and the simpler method of identifying class by which side of the decision boundary samples occur on gives good predictions (Fig. 7a, no data points have a superimposed ''x''). The second method (Fig. 7b) completely fails, however, and predicts all samples as belonging to one class: red points are correctly predicted as red, while all blue points are marked with an ''x'' indicating they are predicted incorrectly as being red. One approach to avoid such poor SVMs is to not use SVMs where most calibration samples are support vectors (i.e. the margin is very wide relative to the calibration dataset). The support vector fraction can only be checked after building the SVM, however. If the ''Probability Estimates'' prediction method is used, this problem can be avoided by not using a very small cost parameter value in c-svc (or a very large nu parameter value in nu-svc). The nu value is a lower bound on the support vector fraction, and in practice the actual support vector fraction turns out to be only slightly larger than the nu bounding value. Limiting nu to 0.9 or smaller should avoid this problem; this is equivalent to using larger values for cost in c-svc.<br />
<br />
<br />
<gallery caption="Fig. 7. Effect of enabling ''probability estimates'' for ''c-svc'' SVMDA" widths="400px" heights="300px" perrow="2"><br />
File:probEstsOff.png|a) ''Good c-svc model without prob. estimates. cost = 0.001, gamma = 0.01''<br />
File:probEstsOn.png|b) ''Bad c-svc model with prob. estimates. cost = 0.001, gamma = 0.01''<br />
</gallery><br />
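<br />
How the two prediction methods can disagree near the decision boundary can be sketched in plain MATLAB for the two-class case. The sketch assumes the probability of the +1 class is obtained from a sigmoid of the decision value, P(+1) = 1/(1+exp(A*f+B)), as in the LIBSVM probability estimates. The decision values and the coefficients A and B are made up; in practice they are fitted from the calibration data.<br />

```matlab
% Hypothetical decision function values f for six test samples.
decval = [-2.0 -0.5 -0.05 0.05 0.5 2.0];

% Method 1: class from the sign of the decision function.
classbysign = sign(decval);               % -1 or +1

% Method 2: class from a sigmoid of the decision value.
% A and B are invented here; a non-zero B shifts the 50% probability
% point away from the decision boundary f = 0.
A = -2.0;
B = 0.4;
probpos = 1 ./ (1 + exp(A*decval + B));   % probability of the +1 class

classbyprob = ones(size(decval));
classbyprob(probpos < 0.5) = -1;

% Samples for which the two methods give different classes.
disagree = find(classbysign ~= classbyprob);
disp(decval(disagree));
```

With these invented coefficients the 50% probability point lies at f = 0.2 rather than at f = 0, so a sample just on the +1 side of the decision boundary is assigned to the -1 class by the probability method.<br />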
<br />
===See Also===<br />
<br />
[[analysis]], [[svm]], [[plsda]], [[knn]], [[simca]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Svm&diff=11012Svm2020-02-06T22:07:07Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
SVM Support Vector Machine (LIBSVM) for regression. Use SVMDA for SVM classification ([[Svmda]]). Please also look at the [[Svmda]] page since it has more detailed information much of which also applies to SVM for regression.<br />
<br />
===Synopsis===<br />
<br />
:model = svm(x,y,options); %identifies model (calibration step).<br />
:pred = svm(x,model,options); %makes predictions with a new X-block<br />
:pred = svm(x,y,model,options); %performs a "test" call with a new X-block and known y-values<br />
:svm % Launches an Analysis window with SVM as the selected method.<br />
<br />
Please note that the recommended way to build and apply a SVM model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
The SVM function or analysis method performs calibration and application of Support Vector Machine (SVM) regression models. SVM models can be used for regression problems. The model consists of a number of support vectors (essentially samples selected from the calibration set) and non-linear model coefficients which define the non-linear mapping of variables in the input x-block. The model allows prediction of the continuous y-block variable. It is recommended that classification be done through the svmda function.<br />
<br />
Svm is implemented using the LIBSVM package which provides both epsilon-support vector regression (epsilon-SVR) and nu-support vector regression (nu-SVR). Linear and Gaussian Radial Basis Function kernel types are supported by this function.<br />
<br />
Note: Calling svm with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing numeric values,<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'SVM',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array)<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.svm.model: Matlab version of the libsvm svm_model (Java). Note that the number of support vectors used is given by model.detail.svm.model.l. It is useful to check this because it can indicate overfitting if most of the calibration samples are used as support vectors, or can indicate problems fitting a model if there are no support vectors (in which case all prediction values will equal a constant value, a weighted mean).<br />
*** model.detail.svm.cvscan: Results of CV parameter scan<br />
*** model.detail.svm.svindices: Indices of X-block samples which are support vectors.<br />
<br />
* '''pred''' a structure, similar to '''model''' for the new data.<br />
<br />
===Options===<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''' [ 'none' | {'final'} ], governs level of plotting,<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the SVM model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the svmtype). Compression can make the SVM more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.<br />
* '''algorithm''': [ 'libsvm' ] algorithm to use. libsvm is default and currently only option.<br />
* '''kerneltype''': [ 'linear' | {'rbf'} ], SVM kernel to use. 'rbf' is default.<br />
* '''svmtype''': [ {'epsilon-svr'} | 'nu-svr' ] Type of SVM to apply. The default is 'epsilon-svr' for regression.<br />
* '''probabilityestimates''': [0| {1} ], whether to train the SVR model for probability estimates, 0 or 1 (default 1).<br />
<br />
* '''cvtimelimit''': Set a time limit (seconds) on individual cross-validation sub-calculation when searching over supplied SVM parameter ranges for optimal parameters. Only relevant if parameter ranges are used for SVM parameters such as cost, epsilon, gamma or nu. Default is 10;<br />
* '''splits''': Number of subsets to divide data into when applying n-fold cross validation. Default is 5. This option is only used when the "cvi" option is empty.<br />
* '''cvi''': {{}} Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for the error estimate on the final selected parameter values. If empty (the default), then random cross-validation is done based on the "splits" option.<br />
<br />
* '''gamma''': Value(s) to use for LIBSVM kernel gamma parameter. Default is 15 values from 10^-6 to 10, spaced uniformly in log.<br />
* '''cost''': Value(s) to use for LIBSVM 'c' parameter. Default is 11 values from 10^-3 to 100, spaced uniformly in log.<br />
* '''epsilon''': Value(s) to use for LIBSVM 'p' parameter (epsilon in loss function). Default is the set of values [1.0, 0.1, 0.01].<br />
* '''nu''': Value(s) to use for LIBSVM 'n' parameter (nu of nu-SVC, and nu-SVR). Default is the set of values [0.2, 0.5, 0.8].<br />
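<br />
The default parameter ranges listed above can be written out in plain MATLAB to see how large the default search grid is. This sketch assumes that every combination of the relevant parameters is tested; the variable names are chosen here for illustration.<br />

```matlab
% Default ranges as described in the options list above.
gammavals   = logspace(-6, 1, 15);   % 15 values from 1e-6 to 10, uniform in log
costvals    = logspace(-3, 2, 11);   % 11 values from 1e-3 to 100, uniform in log
epsilonvals = [1.0 0.1 0.01];
nuvals      = [0.2 0.5 0.8];

% Number of parameter combinations for epsilon-svr, assuming a full grid.
ncombos_rbf    = numel(gammavals) * numel(costvals) * numel(epsilonvals);
ncombos_linear = numel(costvals) * numel(epsilonvals);   % gamma is not used

fprintf('rbf kernel: %d combinations, linear kernel: %d combinations\n', ...
        ncombos_rbf, ncombos_linear);
```

Each combination is cross-validated, so supplying shorter vectors (or single values) for these options reduces the search time roughly in proportion to the number of combinations.<br />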
<br />
===Algorithm===<br />
Svm uses the LIBSVM implementation using the user-specified values for the LIBSVM parameters (see ''options'' above). See [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf] for further details of these options. <br />
<br />
The default SVM parameters cost, epsilon, nu and gamma have value ranges rather than single values. This svm function uses a search over the grid of appropriate parameters using cross-validation to select the optimal SVM parameter values and builds an SVM model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however. If you are using the command line SVM function to build a model then the optimal SVM parameters are shown in model.detail.svm.cvscan.best. If you are using the graphical Analysis SVM then the optimal parameters are reported in the summary window which is shown when you mouse-over the model icon, once the model is built.<br />
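<br />
A command line call following the synopsis above might look like the sketch below. It requires PLS_Toolbox and uses made-up data. The call svm('options') to obtain a default options structure is an assumption based on the usual toolbox convention and should be checked against your installed version; the option and model field names are those listed on this page.<br />

```matlab
% Sketch only: requires PLS_Toolbox on the MATLAB path.
x = rand(50, 4);                                  % made-up X-block
y = x*[1; -2; 0.5; 0] + 0.05*randn(50, 1);        % made-up y-block

options = svm('options');        % assumed way to get the default options
options.display    = 'off';
options.plots      = 'none';
options.kerneltype = 'rbf';
options.svmtype    = 'epsilon-svr';
options.cost       = [0.1 1 10 100];     % ranges trigger the grid search
options.gamma      = [0.001 0.01 0.1];
options.epsilon    = 0.1;                % single value, not searched

model = svm(x, y, options);              % calibration step

% Optimal parameters found by the cross-validated search, and the
% number of support vectors in the final model.
disp(model.detail.svm.cvscan.best);
disp(model.detail.svm.model.l);

pred = svm(x, model, options);           % apply the model to an X-block
```

Passing single values for cost, gamma and epsilon instead of vectors would skip the grid search and build one model directly.<br />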
<br />
====Model building performance====<br />
Building a single SVM model can sometimes be slow, especially if the calibration dataset is large. Using ranges for the SVM parameters to search for the optimal parameter combination increases the final model building time significantly. If cross-validation is used the calculation is again increased, possibly dramatically if the number of CV subsets is large. Some suggestions for faster SVM building include: <br />
:1) Turning CV off ("none") during preliminary analyses. This is MUCH faster and cross-validation is still performed using a default "Random Subsets" with 5 data splits and 1 iteration,<br />
:2) Using a coarse grid of SVM parameter values to search over for optimal values, <br />
:3) Choosing the CV method carefully, at least initially. For example, use "Random Subsets" with a small number of data splits (e.g. 5) and a small "Number of Iterations" (e.g. 1).<br />
:4) Using the compression option if the number of variables is large.<br />
<br />
====epsilon-SVR and nu-SVR====<br />
There are two commonly used versions of SVM regression, 'epsilon-SVR' and 'nu-SVR'. The original SVM formulation for regression (SVR) used parameters C [0, inf) and epsilon [0, inf) to apply a penalty to the optimization for points which were not correctly predicted. An alternative version of SVM regression was later developed where the epsilon penalty parameter was replaced by an alternative parameter, nu (0, 1], which applies a slightly different penalty. The main motivation for the nu version of SVM is that it has a more meaningful interpretation. This is because nu represents an upper bound on the fraction of training samples which are errors (badly predicted) and a lower bound on the fraction of samples which are support vectors. Some users feel nu is more intuitive to use than C or epsilon.<br />
Epsilon or nu are just different versions of the penalty parameter. The same optimization problem is solved in either case. Thus it should not matter which form of SVM you use, epsilon or nu. PLS_Toolbox uses epsilon since this was the original formulation and is the most commonly used form. For more details on 'nu' SVM regression see [http://www.csie.ntu.edu.tw/~cjlin/papers/newsvr.pdf]<br />
<br />
The user must provide parameters (or parameter ranges) for SVM regression as:<br />
:*'epsilon-SVR':<br />
::'''epsilon''','''C''', (using linear kernel), or<br />
::'''epsilon''','''C''', '''gamma''' (using radial basis function kernel),<br />
<br />
:*'nu-SVR':<br />
::'''nu''', '''C''', (using linear kernel), or<br />
::'''nu''', '''C''', '''gamma''' (using radial basis function kernel),<br />
<br />
====SVM Parameters====<br />
<br />
* '''cost''': Cost [0 ->inf] represents the penalty associated with errors larger than epsilon. Increasing cost value causes closer fitting to the calibration/training data.<br />
* '''gamma''': Kernel ''gamma'' parameter controls the shape of the separating hyperplane. Increasing gamma usually increases number of support vectors.<br />
* '''epsilon''': In training the regression function there is no penalty associated with points which are predicted within distance epsilon from the actual value. Decreasing epsilon forces closer fitting to the calibration/training data.<br />
* '''nu''': Nu (0 -> 1] indicates a lower bound on the number of support vectors to use, given as a fraction of total calibration samples, and an upper bound on the fraction of training samples which are errors (poorly predicted).<br />
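<br />
The role of epsilon can be illustrated with the standard epsilon-insensitive loss, in which residuals smaller than epsilon carry no penalty. The plain MATLAB sketch below uses made-up measured and predicted values and is not taken from the toolbox code.<br />

```matlab
% Hypothetical measured and predicted y values for four samples.
y       = [1.00 2.00 3.00 4.00];
yhat    = [1.05 2.30 2.95 3.40];
epsilon = 0.1;

% Absolute residuals.
resid = abs(y - yhat);

% Epsilon-insensitive loss: zero inside the tube of half-width epsilon,
% growing linearly with the residual outside it.
loss = max(0, resid - epsilon);

% Samples outside the tube are the ones that are penalized.
penalized = resid > epsilon;

disp([resid; loss; penalized]);
```

Decreasing epsilon narrows the tube, so more samples are penalized and the fit follows the calibration data more closely, as stated in the parameter description.<br />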
<br />
===See Also===<br />
<br />
[[analysis]], [[ann]], [[mlr]], [[lwr]], [[pls]], [[pcr]], [[svmda]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Annda&diff=11011Annda2020-02-06T22:06:20Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Predictions based on Artificial Neural Network (ANNDA) classification models.<br />
ANNDA Artificial Neural Network (ANNDA) for classification. Use ANN for Artificial Neural Network regression ([[Ann]]).<br />
<br />
===Synopsis===<br />
: annda - Launches an Analysis window with ANNDA as the selected method. <br />
: [model] = annda(x, opts); <br />
: [model] = annda(x,y,options);<br />
: [model] = annda(x,y, nhid, options);<br />
: [pred] = annda(x,model,options);<br />
: [valid] = annda(x,model,options);<br />
: [valid] = annda(x,y,model,options); <br />
<br />
Please note that the recommended way to build and apply an ANNDA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Build an ANNDA model from input dataset X, or input X and Y if classes are in Y, using the specified number of layers and layer nodes. <br />
Alternatively, if a model is passed in ANNDA makes a prediction for an input test X block. The ANNDA model <br />
contains quantities (weights etc) calculated from the calibration data. When a model structure is passed in <br />
to ANNDA then these weights do not need to be calculated. <br />
<br />
There are two implementations of ANNDA available referred to as 'BPN' and 'Encog'. <br />
:BPN is a feedforward ANN using backpropagation training and is implemented in Matlab.<br />
:Encog is a feedforward ANN using Resilient Backpropagation training. See [http://en.wikipedia.org/wiki/Rprop Rprop] for further details. <br />
Encog is implemented using the Encog framework [http://www.heatonresearch.com/encog Encog] provided by <br />
Heaton Research, Inc, under the Apache 2.0 license. Further details of Encog Neural Network features are <br />
available at [http://www.heatonresearch.com/wiki/Main_Page#Encog_Documentation Encog Documentation]. <br />
BPN is the ANN version used by default but the user can specify the option 'algorithm' = 'encog' to use Encog instead. <br />
Both implementations should give similar results but one may be faster than the other for different datasets. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (optional) class "double" sample class values,<br />
* '''nhid''' = number of nodes in a single hidden layer ANN, or a vector of two numbers, indicating a two hidden layer ANN, representing the number of nodes in the two hidden layers (this takes precedence over options nhid1 and nhid2),<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'ANNDA',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array)<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.ann.W: Structure containing details of the ANN, including the ANN type, number of hidden layers and the weights.<br />
<br />
* '''pred''' a structure, similar to '''model''' for the new data.<br />
<br />
====Training Termination====<br />
The ANN is trained on a calibration dataset to minimize prediction error, RMSEC. It is important not to overtrain, however, so some criteria for ending training are needed.<br />
<br />
BPN determines the optimal number of learning iteration cycles by selecting the minimum RMSECV based on the calibration data over a range of learning iteration values (1 to options.learncycles). The cross-validation used is determined by option cvi, or else by cvmethod. If neither of these is specified then the minimum RMSEP using a single subset of samples from a 5-fold random split of the calibration data is used. This RMSECV value is based on pre-processed, scaled values and so it is not saved in the model.rmsecv field. Apply cross-validation (see below) to add this information to the model.<br />
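<br />
The selection of the number of learning cycles described above amounts to picking the minimum of an RMSECV curve. The plain MATLAB sketch below uses an invented curve in place of real cross-validation results, so it only illustrates the selection step and is not the BPN implementation.<br />

```matlab
learncycles = 20;                 % upper limit, as in options.learncycles
cycles = 1:learncycles;

% Made-up RMSECV curve: the error first drops as the network learns,
% then rises slowly as it starts to overtrain.
rmsecv = 0.5*exp(-cycles/4) + 0.01*cycles + 0.2;

% The number of learning cycles kept is the one with the smallest RMSECV.
[minrmsecv, bestcycles] = min(rmsecv);

fprintf('best number of learning cycles: %d (RMSECV = %.4f)\n', ...
        bestcycles, minrmsecv);
```

If the minimum falls at the last value tested, a larger learncycles limit may be worth trying.<br />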
<br />
Encog training terminates whenever either a) RMSE becomes smaller than the option 'terminalrmse' value, or b) the rate of improvement of RMSE per 100 training iterations <br />
becomes smaller than the option 'terminalrmserate' value, or c) time exceeds the option 'maxseconds' value (though results are not optimal if training is stopped prematurely by this time limit). <br />
Note these RMSE values refer to the internal preprocessed and scaled y values.<br />
<br />
====Cross-validation====<br />
Cross-validation can be applied to ANN when using either the ANN Analysis window or the command line. From the Analysis window specify the cross-validation method in the usual way (clicking on the model icon's red check-mark, or the "Choose Cross-Validation" link in the flowchart). In the cross-validation window the "Maximum Number of Nodes" specifies how many hidden-layer 1 nodes to test over. Viewing RMSECV versus number of hidden-layer 1 nodes (toolbar icon to left of Scores Plot) is useful for choosing the number of layer 1 nodes. From the command line use the crossval method to add cross-validation information to an existing model.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
* '''display''' : [ 'off' |{'on'}] Governs display<br />
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.<br />
* '''blockdetails''' : [ {'standard'} | 'all' ] extent of detail included in model. 'standard' keeps only y-block, 'all' keeps both x- and y- blocks.<br />
* '''waitbar''' : [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.<br />
* '''algorithm''' : [{'bpn'} | 'encog'] ANN implementation to use.<br />
* '''nhid1''' : [{2}] Number of nodes in first hidden layer.<br />
* '''nhid2''' : [{0}] Number of nodes in second hidden layer.<br />
* '''learnrate''' : [0.125] ANN backpropagation learning rate (bpn only).<br />
* '''learncycles''' : [20] Number of ANN learning iterations (bpn only).<br />
* '''terminalrmse''' : [0.05] Termination RMSE value (of scaled y) for ANN iterations (encog only).<br />
* '''terminalrmserate''' : [1.e-9] Termination rate of change of RMSE per 100 iterations (encog only).<br />
* '''maxseconds''' : [{20}] Maximum duration of ANN training in seconds (encog only).<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the ANNDA model. 'pca' uses a simple PCA model to compress the information. 'pls' uses a pls model. Compression can make the ANNDA more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''': [{'yes'} | 'no'] Use Mahalanobis Distance corrected.<br />
* '''cvmethod''' : [{'con'} | 'vet' | 'loo' | 'rnd'] CV method, OR [] for Kennard-Stone single split.<br />
* '''cvsplits''' : [{5}] Number of CV subsets.<br />
* '''cvi''' : ''M'' element vector with integer elements allowing user defined subsets. (cvi) is a vector with the same number of elements as x has rows i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:<br />
::cvi(i) = -2 the sample is always in the test set,<br />
::cvi(i) = -1 the sample is always in the calibration set,<br />
::cvi(i) = 0 the sample is never used, and<br />
::cvi(i) = 1,2,3... defines each test subset.<br />
* '''activationfunction''' : For the default algorithm, 'bpn', this option uses a 'sigmoid' activation function, f(x) = 1/(1+exp(-x)). For the 'encog' algorithm this activationfunction option has two choices, 'tanh' as default, or 'sigmoid'.<br />
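<br />
The cvi coding listed above can be sketched in plain MATLAB. This is only an illustration under assumptions: the eight-sample vector is made up, and the variable names (alwaysTest, alwaysCal, unused, testsets) are hypothetical, not toolbox fields.<br />
<br />
```matlab
% Hypothetical cvi for 8 calibration samples (values chosen for illustration).
cvi = [1 2 3 1 2 3 -1 0];

alwaysTest = find(cvi == -2);   % samples always in the test set
alwaysCal  = find(cvi == -1);   % samples always in the calibration set
unused     = find(cvi ==  0);   % samples never used
nsubsets   = max(cvi);          % number of user-defined test subsets

% Collect the sample indices belonging to each test subset.
testsets = cell(1, nsubsets);
for k = 1:nsubsets
    testsets{k} = find(cvi == k);
end

% To use it, the vector would be passed as options.cvi;
% length(cvi) must equal size(x,1).
```
<br />
Here samples 1 and 4 form the first test subset, sample 7 stays in every calibration set, and sample 8 is left out entirely.<br />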
<br />
===Additional information on the ‘BPN’ ANNDA implementation===<br />
The “BPN” implementation of ANNDA is a conventional feedforward back-propagation neural network where the weights are updated, or ‘trained’, so as to reduce the magnitude of the prediction error, except that the gradient-descent method of updating the weights is different from the usual “delta rule” approach. In the traditional delta-rule method the weights are changed at each increment of training time by a constant fraction of the contributing error gradient terms, leading to a reduced prediction error. In this “BPN” implementation the search for optimal weights by gradient-descent is treated as a continuous system, rather than incremental. The evolution of the weights with respect to training time is solved as a set of differential equations using a solver appropriate for systems where the solution (weights) may involve very different timescales. Most weights evolve slowly towards their final values but some weights may have periods of faster change. A reference paper for the BPN implementation is:<br />
<br />
Owens A J and Filkin D L 1989 Efficient training of the back propagation network by solving a system of stiff<br />
ordinary differential equations Proc. Int. Joint Conf. on Neural Networks vol II (IEEE Press) pp 381–6.<br />
<br />
====Algorithm parameters: learncycles and learnrate====<br />
This BPN technique results in much faster training than with the traditional delta-rule approach. The training is governed by two parameters, ‘learncycles’ and ‘learnrate’. The learnrate parameter specifies the training time duration of the first learncycle. Each subsequent learncycle’s time duration is twice the previous learncycle’s duration. The performance of the ANN is evaluated at the end of each learncycle interval by calculating the cross-validation prediction error, RMSECV. The RMSECV initially decreases rapidly with training time but eventually starts to increase again as the ANN begins to overfit the data. The number of training cycles which yields the minimum RMSECV therefore provides an estimate of the optimal ANN training duration, for the given learnrate value. The ANN model contains these RMSECV values in model.detail.ann.rmsecviter, and the optimal, minimum RMSECV occurs at index model.detail.ann.niter, which will be smaller than or equal to the learncycles value. It is useful to check rmsecviter to see if a minimum RMSECV has been attained, but also to see if you are using too many learn cycles. Reducing the number of learncycles can significantly speed up ANN training.<br />
Note, the model.detail.ann.rmsecviter values are only used to pick the optimal number of learncycles. These rmsecviter values are calculated using scaled y and should not be compared to the reported RMSEC, RMSECV or RMSEP.<br />
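<br />
The doubling schedule and the choice of the optimal cycle can be sketched in plain MATLAB. This is a rough illustration, not toolbox code: the rmsecviter values are made up, "training time" is the integration variable of the weight equations rather than wall-clock seconds, and the variables duration, cumulative and niter are hypothetical names.<br />
<br />
```matlab
learnrate   = 0.125;   % default from the Options list
learncycles = 20;      % default from the Options list

% The first cycle lasts learnrate; each later cycle is twice as long
% as the one before it: learnrate * 2^(k-1) for cycle k.
duration   = learnrate * 2.^(0:learncycles-1);
cumulative = cumsum(duration);   % total training time after each cycle

% Hypothetical RMSECV curve (made-up values): it falls at first and
% then rises again as the network starts to overfit.
rmsecviter = [0.90 0.61 0.44 0.35 0.31 0.30 0.32 0.36];

% The index of the minimum plays the role of model.detail.ann.niter.
[minrmse, niter] = min(rmsecviter);
```
<br />
Because the cycle durations double, the last few cycles dominate the total training time, which is why trimming an unnecessarily large learncycles value saves so much.<br />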
<br />
====Usage from ANNDA Analysis window====<br />
<br />
The command line function “annda” has input parameter “nhid” specifying the number of nodes in the hidden layer(s) and builds the optimal model for that network. When using the ANNDA Analysis window, however, it is possible to specify a scan over a range of hidden layer nodes to use. This is enabled by setting the “Maximum number of Nodes” value in the cross-validation window. This only works for BPN ANNDAs having a single hidden layer. This causes ANNDA models to be built for the range of hidden layer nodes up to the specified number and the resulting RMSECV plotted versus the number of nodes is shown by clicking on the “Plot cross-validation results” plot icon in the ANNDA Analysis window’s toolbar. This can be useful for deciding how many nodes to use. Note that this plot is only advisory. The resulting model is built with the input parameter number of nodes, ‘nhid’, and its model.detail.rmsecv value relates to this number of nodes. It is important to check for the optimal number of nodes to use in the ANN, but this feature can greatly lengthen the time taken to build the ANNDA model, so the “Maximum number of Nodes” value should be set back to 1 once the number of hidden nodes is decided.<br />
<br />
====Summary of model building speed-up settings====<br />
<br />
=====From the Analysis window:=====<br />
ANNDA in PLS_Toolbox or Solo version 8.2 can be very slow if you use cross-validation (CV). This is mostly due to the CV settings window also specifying a test to find the optimal number of hidden layer 1 nodes, testing ANN models with 1, 2, …,20 nodes, each with CV. This is set by the top slider field “Maximum Number of Nodes L1”. For example, if you want to build an ANN model with 4 layer 1 nodes (using the “ANNDA Settings” field) but leave the CV settings window’s top slider set = 20, then you will actually build 20 models, each with CV, and save the RMSECV from each. This can be very slow, especially for the models with many nodes.<br />
<br />
To make ANNDA perform faster it is recommended that you drag this CV window’s “Maximum Number of Nodes L1” slider to the left, setting = 1, unless you really want to see the results of such a parameter search over the range specified by this slider. This is the default in PLS_Toolbox and Solo versions after version 8.2. The RMSECV versus number of Layer 1 Nodes can be seen by clicking on the “Plot cross-validation results” icon (next to the Scores Plot icon).<br />
<br />
Summary: To make ANNDA perform faster:<br />
<br />
1. Move the top CV slider to the left, setting value = 1.<br />
<br />
2. Turn CV off or use a small number of CV splits.<br />
<br />
3. Choose to use a small number of L1 nodes in the ANNDA settings window.<br />
<br />
4. Don't use 2 hidden layers. This is very slow.<br />
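<br />
A back-of-the-envelope count shows why the top slider matters. The sketch below is plain MATLAB and rests on a stated assumption that is not taken from the toolbox: each node count is fitted once per CV split plus once on the full calibration set. It also ignores that networks with more nodes take longer to train, so the real difference is larger.<br />
<br />
```matlab
cvsplits = 5;   % default number of CV subsets

% Assumption for illustration: one fit per CV split plus one full fit
% for every node count from 1 up to the slider value.
fitsfor = @(maxnodes) maxnodes * (cvsplits + 1);

nslow = fitsfor(20);   % slider left at 20
nfast = fitsfor(1);    % slider set to 1
ratio = nslow / nfast; % how many times more networks are trained
```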
<br />
=====From the command line=====<br />
1. Initially build ANNDA without cross-validation so as to decide on values for learnrate and learncycles by examining where the minimum value of model.detail.ann.rmsecviter occurs versus learncycles. Note this uses a single-split CV to estimate rmsecv when the ANNDA cross-validation is set as "None". It is inefficient to use a larger than necessary value for option "learncycles".<br />
<br />
2. Determine the number of hidden layer nodes to use by building a range of models with different numbers of nodes, nhid1, nhid2. If using the ANNDA Analysis window and the ANN has a single hidden layer then this can be done conveniently by using the “Maximum number of Nodes L1” setting in the cross-validation settings window. It is best to use a simple cross-validation at this survey stage, with a small number of splits and iterations.<br />
<br />
===See Also===<br />
<br />
[[annda]], [[analysis]], [[crossval]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Ann&diff=11010Ann2020-02-06T21:58:43Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Predictions based on Artificial Neural Network (ANN) regression models.<br />
<br />
===Synopsis===<br />
: ann - Launches an Analysis window with ANN as the selected method. <br />
: [model] = ann(x,y,options);<br />
: [model] = ann(x,y, nhid, options);<br />
: [pred] = ann(x,model,options);<br />
: [valid] = ann(x,y,model,options);<br />
<br />
Please note that the recommended way to build and apply an ANN model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]].<br />
<br />
===Description===<br />
<br />
Build an ANN model from input X and Y block data using the specified number of layers and layer nodes. <br />
Alternatively, if a model is passed in ANN makes a Y prediction for an input test X block. The ANN model <br />
contains quantities (weights etc) calculated from the calibration data. When a model structure is passed in <br />
to ANN then these weights do not need to be calculated. <br />
<br />
There are two implementations of ANN available referred to as 'BPN' and 'Encog'. <br />
:BPN is a feedforward ANN using backpropagation training and is implemented in Matlab.<br />
:Encog is a feedforward ANN using Resilient Backpropagation training. See [http://en.wikipedia.org/wiki/Rprop Rprop] for further details. <br />
Encog is implemented using the Encog framework [http://www.heatonresearch.com/encog Encog] provided by <br />
Heaton Research, Inc, under the Apache 2.0 license. Further details of Encog Neural Network features are <br />
available at [http://www.heatonresearch.com/wiki/Main_Page#Encog_Documentation Encog Documentation]. <br />
BPN is the ANN version used by default but the user can specify the option 'algorithm' = 'encog' to use Encog instead. <br />
Both implementations should give similar results but one may be faster than the other for different datasets. <br />
BPN is currently the only version which calculates RMSECV.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing numeric values,<br />
* '''nhid''' = number of nodes in a single hidden layer ANN, or a vector of two numbers giving the number of nodes in each layer of a two hidden layer ANN (this takes precedence over options nhid1 and nhid2),<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'ANN',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array)<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.ann.W: Structure containing details of the ANN, including the ANN type, number of hidden layers and the weights.<br />
<br />
* '''pred''' = a structure, similar to '''model''', for the new data.<br />
<br />
====Training Termination====<br />
The ANN is trained on a calibration dataset to minimize the prediction error, RMSEC. It is important not to overtrain, however, so some criteria for ending training are needed.<br />
<br />
BPN determines the optimal number of learning iteration cycles by selecting the minimum RMSECV based on the calibration data over a range of learning iteration values (1 to options.learncycles). The cross-validation used is determined by option cvi, or else by cvmethod. If neither of these is specified then the minimum RMSEP using a single subset of samples from a 5-fold random split of the calibration data is used. This RMSECV value is based on pre-processed, scaled values and so it is not saved in the model.rmsecv field. Apply cross-validation (see below) to add this information to the model.<br />
<br />
Encog training terminates whenever either a) RMSE becomes smaller than the option 'terminalrmse' value, or b) the rate of improvement of RMSE per 100 training iterations <br />
becomes smaller than the option 'terminalrmserate' value, or c) time exceeds the option 'maxseconds' value (though results are not optimal if training is stopped prematurely by this time limit). <br />
Note these RMSE values refer to the internal preprocessed and scaled y values.<br />
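<br />
The three Encog stopping rules can be sketched as a simple loop in plain MATLAB. This is a sketch under assumptions, not Encog's implementation: the RMSE curve (rmsefun) is made up, and the way the per-100-iteration improvement is measured here is a guess at how criterion b) might be evaluated.<br />
<br />
```matlab
terminalrmse     = 0.05;   % defaults from the Options list
terminalrmserate = 1e-9;
maxseconds       = 20;

% Made-up RMSE curve standing in for the training error (not Encog output).
rmsefun = @(it) 0.04 + 0.5*exp(-it/200);

reason = '';
t0 = tic;
it = 0;
while isempty(reason)
    it = it + 1;
    rmse = rmsefun(it);
    if rmse < terminalrmse
        reason = 'terminalrmse';          % a) RMSE small enough
    elseif it > 100 && (rmsefun(it-100) - rmse) < terminalrmserate
        reason = 'terminalrmserate';      % b) improvement per 100 iterations too small
    elseif toc(t0) > maxseconds
        reason = 'maxseconds';            % c) time limit reached
    end
end
```
<br />
With this particular curve the loop stops on criterion a); a curve that levels off above terminalrmse would instead stop on b), or on c) if it improved very slowly.<br />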
<br />
====Cross-validation====<br />
Cross-validation can be applied to ANN when using either the ANN Analysis window or the command line. From the Analysis window specify the cross-validation method in the usual way (clicking on the model icon's red check-mark, or the "Choose Cross-Validation" link in the flowchart). In the cross-validation window the "Maximum Number of Nodes" specifies how many hidden-layer 1 nodes to test over. Viewing RMSECV versus number of hidden-layer 1 nodes (toolbar icon to left of Scores Plot) is useful for choosing the number of layer 1 nodes. From the command line use the crossval method to add cross-validation information to an existing model.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
* '''display''' : [ 'off' |{'on'}] Governs display<br />
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.<br />
* '''blockdetails''' : [ {'standard'} | 'all' ] extent of detail included in model. 'standard' keeps only y-block, 'all' keeps both x- and y- blocks.<br />
* '''waitbar''' : [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.<br />
* '''algorithm''' : [{'bpn'} | 'encog'] ANN implementation to use.<br />
* '''nhid1''' : [{2}] Number of nodes in first hidden layer.<br />
* '''nhid2''' : [{0}] Number of nodes in second hidden layer.<br />
* '''learnrate''' : [0.125] ANN backpropagation learning rate (bpn only).<br />
* '''learncycles''' : [20] Number of ANN learning iterations (bpn only).<br />
* '''terminalrmse''' : [0.05] Termination RMSE value (of scaled y) for ANN iterations (encog only).<br />
* '''terminalrmserate''' : [1.e-9] Termination rate of change of RMSE per 100 iterations (encog only).<br />
* '''maxseconds''' : [{20}] Maximum duration of ANN training in seconds (encog only).<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the ANN model. 'pca' uses a simple PCA model to compress the information. 'pls' uses a pls model. Compression can make the ANN more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''': [{'yes'} | 'no'] Use Mahalanobis Distance corrected.<br />
* '''cvmethod''' : [{'con'} | 'vet' | 'loo' | 'rnd'] CV method, OR [] for Kennard-Stone single split.<br />
* '''cvsplits''' : [{5}] Number of CV subsets.<br />
* '''cvi''' : ''M'' element vector with integer elements allowing user defined subsets. (cvi) is a vector with the same number of elements as x has rows i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:<br />
::cvi(i) = -2 the sample is always in the test set,<br />
::cvi(i) = -1 the sample is always in the calibration set,<br />
::cvi(i) = 0 the sample is never used, and<br />
::cvi(i) = 1,2,3... defines each test subset.<br />
* '''activationfunction''' : For the default algorithm, 'bpn', this option uses a 'sigmoid' activation function, f(x) = 1/(1+exp(-x)). For the 'encog' algorithm this activationfunction option has two choices, 'tanh' as default, or 'sigmoid'.<br />
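<br />
The two activation functions named above can be compared directly in plain MATLAB. This is only an illustration of the formulas; the anonymous function sigmoid and the test points are hypothetical and not part of the toolbox.<br />
<br />
```matlab
% Sigmoid as given in the option description: f(x) = 1/(1+exp(-x)).
sigmoid = @(x) 1 ./ (1 + exp(-x));

x = [-2 0 2];
s = sigmoid(x);   % values lie in (0,1)
t = tanh(x);      % values lie in (-1,1)

% tanh is a shifted and rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1.
d = max(abs(t - (2*sigmoid(2*x) - 1)));
```
<br />
The main practical difference is the output range: sigmoid nodes are centred on 0.5, tanh nodes on 0.<br />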
<br />
===Additional information on the ‘BPN’ ANN implementation===<br />
The “BPN” implementation of ANN is a conventional feedforward back-propagation neural network where the weights are updated, or ‘trained’, so as to reduce the magnitude of the prediction error, except that the gradient-descent method of updating the weights is different from the usual “delta rule” approach. In the traditional delta-rule method the weights are changed at each increment of training time by a constant fraction of the contributing error gradient terms, leading to a reduced prediction error. In this “BPN” implementation the search for optimal weights by gradient-descent is treated as a continuous system, rather than incremental. The evolution of the weights with respect to training time is solved as a set of differential equations using a solver appropriate for systems where the solution (weights) may involve very different timescales. Most weights evolve slowly towards their final values but some weights may have periods of faster change. A reference paper for the BPN implementation is:<br />
<br />
Owens A J and Filkin D L 1989 Efficient training of the back propagation network by solving a system of stiff<br />
ordinary differential equations Proc. Int. Joint Conf. on Neural Networks vol II (IEEE Press) pp 381–6.<br />
<br />
====Algorithm parameters: learncycles and learnrate====<br />
This BPN technique results in much faster training than with the traditional delta-rule approach. The training is governed by two parameters, ‘learncycles’ and ‘learnrate’. The learnrate parameter specifies the training time duration of the first learncycle. Each subsequent learncycle’s time duration is twice the previous learncycle’s duration. The performance of the ANN is evaluated at the end of each learncycle interval by calculating the cross-validation prediction error, RMSECV. The RMSECV initially decreases rapidly with training time but eventually starts to increase again as the ANN begins to overfit the data. The number of training cycles which yields the minimum RMSECV therefore provides an estimate of the optimal ANN training duration, for the given learnrate value. The ANN model contains these RMSECV values in model.detail.ann.rmsecviter, and the optimal, minimum RMSECV occurs at index model.detail.ann.niter, which will be smaller than or equal to the learncycles value. It is useful to check rmsecviter to see if a minimum RMSECV has been attained, but also to see if you are using too many learn cycles. Reducing the number of learncycles can significantly speed up ANN training.<br />
Note, the model.detail.ann.rmsecviter values are only used to pick the optimal number of learncycles. These rmsecviter values are calculated using scaled y and should not be compared to the reported RMSEC, RMSECV or RMSEP.<br />
<br />
====Usage from ANN Analysis window====<br />
<br />
The command line function “ann” has input parameter “nhid” specifying the number of nodes in the hidden layer(s) and builds the optimal model for that network. When using the ANN Analysis window, however, it is possible to specify a scan over a range of hidden layer nodes to use. This is enabled by setting the “Maximum number of Nodes” value in the cross-validation window. This only works for BPN ANNs having a single hidden layer. This causes ANN models to be built for the range of hidden layer nodes up to the specified number and the resulting RMSECV plotted versus the number of nodes is shown by clicking on the “Plot cross-validation results” plot icon in the ANN Analysis window’s toolbar. This can be useful for deciding how many nodes to use. Note that this plot is only advisory. The resulting model is built with the input parameter number of nodes, ‘nhid’, and its model.detail.rmsecv value relates to this number of nodes. It is important to check for the optimal number of nodes to use in the ANN, but this feature can greatly lengthen the time taken to build the ANN model, so the “Maximum number of Nodes” value should be set back to 1 once the number of hidden nodes is decided.<br />
<br />
====Summary of model building speed-up settings====<br />
<br />
=====From the Analysis window:=====<br />
ANN in PLS_Toolbox or Solo version 8.2 can be very slow if you use cross-validation (CV). This is mostly due to the CV settings window also specifying a test to find the optimal number of hidden layer 1 nodes, testing ANN models with 1, 2, …,20 nodes, each with CV. This is set by the top slider field “Maximum Number of Nodes L1”. For example, if you want to build an ANN model with 4 layer 1 nodes (using the “ANN Settings” field) but leave the CV settings window’s top slider set = 20, then you will actually build 20 models, each with CV, and save the RMSECV from each. This can be very slow, especially for the models with many nodes.<br />
<br />
To make ANN perform faster it is recommended that you drag this CV window’s “Maximum Number of Nodes L1” slider to the left, setting = 1, unless you really want to see the results of such a parameter search over the range specified by this slider. This is the default in PLS_Toolbox and Solo versions after version 8.2. The RMSECV versus number of Layer 1 Nodes can be seen by clicking on the “Plot cross-validation results” icon (next to the Scores Plot icon).<br />
<br />
Summary: To make ANN perform faster:<br />
<br />
1. Move the top CV slider to the left, setting value = 1.<br />
<br />
2. Turn CV off or use a small number of CV splits.<br />
<br />
3. Choose to use a small number of L1 nodes in the ANN settings window.<br />
<br />
4. Don't use 2 hidden layers. This is very slow.<br />
<br />
=====From the command line=====<br />
1. Initially build ANN without cross-validation so as to decide on values for learnrate and learncycles by examining where the minimum value of model.detail.ann.rmsecviter occurs versus learncycles. Note this uses a single-split CV to estimate rmsecv when the ANN cross-validation is set as "None". It is inefficient to use a larger than necessary value for option "learncycles".<br />
<br />
2. Determine the number of hidden layer nodes to use by building a range of models with different numbers of nodes, nhid1, nhid2. If using the ANN Analysis window and the ANN has a single hidden layer then this can be done conveniently by using the “Maximum number of Nodes L1” setting in the cross-validation settings window. It is best to use a simple cross-validation at this survey stage, with a small number of splits and iterations.<br />
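<br />
A survey-stage options structure for the two steps above might look like the following. The field names come from the Options list, but the chosen values are only illustrative, and the commented calls assume PLS_Toolbox is on the path with calibration data x and y already loaded; whether a partial options structure is completed with defaults is also an assumption here.<br />
<br />
```matlab
% Plain struct using field names from the Options list.
options = struct();
options.algorithm   = 'bpn';    % rmsecviter is described for BPN only
options.nhid1       = 4;        % candidate number of layer 1 nodes
options.nhid2       = 0;        % single hidden layer
options.learnrate   = 0.125;
options.learncycles = 20;
options.cvmethod    = 'vet';    % simple CV for the survey stage
options.cvsplits    = 3;        % small number of splits

% With the toolbox available, the Synopsis form would be:
% model = ann(x, y, options);
% [minrmse, niter] = min(model.detail.ann.rmsecviter);
% If niter is well below options.learncycles, learncycles can be reduced.
```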
<br />
===See Also===<br />
<br />
[[annda]], [[analysis]], [[crossval]], [[lwr]], [[modelselector]], [[pls]], [[pcr]], [[preprocess]], [[svm]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mpca&diff=11009Mpca2020-02-06T21:57:44Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Multi-way (unfold) principal components analysis.<br />
<br />
===Synopsis===<br />
<br />
:model = mpca(mwa,ncomp,''options'')<br />
:model = mpca(mwa,ncomp,preprostring)<br />
:pred = mpca(mwa,model,''options'')<br />
:mpca - Launches an analysis window with MPCA as the selected method.<br />
<br />
Please note that the recommended way to build and apply a MPCA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Principal Components Analysis of multi-way data using unfolding to a 2-way matrix followed by conventional PCA.<br />
<br />
Inputs to MPCA are the multi-way array mwa (class "double" or "dataset") and the number of components to use in the model ncomp. To make predictions with new data the inputs are the multi-way array mwa and the MPCA model model. Optional input ''options'' is discussed below.<br />
<br />
For assistance in preparing batch data for use in MPCA please see [[bspcgui]].<br />
<br />
The output model is a structure array with the following fields:<br />
<br />
* '''modeltype''': 'MPCA',<br />
<br />
* '''datasource''': structure array with information about the x-block,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': 1 by 2 cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for each input data block (this is empty if options.blockdetails = 'standard'),<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
'''options''' = a structure array with the following fields.<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting,<br />
<br />
* '''outputversion''': [ 2 | {3} ] governs output format,<br />
<br />
* '''preprocessing''': { [] } preprocessing structure, {default is mean centering i.e. options.preprocessing = preprocess('default', 'mean center')} (see PREPROCESS),<br />
<br />
* '''algorithm''': [ {'svd'} | 'maf' | 'robustpca' ], algorithm for decomposition, Algorithm 'maf' requires Eigenvector's MIA_Toolbox.<br />
<br />
* '''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits.<br />
<br />
* '''roptions''': structure of options to pass to robpca (robust PCA engine from the Libra Toolbox).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''samplemode''': [ {3} ] mode (dimension) to use as the sample mode e.g. if it is 3 then it is assumed that mode 3 is the sample/object dimension i.e. if mwa is 7x9x10 then the scores model.loads{1} will have 10 rows (it will be 10xncomp).<br />
<br />
The default options can be retrieved using: options = mpca('options');.<br />
<br />
It is also possible to input just the preprocessing option as an ordinary string in place of ''options'' and have the remainder of options filled in with the defaults from above. The following strings are valid:<br />
<br />
: ''''none'''': no scaling,<br />
<br />
: ''''auto'''': unfolds array then applies autoscaling,<br />
<br />
: ''''mncn'''': unfolds array then applies mean centering, or<br />
<br />
: ''''grps'''': {default} unfolds array then group/block scales each variable, i.e. the same variance scaling is used for each variable along its time trajectory (see GSCALE).<br />
<br />
MPCA will work with arrays of order 3 and higher. For higher order arrays, the last order is assumed to be the sample order, ''i.e.'' for an array of order ''n'' with the dimension of order ''n'' being ''m'', the unfolded matrix will have ''m'' samples. For arrays of higher order the group scaling option will group together all data with the same order 2 index, for multiway array mwa, each mwa(:,j,:, ... ,:) will be scaled as a group.<br />
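<br />
The unfolding step can be sketched in plain MATLAB for the 7x9x10 example used in the samplemode description. This is an illustration under assumptions: the array contents are made up, and the column ordering produced by permute and reshape here is not necessarily the ordering MPCA uses internally.<br />
<br />
```matlab
mwa = reshape(1:7*9*10, [7 9 10]);   % made-up 7x9x10 array
samplemode = 3;                      % default sample mode

% Move the sample mode to the front, then flatten the remaining modes
% into columns: the result has one row per sample.
order    = [samplemode, setdiff(1:ndims(mwa), samplemode)];
unfolded = reshape(permute(mwa, order), size(mwa, samplemode), []);  % 10 x 63

% Mean centering of the unfolded matrix, as in the 'mncn' string option.
centered = bsxfun(@minus, unfolded, mean(unfolded, 1));
```
<br />
Row 2 of unfolded holds every element of mwa(:,:,2), so conventional PCA on this matrix treats each slab along the sample mode as one object.<br />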
<br />
===See Also===<br />
<br />
[[analysis]], [[bspcgui]], [[evolvfa]], [[ewfa]], [[explode]], [[npls]], [[parafac]], [[parafac2]], [[pca]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Parafac&diff=11008Parafac2020-02-06T21:56:59Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
PARAFAC (PARAllel FACtor analysis) for multi-way arrays<br />
<br />
===Synopsis===<br />
<br />
:model = parafac(X,ncomp,''initval,options'')<br />
:pred = parafac(Xnew,model)<br />
:parafac % Launches an analysis window with Parafac as the selected method<br />
<br />
Please note that the recommended way to build and apply a PARAFAC model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PARAFAC will decompose an array of order ''N'' (where ''N'' >= 3) into the summation over the outer product of ''N'' vectors (a low-rank model). E.g. if ''N''=3 then the array is size ''I'' by ''J'' by ''K''. An example of three-way fluorescence data is shown below.<br />
<br />
For example, twenty-seven samples containing different amounts of dissolved hydroquinone, tryptophan, phenylalanine, and dopa are measured spectrofluorometrically using 233 emission wavelengths (250-482 nm) and 24 excitation wavelengths (200-315 nm each 5 nm). A typical sample is also shown.<br />
<br />
[[Image:Parafacdata.gif]]<br />
<br />
A four-component PARAFAC model of these data will give four factors, each corresponding to one of the chemical analytes. This is illustrated graphically below. The first mode scores (loadings in mode 1) in the matrix '''A''' (27x4) contain estimated relative concentrations of the four analytes in the 27 samples. The second mode loadings '''B''' (233x4) are estimated emission loadings and the third mode loadings '''C''' (24x4) are estimated excitation loadings.<br />
<br />
[[Image:Parafacresults.gif]]<br />
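<br />
The trilinear structure described above can be written out directly. The following is a minimal sketch in base MATLAB (no toolbox calls), assuming random stand-in loadings with the dimensions of the fluorescence example. The names A, B and C follow the text; Xhat and ncomp are illustrative names only, and this is not the fitting algorithm itself.<br />
<br />
<pre><br />
% Stand-in loadings with the dimensions used in the fluorescence example
ncomp = 4;
A = rand(27, ncomp);    % mode 1: samples (relative concentrations)
B = rand(233, ncomp);   % mode 2: emission loadings
C = rand(24, ncomp);    % mode 3: excitation loadings

% Sum of outer products: Xhat(i,j,k) = sum over f of A(i,f)*B(j,f)*C(k,f)
Xhat = zeros(size(A,1), size(B,1), size(C,1));
for f = 1:ncomp
    for k = 1:size(C,1)
        Xhat(:,:,k) = Xhat(:,:,k) + C(k,f) * (A(:,f) * B(:,f)');
    end
end
size(Xhat)   % 27 233 24
</pre><br />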
<br />
For more information about how to use PARAFAC, see the [http://www.youtube.com/user/QualityAndTechnology/videos?view=1&flow=grid University of Copenhagen's Multi-Way Analysis Videos].<br />
<br />
In the PARAFAC algorithm, any missing values must be set to NaN or Inf and are then automatically handled by expectation maximization. This routine employs an alternating least squares (ALS) algorithm in combination with a line search. For 3-way data, the initial estimate of the loadings is usually obtained from the tri-linear decomposition (TLD).<br />
<br />
For assistance in preparing batch data for use in PARAFAC please see [[bspcgui]].<br />
<br />
====Inputs====<br />
<br />
* '''x''' = the multiway array to be decomposed, and<br />
<br />
* '''ncomp''' = <br />
:* the number of factors (components) to use, OR<br />
:* a cell array of parameters such as {a,b,c}, which will then be used as the starting point for the model. The cell array must be the same length as the number of modes, and element j contains the scores/loadings for mode j. If one cell element is empty, this mode is guessed based on the remaining modes.<br />
<br />
====Optional Inputs====<br />
<br />
* '''''initval''''' = <br />
:* If a parafac model is input, the data are fit to this model where the loadings for the first mode (scores) are estimated. <br />
:* If the loadings are input (e.g. model.loads) these are used as starting values.<br />
<br />
*'''''options''''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
The output model is a structure array with the following fields:<br />
<br />
* '''modeltype''': 'PARAFAC',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': 1 by ''N'' cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for each input data block,<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
Note that the sum-squared captured table contains various statistics on the information captured by each component. Please see [[MCR and PARAFAC Variance Captured]] for details.<br />
The output pred is a structure array that contains the approximation of the data if the options field blockdetails is set to 'all' (see next).<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ {'on'} | 'off' ], governs level of display,<br />
<br />
* '''plots''': [ {'final'} | 'all' | 'none' ], governs level of plotting,<br />
<br />
* '''weights''': [], used for fitting a weighted loss function (discussed below),<br />
<br />
* '''stopcriteria''': Structure defining when to stop iterations based on any one of four criteria<br />
<br />
:* '''relativechange''': Default is 1e-6. When the relative change in fit gets below the threshold, the algorithm stops.<br />
:* '''absolutechange''': Default is 1e-6. When the absolute change in fit gets below the threshold, the algorithm stops.<br />
:* '''iterations''': Default is 10,000. When the number of iterations exceeds the threshold, the algorithm stops.<br />
:* '''seconds''': Default is 3600 (seconds). When the time spent exceeds the threshold, the algorithm stops.<br />
<br />
* '''init''': [ 0 ], defines how parameters are initialized (discussed below),<br />
<br />
* '''line''': [ 0 | {1}] defines whether to use the line search {default uses it},<br />
<br />
* '''algo''': [ {'ALS'} | 'tld' | 'swatld' ] governs algorithm used. Only ALS allows more than three-way and allows constraints,<br />
<br />
* '''iterative''': settings for iterative reweighted least squares fitting (see help on weights below),<br />
<br />
* '''validation.splithalf''': [ 'on' | {'off'} ], Allows doing [[splithalf]] analysis. See the help of SPLITHALF for more information,<br />
<br />
* '''auto_outlier.perform''': [ 'on' | {'off'} ], Will automatically remove detected outliers in an iterative fashion. See auto_outlier.help for more information,<br />
<br />
* '''scaletype''': Defines how loadings are scaled. See options.scaletype.text for help,<br />
<br />
* '''blockdetails''': [ {'standard'} | 'compact' | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = like 'Standard', except that residual limits from the old model are used and the core consistency field in the model structure is left empty ('model.detail.reslim', 'model.detail.coreconsistency.consistency').<br />
:* 'All' = keep predictions, raw residuals for x-block as well as the X-block dataset itself.<br />
<br />
* '''preprocessing''': {[]}, one element cell array containing preprocessing structure (see PREPROCESS) defining preprocessing to use on the x-block <br />
<br />
* '''samplemode''': [1], defines which mode should be considered the sample or object mode,<br />
<br />
* '''constraints''': {3x1 cell}, defines constraints on parameters (discussed below),<br />
<br />
* '''coreconsist''': [ {'on'} | 'off' ], governs calculation of core consistency (turning off may save time with large data sets and many components), and<br />
<br />
* '''waitbar''': [ {'on'} | 'off' ], display waitbar. <br />
<br />
The default options can be retrieved using: options = parafac('options');.<br />
<br />
=====Weights=====<br />
<br />
Through the use of the ''options'' field weights it is possible to fit a PARAFAC model in a weighted least squares sense. The input is an array of the same size as the input data X holding individual weights for each element. Instead of minimizing the Frobenius norm ||X-M||<sup>2</sup>, where M is the PARAFAC model, the norm ||(X-M).*weights||<sup>2</sup> is minimized. The algorithm used for weighted regression is based on a majorization step according to Kiers, ''Psychometrika'', '''62''', 251-266, 1997, which has the advantage of being computationally inexpensive.<br />
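<br />
To make the two loss functions concrete, the sketch below evaluates both on random stand-in arrays in base MATLAB. It only computes the losses; it does not implement the majorization algorithm, and the names M, W and R are illustrative assumptions rather than toolbox variables.<br />
<br />
<pre><br />
X = rand(5,4,3);            % stand-in data array
M = rand(5,4,3);            % stand-in for the PARAFAC model of X
W = ones(size(X));          % one weight per data element
W(1,1,1) = 0;               % e.g. give one element no influence on the fit

R = X - M;                              % raw residuals
unweighted = sum(R(:).^2);              % ||X - M||^2
weighted   = sum((R(:) .* W(:)).^2);    % ||(X - M).*W||^2
</pre><br />
<br />
With these weights, the two losses differ by exactly the squared residual of the zero-weighted element.<br />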
<br />
=====Init=====<br />
<br />
The ''options'' field init is used to govern how the initial guess for the loadings is obtained. If optional input ''initval'' is input then options.init is not used. The following choices for init are available.<br />
<br />
Generally, options.init = 0 will do for well-behaved data, whereas options.init = 10 will be suitable for difficult models. Difficult models are typically those with many components, with very correlated loadings, or models where there are indications that local minima are present.<br />
<br />
* '''init''' = 0, PARAFAC chooses initialization {default},<br />
<br />
* '''init''' = 1, uses TLD (unless the data are more than three-way, in which case ATLD is used),<br />
<br />
* '''init''' = 2, based on singular value decomposition (good alternative to 1), <br />
<br />
* '''init''' = 3, based on orthogonalization of random values (good for checking local minima),<br />
<br />
* '''init''' = 4, based on approximate (sequentially fitted) PARAFAC model, <br />
<br />
* '''init''' = 5, based on compression which may be useful for large data, and<br />
<br />
* '''init''' > 5, based on best fit of many (the value options.init) small runs.<br />
<br />
=====Constraints=====<br />
<br />
The ''options'' field constraints is used to employ constraints on the parameters. It is a cell array with the number of elements equal to the number of modes of the input data X. Each cell contains a structure array that defines the constraints in that particular mode. Hence, options.constraints{2} defines constraints on the second mode loadings. For help on setting constraints see [[constrainfit]]. Note that if your dataset is, e.g., a five-way array, the default constraint field in options only defines the first three modes. You will have to make the constraint field for the remaining modes yourself. This can be done by copying from the other modes. For example, options.constraints{4} = options.constraints{1}; options.constraints{5} = options.constraints{1};<br />
<br />
===Examples===<br />
<br />
parafac demo gives a demonstration of the use of the PARAFAC algorithm.<br />
<br />
model = parafac(X,5) fits a five-component PARAFAC model to the array X using default settings.<br />
<br />
pred = parafac(Z,model) fits a parafac model to new data Z. The scores will be taken to be in the first mode, but you can change this by setting options.samplemode to the mode which is the sample mode. Note that the sample-mode dimension may be different for the old model and the new data, but all other dimensions must be the same.<br />
<br />
options = parafac('options'); generates a set of default settings for PARAFAC. options.plots = 'none'; turns plotting off.<br />
<br />
options.init = 3; sets the initialization of PARAFAC to orthogonalized random numbers.<br />
<br />
options.samplemode = 2; defines the second mode to be the sample mode. This is useful, for example, when fitting an existing model to new data must provide the scores in the second mode.<br />
<br />
model = parafac(X,2,options); fits a two-component PARAFAC model with the settings defined in options. <br />
<br />
parafac io shows the I/O of the algorithm.<br />
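<br />
The individual calls above can be combined into one short session. This is a sketch only: it requires PLS_Toolbox on the MATLAB path, uses random arrays X and Z as stand-ins for real three-way data, and relies solely on the calls and option values listed on this page.<br />
<br />
<pre><br />
% Requires PLS_Toolbox; X and Z are random stand-ins for real data
X = rand(10, 8, 6);               % calibration array (samples in mode 1)
Z = rand(4, 8, 6);                % new samples; the other modes match X

options = parafac('options');     % default settings
options.init  = 3;                % orthogonalized random starting values
options.plots = 'none';           % suppress plots

model = parafac(X, 2, options);   % two-component model
pred  = parafac(Z, model);        % fit the new data to the existing model
</pre><br />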
<br />
===See Also===<br />
<br />
[[analysis]], [[bspcgui]], [[datahat]], [[eemoutlier]], [[explode]], [[gram]], [[mpca]], [[npls]], [[outerm]], [[parafac2]], [[pca]], [[preprocess]], [[splithalf]], [[tld]], [[tucker]], [[unfoldm]], [[modelviewer]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mlr&diff=11007Mlr2020-02-06T21:55:53Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Multiple Linear Regression for multivariate Y.<br />
<br />
===Synopsis===<br />
<br />
:model = mlr(x,y,options)<br />
:pred = mlr(x,model,options)<br />
:valid = mlr(x,y,model,options)<br />
:mlr % Launches analysis window with MLR as the selected method.<br />
<br />
Please note that the recommended way to build and apply a MLR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
MLR identifies models of the form Xb = y + e.<br />
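<br />
As an illustration of this model form, the sketch below estimates b by least squares in base MATLAB on simulated data. It is not the toolbox implementation. The ridge line shows one common way of regularizing the inverse; whether mlr applies its ridge option in exactly this way is an assumption, and the names btrue and bridge are illustrative only.<br />
<br />
<pre><br />
rng(0);                                  % reproducible stand-in data
m = 30;
X = randn(m, 3);                         % X-block (predictors)
btrue = [2; -1; 0.5];
y = X*btrue + 0.01*randn(m, 1);          % Y-block (predicted)

b = (X'*X) \ (X'*y);                     % ordinary least squares estimate
e = X*b - y;                             % residual, matching Xb = y + e

ridge = 0.1;                             % hypothetical ridge value
bridge = (X'*X + ridge*eye(size(X,2))) \ (X'*y);   % regularized inverse
</pre><br />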
<br />
====Inputs====<br />
<br />
* '''x''' = X-block: predictor block (2-way array or DataSet Object)<br />
<br />
* '''y''' = Y-block: predicted block (2-way array or DataSet Object)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the MLR model (see [[Standard Model Structure]]).<br />
<br />
* '''pred''' = structure array with predictions<br />
<br />
* '''valid''' = structure array with predictions<br />
<br />
===Options===<br />
<br />
'''options''' = a structure array with the following fields.<br />
<br />
* '''display''': [ {'off'} | 'on'] Governs screen display to command line.<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
<br />
* '''ridge''': [ 0 ] ridge parameter to use in regularizing the inverse.<br />
<br />
* '''preprocessing''': { [] [] } preprocessing structure (see PREPROCESS).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
====Studentized Residuals====<br />
From version 8.8 onwards, the Studentized Residuals shown in the MLR Scores Plot are calculated for calibration samples as:
MSE = sum((res).^2)./(m-1);<br />
syres = res./sqrt(MSE.*(1-L));<br />
where res = y residual, m = number of samples, and L = sample leverage.<br />
This represents a constant multiplier change from how Studentized Residuals were previously calculated.<br />
For test datasets, where pres = predicted y residual, the semi-Studentized residuals are calculated as:<br />
MSE = sum((res).^2)./(m-1);<br />
syres = pres./sqrt(MSE);<br />
This represents a constant multiplier change from how the semi-Studentized Residuals were previously calculated.<br />
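<br />
The two formulas above can be tried on simulated data in base MATLAB. In this sketch the leverage L is taken as the diagonal of the hat matrix X*inv(X'*X)*X' and the residual as y minus the fitted value; both are assumptions about details not stated on this page, and the names Xt, yt and sypres are illustrative only.<br />
<br />
<pre><br />
rng(0);
m = 20;
X = [ones(m,1) randn(m,2)];              % calibration X-block with intercept column
y = X*[1; 2; -1] + 0.1*randn(m,1);       % calibration y
b = X \ y;                               % least-squares fit

res = y - X*b;                           % y residuals
L   = sum((X / (X'*X)) .* X, 2);         % leverage: diagonal of X*inv(X'*X)*X'
MSE = sum((res).^2)./(m-1);
syres = res./sqrt(MSE.*(1-L));           % Studentized residuals (calibration)

Xt = [ones(5,1) randn(5,2)];             % test X-block
yt = Xt*[1; 2; -1] + 0.1*randn(5,1);
pres = yt - Xt*b;                        % predicted y residuals
sypres = pres./sqrt(MSE);                % semi-Studentized residuals (test)
</pre><br />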
<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[ils_esterror]], [[modelstruct]], [[pcr]], [[pls]], [[preprocess]], [[ridge]], [[testrobustness]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mcr&diff=11006Mcr2020-02-06T21:55:20Z<p>Lyle: </p>
<hr />
<div><br />
===Purpose===<br />
<br />
Multivariate curve resolution with constraints.<br />
<br />
===Synopsis===<br />
<br />
:model = mcr(x,ncomp,''options'') %calibrate <br />
:model = mcr(x,c0,''options'') %calibrate with explicit initial guess<br />
:pred = mcr(x,model,''options'') %predict<br />
:mcr % Launches an Analysis window with mcr as the selected method.<br />
<br />
Please note that the recommended way to build and apply a MCR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
MCR decomposes a matrix '''X''' as '''CS''' such that '''X''' = '''CS''' + '''E''' where '''E''' is minimized in a least squares sense. By default, this is done using the alternating least squares (ALS) algorithm. For details on the ALS algorithm and constraints available in MCR, see the [[als]] reference page.<br />
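<br />
The alternating idea can be sketched in a few lines of base MATLAB. This is a deliberately crude illustration, not the toolbox algorithm: non-negativity is imposed by clipping each least-squares update at zero, which is only a stand-in for the constrained least squares described on the [[als]] page. The names Ctrue, Strue and relerr are illustrative only.<br />
<br />
<pre><br />
rng(0);
m = 40; n = 25; k = 2;
Ctrue = rand(m, k);                  % stand-in contributions
Strue = rand(k, n);                  % stand-in spectra
X = Ctrue*Strue + 0.001*rand(m, n);  % data matrix, X = CS + E

C = rand(m, k);                      % random initial guess for C
for it = 1:200
    S = max((C'*C) \ (C'*X), 0);     % least-squares update of S, clipped at zero
    C = max((X*S') / (S*S'), 0);     % least-squares update of C, clipped at zero
end
E = X - C*S;                         % residuals
relerr = norm(E, 'fro') / norm(X, 'fro');   % relative size of the residual
</pre><br />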
<br />
When called with new data and a model structure, MCR performs a prediction (applies the model to the new data) returning the projection of the new data onto the previously recovered loadings (i.e. estimated spectra).<br />
<br />
In addition to the constraints and options listed in [[als]], other pages which may be of interest include [[MCR Constraints]] which describes setting constraints in the [Analysis] interface, and [[MCR Contrast Constraint]] which discusses the contrast constraint option.<br />
<br />
====Inputs====<br />
* '''x''' = the matrix to be decomposed (size ''m'' by ''n'')<br />
* '''ncomp''' or '''c0''' or '''model''' :<br />
** '''ncomp''' = the number of components to extract<br />
** '''c0''' = the explicit initial guess. If c0 is size ''m'' by ''k'', where ''k'' is the number of factors, then it is assumed to be the initial guess for '''C'''. If c0 is size ''k'' by ''n'', then it is assumed to be the initial guess for '''S'''. If ''m''=''n'', then c0 is assumed to be the initial guess for '''C'''. Optional input ''options'' is described below.<br />
** '''model''' = a previously calculated MCR model structure to apply to the data in input '''x'''.<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure containing the results of the analysis. The estimated contributions '''C''' are stored in model.loads{2} and the estimated spectra '''S''' in model.loads{1}. Sum-squared residuals for samples and variables can be found in model.ssqresiduals{1} and model.ssqresiduals{2}, respectively. See the chemometrics tutorial for more information on the MCR method and models. Note that the sum-squared captured table contains various statistics on the information captured by each component. Please see [[MCR and PARAFAC Variance Captured]] for details.<br />
<br />
===Options===<br />
<br />
* '''''options''''' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
<br />
* '''waitbar''': [ 'off' | 'on' | {'auto'} ] governs use of waitbar,<br />
<br />
* '''preprocessing''': { [] } preprocessing to apply to x-block (see PREPROCESS).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''initmode''': [1 | 2] Mode of x for automatic initialization.<br />
<br />
* '''confidencelimit''': [{0.95}] Confidence level for Q limits. <br />
<br />
* '''alsoptions''': ['options'] options passed to ALS subroutine (see ALS).<br />
<br />
The default options can be retrieved using: options = mcr('options');.<br />
<br />
===See Also===<br />
<br />
[[als]], [[analysis]], [[evolvfa]], [[ewfa]], [[fasternnls]], [[fastnnls]], [[fastnnls_sel]], [[mlpca]], [[parafac]], [[parafac2]], [[plotloads]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pcr&diff=11005Pcr2020-02-06T21:54:15Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Principal Components Regression: multivariate inverse least squares regression.<br />
<br />
===Synopsis===<br />
<br />
:model = pcr(x,y,ncomp,''options'') %identifies model (calibration step)<br />
:pred = pcr(x,model,''options'') %applies model to a new X-block<br />
:valid = pcr(x,y,model,''options'') %applies model to a new X-block, with corresponding new Y values<br />
:pcr % Launches an Analysis window with PCR as the selected method.<br />
<br />
Please note that the recommended way to build and apply a PCR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PCR calculates a single principal components regression model using the given number of components <tt>ncomp</tt> to predict <tt>y</tt> from measurements <tt>x</tt>, OR applies an existing PCR model to a new set of data <tt>x</tt>.<br />
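<br />
For orientation, the calibration and prediction steps can be sketched in base MATLAB using a singular value decomposition. This is an illustration under stated assumptions, not the toolbox code: mean-centering of both blocks is assumed as preprocessing, the data are simulated, and the names T, P, reg, xnew and ypred are illustrative only.<br />
<br />
<pre><br />
rng(0);
m = 50; ncomp = 3;
x = randn(m, 10);                               % X-block
y = x(:,1:3)*[1; -2; 0.5] + 0.05*randn(m,1);    % Y-block

mx = mean(x, 1);  my = mean(y, 1);              % mean-centering (assumed preprocessing)
xc = x - repmat(mx, m, 1);
yc = y - my;

[U, S, V] = svd(xc, 'econ');                    % PCA of the X-block via SVD
T = U(:,1:ncomp) * S(1:ncomp,1:ncomp);          % scores
P = V(:,1:ncomp);                               % loadings
reg = P * ((T'*T) \ (T'*yc));                   % regression vector in X-variable space

xnew = randn(5, 10);                            % new X-block
ypred = (xnew - repmat(mx, 5, 1)) * reg + my;   % predictions for the new data
</pre><br />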
<br />
To make predictions, the inputs are <tt>x</tt> the new predictor x-block (2-way array class "double" or "dataset"), and <tt>model</tt> the PCR model. The output <tt>pred</tt> is a structure, similar to <tt>model</tt>, that contains scores, predictions, etc. for the new data.<br />
<br />
If new y-block measurements are also available for the new data, then the inputs are <tt>x</tt> the new x-block (2-way array class "double" or "dataset"), <tt>y</tt> the new y-block (2-way array class "double" or "dataset"), and <tt>model</tt> the PCR model to apply. The output <tt>valid</tt> is a structure, similar to <tt>model</tt>, that contains scores, predictions, and additional y-block statistics etc. for the new data.<br />
<br />
In prediction and validation modes, the same model structure is used but predictions are provided in the <tt>model.detail.pred</tt> field.<br />
<br />
Note: Calling '''pcr''' with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block data (2-way array or DataSet Object)<br />
* '''y''' = Y-block data (2-way array or DataSet Object)<br />
* '''ncomp''' = number of components to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' discussed below<br />
<br />
====Outputs====<br />
<br />
The output is a standard model structure with the following fields (see [[Standard Model Structure]]):<br />
<br />
* '''modeltype''': 'PCR',<br />
* '''datasource''': structure array with information about input data,<br />
* '''date''': date of creation,<br />
* '''time''': time of creation,<br />
* '''info''': additional model information,<br />
* '''reg''': regression vector,<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
* '''pred''': 2 element cell array containing model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array), and the y-block predictions.<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
* '''description''': cell array with text description of model, and<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
<br />
* '''outputversion''': [ 2 | {3} ], governs output format (discussed below),<br />
<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively),<br />
<br />
* '''algorithm''': [ {'svd'} | 'robustpcr' | 'correlationpcr' | 'frpcr' ], governs which algorithm to use.<br />
** 'svd' = standard singular value decomposition algorithm. <br />
** 'robustpcr' = robust algorithm with automatic outlier detection. <br />
** 'correlationpcr' = standard PCR with re-ordering of factors in order of y-variance captured.<br />
** 'frpcr' = full-ratio PCR (a.k.a. optimized scaling) with automatic sample scale correction. Note that with FRPCR, models generally perform better without mean-centering on the x-block.<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
* '''confidencelimit''': [ {0.95} ], confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits,<br />
<br />
* '''roptions''': structure of options to pass to '''rpcr''' (robust PCR engine from the Libra Toolbox). Only used when algorithm is 'robustpcr',<br />
<br />
* '''alpha''' : [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpcr'.<br />
<br />
* '''intadjust''' : [ {0} ], if equal to one, the intercept adjustment for the LTS-regression will be calculated. See '''ltsregres''' for details (Libra Toolbox).<br />
<br />
The default options can be retrieved using: options = pcr('options');.<br />
<br />
====OUTPUTVERSION====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[b,ssq,t,p] = pcr(x,y,ncomp,''options'')<br />
<br />
where the outputs are<br />
<br />
* '''b''' = matrix of regression vectors or matrices for each number of principal components up to ncomp,<br />
<br />
* '''ssq''' = the sum of squares information, <br />
<br />
* '''t''' = x-block scores, and<br />
<br />
* '''p''' = x-block loadings.<br />
<br />
Note: The regression matrices are ordered in '''b''' such that each ''Ny'' (number of y-block variables) rows correspond to the regression matrix for that particular number of principal components.<br />
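<br />
Reading the note above as meaning that '''b''' stacks one ''Ny''-row block per number of principal components (an interpretation, not confirmed here), the block for a given number of components could be extracted as sketched below. The matrix b is a random stand-in, and Nx, rows and bk are illustrative names only.<br />
<br />
<pre><br />
Ny = 2; Nx = 6; ncomp = 3;
b = rand(ncomp*Ny, Nx);            % stand-in for the b output (outputversion = 2)

k = 2;                             % number of principal components of interest
rows = (k-1)*Ny + (1:Ny);          % the k-th group of Ny rows
bk = b(rows, :);                   % regression matrix for a k-component model (Ny by Nx)
</pre><br />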
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[frpcr]], [[mlr]], [[modelstruct]], [[pca]], [[pls]], [[preprocess]], [[ridge]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Plsda&diff=11004Plsda2020-02-06T21:53:23Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Partial least squares discriminant analysis.<br />
<br />
===Synopsis===<br />
:plsda - Launches an Analysis window with the PLSDA method selected<br />
:model = plsda(x,y,ncomp,''options'')<br />
:model = plsda(x,ncomp,''options'')<br />
:pred = plsda(x,model,''options'')<br />
:valid = plsda(x,y,model,''options'')<br />
<br />
Please note that the recommended way to build and apply a PLSDA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PLSDA is a multivariate inverse least squares discrimination method used to classify samples. The y-block in a PLSDA model indicates which samples are in the class(es) of interest through either:<br />
<br />
*(A) a column vector of class numbers indicating class assignments:<br />
<br />
y = [1 1 3 2]';<br />
<br />
:'''NOTE:''' if classes are assigned in the input (x), y can be omitted and this option will be assumed using the first class set of the x-block rows (or other set if the option "classset" is used). For information on assigning classes to the X-block, see [[Assigning Sample Classes]].<br />
<br />
*(B) a matrix of one or more columns containing a logical zero (= not in class) or one (= in class) for each sample (row):<br />
<pre><br />
y = [1 0 0;<br />
1 0 0;<br />
0 0 1;<br />
0 1 0]<br />
</pre><br />
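<br />
The two y-block forms carry the same information. As a sketch in base MATLAB (the toolbox function [[class2logical]] is not used here because its signature is not shown on this page), the class vector of case (A) can be expanded into the logical matrix of case (B); the names classes and Y are illustrative only.<br />
<br />
<pre><br />
y = [1 1 3 2]';                          % class numbers, as in case (A)
classes = unique(y(y ~= 0));             % class 0 is reserved for "unknown" samples
Y = double(bsxfun(@eq, y, classes'));    % one column per class, as in case (B)
% Y = [1 0 0; 1 0 0; 0 0 1; 0 1 0]
</pre><br />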
<br />
:'''NOTE''': When a vector of class numbers is used (case A, above), class zero (0) is reserved for "unknown" samples and, thus, samples of class zero are never used when calibrating a PLSDA model. The model will include predictions for these samples.<br />
<br />
====Probability-based Predictions====<br />
The raw prediction from a PLSDA model is a value of nominally zero or one. A value closer to zero indicates the new sample is NOT in the modeled class; a value of one indicates a sample is in the modeled class. In practice a threshold between zero and one is determined above which a sample is in the class and below which a sample is not in the class (See, for example, [[plsdthres]]). Similarly, a probability of a sample being inside or outside the class can be calculated using [[discrimprob]]. The predicted probability of each class as well as class assignments made with various rules can be found in the field:<br />
<br />
:model.classification<br />
<br />
For more details, see [[Sample Classification Predictions]], and the description of the model's classification field in the [[Standard Model Structure]].<br />
<br />
====Threshold-based Predictions====<br />
It is possible to see the classification results based on the sample prediction relative to the threshold for that class. These can differ slightly from the predictions based on probabilities. The probability-based predictions are likely to be more accurate in situations where one class is narrowly distributed in the y-prediction range but other classes are broadly distributed, so that the broad classes are more probable for y-prediction values far from the narrow class's probable y range (see [http://www.eigenvector.com/faq/index.php?id=38]). <br />
* In the PLSDA Analysis window the threshold-based classification results can be viewed by using the menu: "Tools"->"Show Details"->"Model", or by mousing over the model icon. This reports the Sensitivity, Specificity, and Class Error for each modeled class. The "Class Err." is defined as the mean of the false positive and false negative rates (see [https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Definitions definitions]).<br />
* For command line usage these are found in the model object as model.detail.misclassedc, a cell array containing a matrix for each class, and model.detail.classerrc. For class j:<br />
<br />
: False positive rate (1 - specificity): model.detail.misclassedc{j}(1, ncomp)<br />
: False negative rate (1 - sensitivity): model.detail.misclassedc{j}(2, ncomp), where ncomp = number of latent variables used in model.<br />
: Class Error: model.detail.classerrc(j, ncomp)<br />
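<br />
The relationship between these three quantities can be checked by hand. The sketch below uses base MATLAB with made-up predictions for a single class and a hypothetical threshold of 0.5; it does not read anything from a model object, and all variable names are illustrative.<br />
<br />
<pre><br />
ypred   = [0.9 0.7 0.2 0.6 0.1 0.4]';   % stand-in raw predictions for one class
inclass = logical([1 1 1 0 0 0]');      % known class membership
thresh  = 0.5;                          % hypothetical threshold

assigned = ypred > thresh;              % threshold-based class assignment
fnr = sum(~assigned &  inclass) / sum(inclass);    % false negative rate = 1 - sensitivity
fpr = sum( assigned & ~inclass) / sum(~inclass);   % false positive rate = 1 - specificity
classerr = mean([fpr fnr]);             % class error: mean of the two rates
</pre><br />
<br />
Here one in-class sample falls below the threshold and one out-of-class sample falls above it, so both rates and the class error equal 1/3.<br />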
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block), class "double" or "dataset",<br />
* '''y''' = Y-block <br />
** OPTIONAL if '''x''' is a dataset containing classes for sample mode (mode 1)<br />
** otherwise, '''y''' is one of the following:<br />
***(A) column vector of sample classes for each sample in '''x''' <br />
***(B) a logical array with '1' indicating class membership for each sample (rows) in one or more classes (columns), or <br />
***(C) a cell array of class groupings of classes from the x-block data. For example: <tt> {[1 2] [3]} </tt> would model classes 1 and 2 as a single group against class 3.<br />
* '''ncomp''' = the number of latent variables to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' = an optional input options structure (see below)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the PLSDA model (See [[Standard Model Structure]]).<br />
* '''pred''' = structure array with predictions<br />
* '''valid''' = structure array with predictions, includes known class information (Y block data) of test samples<br />
<br />
Note: Calling '''plsda''' with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
<br />
For more information on class predictions, see [[Sample Classification Predictions]].<br />
<br />
===Options===<br />
<br />
''options'' = a structure that can contain the following fields:<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''orthogonalize''': [ {'off'} | 'on' ] Orthogonalize model to condense y-block variance into first latent variable; 'on' = produce orthogonalized model. Regression vector and predictions are NOT changed by this option, only the loadings, weights, and scores. See [[orthogonalizepls]] for more information.<br />
* '''priorprob''': [ ] Vector of prior probabilities of observing each class. If any class prior is "Inf", the frequency of observation of that class in the calibration is used as its prior probability. If all priors are Inf, this has the effect of providing the fewest incorrect predictions, assuming that the probability of observing a given class in future samples is similar to the frequency of that class in the calibration set. The default [] uses all ones, i.e. equal priors. '''NOTE:''' the "prior" option from older versions of the software had a bug which caused inverted behavior for this feature. The field name was changed to avoid confusion after the bug was fixed.<br />
* '''classset''': [ 1 ] indicates which class set in x to use when no y-block is provided.<br />
* '''algorithm''': [ 'nip' | {'sim'} | 'dspls' | 'robustpls' ] PLS algorithm to use: NIPALS, SIMPLS, DSPLS, or robust PLS.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = keep predictions and raw residuals for the Y-block only (Y-block included).<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
* '''strictthreshold''': Probability threshold value to use in strict class assignment, see [[Sample_Classification_Predictions#Class_Pred_Strict]]. Default = 0.5.<br />
*'''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits, a value of zero (0) disables calculation of confidence limits,<br />
* '''roptions''': structure of options to pass to rsimpls (robust PLS engine from the Libra Toolbox).<br />
**: '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpls'.<br />
<br />
*'''weights''': [ {'none'} | 'hist' | 'custom' ] governs sample weighting. 'none' does no weighting. 'hist' performs histogram weighting in which large numbers of samples at individual y-values are down-weighted relative to small numbers of samples at other values. 'custom' uses the weighting specified in the weightsvect option.<br />
*'''weightsvect''': [ ] Used only with custom weights. The vector specified must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[class2logical]], [[compressmodel]], [[crossval]], [[discrimprob]], [[knn]], [[modelselector]], [[pls]], [[plsdaroc]], [[plsdthres]], [[preprocess]], [[simca]], [[svmda]], [[vip]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pls&diff=11003Pls2020-02-06T21:52:42Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Partial least squares regression for univariate or multivariate y-block.<br />
<br />
===Synopsis===<br />
<br />
:model = pls(x,y,ncomp,''options'') %identifies model (calibration step)<br />
:pred = pls(x,model,''options'') %makes predictions with a new X-block<br />
:valid = pls(x,y,model,''options'') %makes predictions with new X- & Y-block<br />
:pls % launches analysis window with PLS selected<br />
<br />
Please note that the recommended way to build and apply a PLS model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PLS calculates a single partial least squares regression model using the given number of components <tt>ncomp</tt> to predict a dependent variable <tt>y</tt> from a set of independent variables <tt>x</tt>.<br />
<br />
Alternatively, PLS can be used in 'prediction mode' to apply a previously built PLS model in <tt>model</tt> to an external set of test data in <tt>x</tt> (2-way array class "double" or "dataset"), in order to generate y-values for these data.<br />
<br />
Furthermore, if matching x-block and y-block measurements are available for an external test set, then PLS can be used in 'validation mode' to predict the y-values of the test data from the model <tt>model</tt> and <tt>x</tt>, and allow comparison of these predicted y-values to the known y-values <tt>y</tt>.<br />
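<br />
The three calling modes can be sketched as follows. This is a minimal illustration that assumes PLS_Toolbox is on the MATLAB path; the variables <tt>xcal</tt>, <tt>ycal</tt>, <tt>xtest</tt> and <tt>ytest</tt> are hypothetical placeholders filled with random data, and the choice of 3 components is arbitrary.<br />
<br />
<pre><br />
% Hypothetical data for illustration only (not a toolbox demo dataset)<br />
xcal  = randn(30,10);<br />
ycal  = xcal(:,1:2)*[1;2] + 0.1*randn(30,1);<br />
xtest = randn(10,10);<br />
ytest = xtest(:,1:2)*[1;2] + 0.1*randn(10,1);<br />
<br />
options = pls('options');   % retrieve the default options structure<br />
options.display = 'off';    % suppress command window output<br />
options.plots   = 'none';   % suppress plotting<br />
<br />
model = pls(xcal,ycal,3,options);        % calibration: build a 3 component model<br />
pred  = pls(xtest,model,options);        % prediction: apply model to a new X-block<br />
valid = pls(xtest,ytest,model,options);  % validation: new X-block with known Y-block<br />
<br />
yhat = pred.pred{2};        % y-block predictions for the test samples<br />
</pre><br />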
<br />
Note: Calling pls with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = the independent variable (X-block) data (2-way array class "double" or class "dataset")<br />
* '''y''' = the dependent variable (Y-block) data (2-way array class "double" or class "dataset")<br />
* '''ncomp''' = the number of components to be calculated (positive integer scalar)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'PLS',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''reg''': regression vector,<br />
** '''loads''': cell array with model loadings for each mode/dimension,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array), and<br />
*** the y-block predictions.<br />
** '''wts''': double array with X-block weights,<br />
** '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
** '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
** '''description''': cell array with text description of model, and<br />
** '''detail''': sub-structure with additional model details and results.<br />
<br />
* '''pred''' = a structure, similar to '''model''', that contains scores, predictions, etc. for the new data.<br />
<br />
* '''valid''' = a structure, similar to '''model''', that contains scores, predictions, and additional y-block statistics, etc. for the new data.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
* '''outputversion''': [ 2 | {3} ], governs output format (see below),<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'nip' | {'sim'} | 'dspls' | 'robustpls' ], PLS algorithm to use: NIPALS, SIMPLS {default}, Direct Scores, or robust pls (with automatic outlier detection).<br />
* '''orthogonalize''': [ {'off'} | 'on' ] Orthogonalize model to condense y-block variance into first latent variable; 'on' = produce orthogonalized model. Regression vector and predictions are NOT changed by this option, only the loadings, weights, and scores. See [[orthogonalizepls]] for more information.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
*'''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits, a value of zero (0) disables calculation of confidence limits,<br />
*'''weights''': [ {'none'} | 'hist' | 'custom' ] governs sample weighting. 'none' does no weighting. 'hist' performs histogram weighting in which large numbers of samples at individual y-values are down-weighted relative to small numbers of samples at other values. 'custom' uses the weighting specified in the weightsvect option.<br />
*'''weightsvect''': [ ] Used only with custom weights. The vector specified must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.<br />
* '''roptions''': structure of options to pass to rsimpls (robust PLS engine from the Libra Toolbox).<br />
** '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpls'.<br />
<br />
The default options can be retrieved using: options = pls('options');.<br />
<br />
====Outputversion====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[b,ssq,p,q,w,t,u,bin] = pls(x,y,ncomp,''options'')<br />
<br />
where the outputs are as defined for the [[nippls]] function. This is provided for backwards compatibility. It is recommended that users call the [[simpls]] or [[nippls]] functions directly.<br />
<br />
There is also a difference in the scores and loadings returned by the old version and the new (default) version. The old version (outputversion=2) keeps the variance in the loadings and the scores are normalized. The new version (outputversion=3) keeps the variance in the scores and has normalized loadings. The older format is related to the usage in the original algorithm publications. The newer format is used in order to maintain a standardized format across all PLS algorithms (robust PLS, and DSPLS).<br />
<br />
===Algorithm===<br />
<br />
Note that unlike previous versions of the PLS function, the default algorithm (see Options, above) is the faster SIMPLS algorithm. If the alternate NIPALS algorithm is to be used, the options.algorithm field should be set to 'nip'.<br />
<br />
Option 'robustpls' enables a robust method for Partial Least Squares Regression based on the SIMPLS algorithm. This uses the function 'rsimpls' from the well-known LIBRA Toolbox, developed by Mia Hubert's research group at the Katholieke Universiteit Leuven (kuleuven.be). The RSIMPLS method is described in: Hubert, M., and Vanden Branden, K. (2003), "Robust Methods for Partial Least Squares Regression", Journal of Chemometrics, 17, 537-549.<br />
<br />
====Studentized Residuals====<br />
From version 8.8 onwards, the Studentized Residuals shown for PLS Scores Plot are now calculated for calibration samples as:<br />
MSE = sum((res).^2)./(m-ncomp);<br />
syres = res./sqrt(MSE.*(1-L));<br />
where res = y residual, m = number of samples, ncomp = number of LV components and L = sample leverage.<br />
This represents a constant multiplier change from how Studentized Residuals were previously calculated.<br />
For test datasets the semi-Studentized residuals are calculated as:<br />
MSE = sum((res).^2)./(m-ncomp);<br />
syres = pres./sqrt(MSE);<br />
This represents a constant multiplier change from how the semi-Studentized Residuals were previously calculated.<br />
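<br />
The two formulas above can be checked numerically with a small base-MATLAB sketch. The residuals and leverages below are made-up values, and <tt>pres</tt> is assumed here to mean the y residuals of the test samples; neither comes from the toolbox.<br />
<br />
<pre><br />
% Hypothetical values for illustration only<br />
res   = [0.2; -0.1; 0.4; -0.3; 0.1; -0.3];  % calibration y residuals<br />
L     = [0.3;  0.2; 0.5;  0.4; 0.2;  0.4];  % calibration sample leverages<br />
pres  = [0.25; -0.15];                      % test-sample y residuals (assumed meaning of pres)<br />
ncomp = 2;                                  % number of LV components<br />
m     = numel(res);                         % number of calibration samples<br />
<br />
MSE    = sum(res.^2)./(m-ncomp);            % mean squared error of calibration residuals<br />
syres  = res./sqrt(MSE.*(1-L));             % Studentized residuals, calibration samples<br />
sypres = pres./sqrt(MSE);                   % semi-Studentized residuals, test samples<br />
</pre><br />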
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[mlr]], [[modelstruct]], [[nippls]], [[pcr]], [[plsda]], [[preprocess]], [[ridge]], [[simpls]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pca&diff=11002Pca2020-02-06T21:51:56Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Perform principal components analysis.<br />
<br />
===Synopsis===<br />
<br />
<br />
:model = pca(x,ncomp,options); %identifies model (calibration step)<br />
:pred = pca(x,model,options); %projects a new X-block onto existing model<br />
:pca % Launches Analysis window with PCA selected<br />
<br />
Please note that the recommended way to build and apply a PCA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building and applying models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Performs a principal component analysis decomposition of the input array data returning ncomp principal components. E.g. for an ''M'' by ''N'' matrix <tt>X</tt> the PCA model is <math>X = TP^T + E</math>, where the scores matrix '''T''' is ''M'' by ''K'', the loadings matrix '''P''' is ''N'' by ''K'', the residuals matrix '''E''' is ''M'' by ''N'', and ''K'' is the number of factors or principal components <tt>ncomp</tt>. The output <tt>model</tt> is a PCA model structure. This model can be applied to new data by passing the model structure to PCA along with new data <tt>x</tt> or by using [[pcapro]].<br />
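<br />
The decomposition can be illustrated with a short base-MATLAB sketch built on <tt>svd</tt>. This is not the toolbox's implementation; the random data, the mean-centering step and the variable names are assumptions made for illustration.<br />
<br />
<pre><br />
X  = randn(20,5);                          % hypothetical M by N data matrix<br />
K  = 2;                                    % number of principal components (ncomp)<br />
Xc = X - repmat(mean(X,1),size(X,1),1);    % mean-center the columns (assumed preprocessing)<br />
<br />
[U,S,V] = svd(Xc,'econ');                  % singular value decomposition<br />
T = U(:,1:K)*S(1:K,1:K);                   % scores, M by K<br />
P = V(:,1:K);                              % loadings, N by K (orthonormal columns)<br />
E = Xc - T*P';                             % residuals, M by N<br />
</pre><br />
<br />
Here the model equation holds for the centered data <tt>Xc</tt>, and the variance is carried by the scores while the loadings are normalized.<br />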
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (2-way array class "double" or "dataset"), and<br />
<br />
* '''ncomp''' = number of components to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''model''' = existing PCA model, onto which new data '''x''' is to be applied.<br />
<br />
* '''''options''''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
The output of PCA is a model structure with the following fields (see [[Standard Model Structure]] for additional information):<br />
<br />
* '''modeltype''': 'PCA',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for the input block (when blockdetails='standard' x-block predictions are not saved and this will be an empty array),<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
If the inputs are an ''M''<sub>new</sub> by ''N'' matrix newdata and a PCA model model, then PCA applies the model to the new data. Preprocessing included in model will be applied to newdata. The output pred is a structure, similar to model, that contains the new scores and other predictions for newdata.<br />
<br />
Note: Calling pca with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting.<br />
<br />
* '''outputversion''': [ 2 | {3} ], governs output format (discussed below),<br />
<br />
* '''algorithm''': [ {'svd'} | 'maf' | 'robustpca' ], algorithm for decomposition. Note that algorithm 'maf' ([[maxautofactors | Maximum Autocorrelation Factors]] for hyperspectral images) requires Eigenvector's MIA_Toolbox,<br />
<br />
* '''preprocessing''': {[]}, cell array containing a preprocessing structure (see PREPROCESS) defining preprocessing to use on the data (discussed below),<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits.<br />
<br />
* '''roptions''': structure of options to pass to robpca (robust PCA engine from the Libra Toolbox).<br />
<br />
* '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpca'.<br />
<br />
* '''cutoff''': [] Similar to confidencelimit, this confidence level is used by the robust algorithm to indicate which sample(s) are considered outside the limits and, therefore, likely outliers. It does NOT indicate which samples were actually left out (see alpha above), but only those samples which appear to be more unusual. Default value is the same value as confidencelimit (if non-zero) or alpha (if confidencelimit is zero.)<br />
<br />
The default options can be retrieved using: options = pca('options');.<br />
<br />
====OUTPUTVERSION====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[scores,loads,ssq,res,reslim,tsqlim,tsq] = pca(xblock1,2,options);<br />
<br />
where the outputs are<br />
<br />
* '''scores''' = x-block scores,<br />
<br />
* '''loads''' = x-block loadings<br />
<br />
* '''ssq''' = the sum of squares information, <br />
<br />
* '''res''' = the Q residuals,<br />
<br />
* '''reslim''' = the estimated 95% confidence limit line for Q residuals,<br />
<br />
* '''tsqlim''' = the estimated 95% confidence limit line for T<sup>2</sup>, and<br />
<br />
* '''tsq''' = the Hotelling's T<sup>2</sup> values.<br />
<br />
====PREPROCESSING====<br />
<br />
The preprocessing field can be empty [] (indicating that no preprocessing of the data should be used), or it can contain a preprocessing structure output from the PREPROCESS function. For example options.preprocessing = {preprocess('default', 'autoscale')}. This information is echoed in the output model in the model.detail.preprocessing field and is used when applying the PCA model to new data.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[evolvfa]], [[ewfa]], [[explode]], [[parafac]], [[plotloads]], [[plotscores]], [[preprocess]], [[ssqtable]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Figmerit&diff=10999Figmerit2020-02-03T15:38:25Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Analytical figures of merit for multivariate calibration.<br />
<br />
===Synopsis===<br />
<br />
:[nas,nnas,sens,sel] = figmerit(x,y,b);<br />
<br />
===Description===<br />
<br />
Calculates analytical figures of merit for PLS and PCR standard model structures. Inputs are the preprocessed (usually centered and scaled) spectral data <tt>x</tt>, the preprocessed analyte data <tt>y</tt>, and the regression vector, <tt>b</tt>. Note that for standard PLS and PCR structures <tt>b = model.reg</tt>.<br />
<br />
The outputs are the matrix of net analyte signals <tt>nas</tt> for each row of <tt>x</tt>, the norm of the net analyte signal for each row <tt>nnas</tt> (this is corrected to include the sign of the prediction), the matrix of sensitivities for each sample <tt>sens</tt>, and the vector of selectivities for each sample <tt>sel</tt> (sel is always non-negative).<br />
<br />
Note that the "noise-filtered" estimate present in previous versions is no longer used because an improved method for calculating the net analyte vector makes it redundant.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = x-block data, normally centered and scaled<br />
* '''y''' = y-block data, preprocessed<br />
* '''b''' = regression vector. Standard PLS_Toolbox PLS and PCR structures contain this vector in the <tt>.reg</tt> field.<br />
<br />
====Outputs====<br />
<br />
* '''nas''' = net analyte signals for each row of <tt>x</tt>.<br />
* '''nnas''' = norm of the net analyte signal for each row.<br />
* '''sens''' = matrix of sensitivities for each sample.<br />
* '''sel''' = vector of selectivities for each sample.<br />
<br />
<br />
===Examples===<br />
<br />
Given the 7 LV PLS model:<br />
<br />
<pre><br />
modl = pls(x,y,7);<br />
[nas,nnas,sens,sel] = figmerit(x,y,modl.reg);<br />
</pre><br />
<br />
Given the 5 PC PCR model:<br />
<br />
<pre><br />
modl = pcr(auto(x),auto(y),5);<br />
[nas,nnas,sens,sel] = figmerit(auto(x),auto(y),modl.reg);<br />
</pre><br />
<br />
===See Also===<br />
<br />
[[pcr]], [[pls]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Svmda&diff=10987Svmda2020-01-06T14:47:43Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
SVMDA: Support Vector Machine (LIBSVM) for classification. Use [[Svm|svm]] for support vector machine regression.<br />
<br />
===Synopsis===<br />
<br />
:model = svmda(x,options); %identifies model (calibration step) based on x-block classes<br />
:model = svmda(x,y,options); %identifies model (calibration step)<br />
:pred = svmda(x,model,options); %makes predictions with a new X-block<br />
:pred = svmda(x,y,model,options); %performs a "test" call with a new X-block and known <br />
:svmda % Launches an analysis window with svmda as the selected method. <br />
<br />
Please note that the recommended way to build a SVMDA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]].<br />
<br />
===Description===<br />
<br />
SVMDA performs calibration and application of Support Vector Machine (SVM) classification models. (Please see the svm function for support vector machine regression problems). These are non-linear models which can be used for classification problems. The model consists of a number of support vectors (essentially samples selected from the calibration set) and non-linear model coefficients which define the non-linear mapping of variables in the input x-block. The model allows prediction of the classification based on either the classes field of the calibration x-block or a y-block which contains integer-valued classes. It is recommended that regression be done through the [[Svm|svm]] function.<br />
<br />
Svmda is implemented using the LIBSVM package, which provides both cost-support vector classification (C-SVC) and nu-support vector classification (nu-SVC). Linear and Gaussian Radial Basis Function kernel types are supported by this function.<br />
<br />
Note: Calling svmda with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing integer values,<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'SVM',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved),<br />
** '''classification''': information about the classification of X-block samples (see description at [[Standard_Model_Structure#model|Standard Model]]). For more information on class predictions, see [[Sample Classification Predictions]].<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.svm.model: Matlab version of the libsvm svm_model (Java). Note that the number of support vectors used is given by model.detail.svm.model.l. It is useful to check this because it can indicate overfitting if most of the calibration samples are used as support vectors, or can indicate problems fitting a model if there are no support vectors (and all prediction values will equal a constant value, a weighted mean).<br />
*** model.detail.svm.cvscan: results of CV parameter scan<br />
*** model.detail.svm.svindices: Indices of X-block samples which are support vectors.<br />
<br />
* '''pred''' a structure, similar to '''model''' for the new data.<br />
** '''pred''': The vector pred.pred{2} will contain the class predictions for each sample.<br />
<br />
For more information on class predictions, see [[Sample Classification Predictions]]<br />
<br />
====Options====<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
* '''classset''': [ {1} ], indicates which class set in x to use when no y-block is provided,<br />
* '''preprocessing''': {[]} preprocessing structures for x block (see PREPROCESS). NOTE that y-block preprocessing is NOT used with SVMDA. Any y-preprocessing will be ignored.<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the SVM model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the svmtype). Compression can make the SVM more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.<br />
* '''algorithm''': [ 'libsvm' ] algorithm to use. libsvm is default and currently only option.<br />
* '''kerneltype''': [ 'linear' | {'rbf'} ], SVM kernel to use. 'rbf' is default.<br />
* '''svmtype''': [ {'c-svc'} | 'nu-svc' ] Type of SVM to apply. The default is 'c-svc' for classification.<br />
* '''probabilityestimates''': [0| {1} ], whether to train the SVC model for probability estimates, 0 or 1 (default 1).<br />
<br />
* '''cvtimelimit''': Set a time limit (seconds) on individual cross-validation sub-calculation when searching over supplied SVM parameter ranges for optimal parameters. Only relevant if parameter ranges are used for SVM parameters such as cost, epsilon, gamma or nu. Default is 10 (seconds);<br />
* '''splits''': Number of subsets to divide data into when applying n-fold cross validation. Default is 5. This option is only used when the "cvi" option is empty.<br />
* '''cvi''': {{}} Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for error estimate on the final selected parameter values. If empty (the default), then random cross-validation is done based on the "splits" option.<br />
* '''gamma''': Value(s) to use for LIBSVM kernel gamma parameter. Default is 15 values from 10^-6 to 10, spaced uniformly in log.<br />
* '''cost''': Value(s) to use for LIBSVM 'c' parameter. Default is 11 values from 10^-3 to 100, spaced uniformly in log.<br />
* '''nu''': Value(s) to use for LIBSVM 'n' parameter (nu of nu-SVC, and nu-SVR). Default is the set of values [0.2, 0.5, 0.8].<br />
* '''strictthreshold''': Probability threshold value to use in strict class assignment, see [[Sample_Classification_Predictions#Class_Pred_Strict]]. Default = 0.5.<br />
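<br />
The default search ranges described for ''gamma'', ''cost'' and ''nu'' can be written out in base MATLAB as below. This is a sketch based only on the descriptions in the list above, so whether it reproduces the toolbox's internal defaults exactly is an assumption. The coarser grids at the end are a hypothetical choice, and the bare <tt>struct</tt> stands in for a full options structure.<br />
<br />
<pre><br />
gammaGrid = logspace(-6,1,15);     % 15 values from 10^-6 to 10, uniform in log<br />
costGrid  = logspace(-3,2,11);     % 11 values from 10^-3 to 100, uniform in log<br />
nuGrid    = [0.2 0.5 0.8];         % default nu values<br />
<br />
% Hypothetical coarser grids for a faster preliminary search<br />
options = struct();                % placeholder for a full options structure<br />
options.gamma = logspace(-4,0,5);  % 5 values from 10^-4 to 1<br />
options.cost  = logspace(-2,2,5);  % 5 values from 10^-2 to 100<br />
<br />
% Number of parameter combinations tested in each case (RBF kernel, c-svc)<br />
nDefault = numel(gammaGrid)*numel(costGrid);<br />
nCoarse  = numel(options.gamma)*numel(options.cost);<br />
</pre><br />
<br />
Passing single values instead of ranges for these fields avoids the grid search altogether.<br />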
<br />
===Algorithm===<br />
Svmda uses the LIBSVM implementation using the user-specified values for the LIBSVM parameters (see ''options'' above). See [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf] for further details of these options. <br />
<br />
The default SVMDA parameters cost, nu and gamma have value ranges rather than single values. This svm function uses a search over the grid of appropriate parameters using cross-validation to select the optimal SVM parameter values and builds an SVM model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however.<br />
<br />
====Model building performance====<br />
Building a single SVM model can sometimes be slow, especially if the calibration dataset is large. Using ranges for the SVM parameters to search for the optimal parameter combination increases the final model building time significantly. If cross-validation is used the calculation is again increased, possibly dramatically if the number of CV subsets is large. Some suggestions for faster SVM building include: <br />
:1) Turning CV off ("none") during preliminary analyses. This is MUCH faster and cross-validation is still performed using a default "Random Subsets" with 5 data splits and 1 iteration,<br />
:2) Using a coarse grid of SVM parameter values to search over for optimal values, <br />
:3) Choosing the CV method carefully, at least initially. For example, use "Random Subsets" with a small number of data splits (e.g. 5) and a small "Number of Iterations" (e.g. 1).<br />
:4) Using the compression option if the number of variables is large.<br />
<br />
====C-SVC and nu-SVC====<br />
There are two commonly used versions of SVM classification, 'C-SVC' and 'nu-SVC'.<br />
The original SVM formulations for Classification (SVC) used parameter C, [0, inf), to apply a penalty to the optimization for data points which were not correctly separated by the classifying hyperplane. An alternative version of SVM classification was later developed where the C penalty parameter was replaced by a nu, [0,1], parameter which applies a slightly different penalty. The main motivation for the nu version of SVM is that it has a more meaningful interpretation because nu represents an upper bound on the fraction of training samples which are errors (misclassified) and a lower bound on the fraction of samples which are support vectors. Some users feel nu is more intuitive to use than C.<br />
C and nu are just different versions of the penalty parameter. The same optimization problem is solved in either case. Thus it should not matter which form of SVM you use, C versus nu for classification. PLS_Toolbox uses the C version by default since this was the original formulation and is the most commonly used form. For more details on 'nu' SVMs see [http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf]<br />
<br />
The user must provide parameters (or parameter ranges) for SVM classification as:<br />
:*'C-SVC':<br />
::'''C''', (using linear kernel), or<br />
::'''C''', '''gamma''' (using radial basis function kernel),<br />
<br />
:*'nu-SVC':<br />
::'''nu''', (using linear kernel), or<br />
::'''nu''', '''gamma''' (using radial basis function kernel),<br />
<br />
====Class prediction probabilities====<br />
LIBSVM calculates the probabilities of each sample belonging to each possible class if the "Probability Estimates" option is enabled (default setting) in the SVMDA analysis window (or if the ''probabilityestimates'' option is set equal to 1 (default value) in command line usage). The method is explained in [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf], section 8, "Probability Estimates". <br />
PLS_Toolbox provides these probability estimates in model.detail.predprobability or predict.detail.predprobability, which are nsample x nclasses arrays. The columns are the classes, in the order given by model.detail.svm.model.label (or prediction.detail.svm.model.label), where the class values are what was in the input X-block.class{1} or Y-block. These probabilities are used to find the most likely class for each sample and this is saved in pred.pred{2} and model.detail.predictedclass. This is a vector of length equal to the number of samples with values equal to class values (model.detail.class{1}).<br />
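<br />
How the most likely class follows from the probability array can be sketched in base MATLAB. The probabilities and class labels below are made-up values, and the variable names are hypothetical stand-ins for the model fields mentioned above.<br />
<br />
<pre><br />
% Hypothetical probability estimates: 4 samples (rows) by 3 classes (columns)<br />
prob = [0.70 0.20 0.10<br />
        0.10 0.60 0.30<br />
        0.20 0.30 0.50<br />
        0.40 0.35 0.25];<br />
labels = [1 2 5];                 % class values in column order (assumed)<br />
<br />
[pmax,idx] = max(prob,[],2);      % highest probability and its column, per sample<br />
predictedclass = labels(idx)';    % class value of the most probable class, one per sample<br />
</pre><br />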
<br />
====SVMDA Parameters====<br />
<br />
* '''cost''': Cost [0 ->inf] represents the penalty associated with errors. An error is a sample which does not lie on the proper side of the margin for that sample's class. Increasing the cost value causes closer fitting to the calibration/training data and usually a narrower margin width. ''nu'' is not required if ''cost'' is specified.<br />
* '''gamma''': Kernel ''gamma'' parameter controls the shape of the separating hyperplane. Increasing gamma usually increases number of support vectors.<br />
* '''nu''': Nu (0 -> 1] is an alternative parameter for specifying the penalty associated with errors. It indicates a lower bound on the number of support vectors to use, given as a fraction of total calibration samples, and an upper bound on the fraction of training samples which are errors (misclassified). ''cost'' is not required if ''nu'' is specified. There is a constraint on the nu parameter, however, related to the number of training data points in each class. For every class pair, having n1 and n2 data points each, nu must be less than 2*min(n1, n2)/(n1+n2), i.e. nu must be less than the ratio of the smaller class count to the pair average class count. SVMDA automatically checks for this possibility in nu-svc.<br />
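<br />
The constraint on nu can be checked before building a model. The following sketch evaluates the bound 2*min(n1, n2)/(n1+n2) for every class pair and returns the smallest one. It is a plain Python illustration, the function name <code>max_feasible_nu</code> is hypothetical, and it is not the check that SVMDA performs internally.<br />
<br />
```python
from itertools import combinations


def max_feasible_nu(class_counts):
    """Return the tightest pairwise bound 2*min(n1, n2)/(n1 + n2) over all class pairs."""
    bounds = [
        2 * min(n1, n2) / (n1 + n2)
        for n1, n2 in combinations(class_counts, 2)
    ]
    return min(bounds)


# Balanced two-class data (100 samples per class): the bound is 1.
print(max_feasible_nu([100, 100]))

# Unbalanced three-class data: the 150-vs-50 pair gives the tightest bound.
print(max_feasible_nu([150, 50, 100]))
```
<br />
For balanced classes the bound places no practical restriction on nu, while strongly unbalanced class pairs reduce the usable nu range.<br />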
<br />
===Examples of SVMDA models on simple two-class data===<br />
Users of SVMs will usually not pick the values of the SVM parameters cost/nu and gamma directly because there is no simple way to know which values would provide a good model for their data. Instead, they should search over parameter ranges, testing SVM models to find which parameter combination works best for their data, as discussed below. However, it is still useful to understand how these parameters affect the way the SVM works on the data. For this reason we look here at the effects of cost/nu and gamma on a very simple dataset, an x-block of two variables where the data belong to just two classes, to allow visualization of the optimal separating boundary. In practice the user will usually work with multivariate x-block data having more than two variables and data belonging to multiple classes, so will only view the predicted classes versus actual classes and related skill measures, and some details such as the number of support vectors involved.<br />
<br />
The effects of the cost, gamma and nu parameters on SVMDA are examined by applying SVMDA to a simple two-variable (x1,x2) dataset where 100 samples belong to the red class and 100 to the blue class. This is equivalent to an X-block having dimensions 200x2. The data are distributed as three clusters: two red clusters with 50 points each, which lie on either side of a blue cluster which has 100 points. SVMDA attempts to draw a dividing line between these clusters separating the x1 vs x2 domain into red and blue regions. It uses these calibration data points to find the optimal separating decision boundary (hyperplane) with the widest separating margin. Any future test samples will be classified as red or blue according to which side of the separating boundary they occur on. The following images show SVMDA classification models trained on these data using an RBF kernel and varying values for the cost, gamma and nu parameters. Note that an SVMDA model with a linear kernel cannot be a good model for this dataset since the red and blue points cannot be separated by a straight-line (linear) boundary.<br />
<br />
<gallery caption="Fig. 1. Two-class dataset" widths="400px" heights="300px" perrow="1"><br />
File:Two_class_data.png|Two-variable data with 100 red samples and 100 blue samples.<br />
</gallery><br />
<br />
The figures below show results for various SVMDA models built on the simple dataset. They are presented with the decision boundary shown as a black contour line, the margin edges shown by blue and red contours, data points which are support vectors marked by an enclosing circle, and data points which lie on the wrong side of the decision boundary (classification errors) marked with an 'x'. The decision boundary represents the zero contour of the decision function, blue and red margin edges represent the -1 and +1 contours of the decision function.<br />
<br />
====Effect of varying cost parameter for SVMDA using RBF kernel====<br />
Fig. 2a-d show the effect of increasing the cost parameter from 0.1 to 100 while gamma is held fixed at 0.01. When the cost is small (Fig. 2a), the margin is wide since there is only a small penalty associated with data points which are within the margin. Note that any point which lies within the margin or on the wrong side of the decision boundary is a support vector. Increasing the cost parameter leads to a narrowing of the margin width and fewer data points remaining within the margin, until cost = 100 (Fig. 2d) where the margin is narrow enough to avoid having any points remain inside it. Further increases in cost have no effect on the margin since no data points remain to be penalized. At the other extreme, when cost is reduced to 0.01 or smaller, the margin expands until it encloses all the data points, so all points are support vectors. This is undesirable since a model with fewer support vectors is more efficient when predicting for new data points and is less likely to overfit the data. In this simple case, the separating boundary keeps approximately the same smooth contour as in Fig. 2a for all these cost values, so overfitting is not an issue. If there were more overlap between the red and blue data points then a larger cost parameter would cause the separating boundary to deform more and the margin edges to become much more contorted as the SVM tries to exclude data points from the margin.<br />
<br />
<br />
<gallery caption="Fig. 2. Effect of varying ''cost'' parameter, with ''gamma'' = 0.01" widths="400px" heights="300px" perrow="2"><br />
File:C0p1g0p01.png|a) ''cost = 0.1''<br />
File:C1g0p01.png|b) ''cost = 1.0''<br />
File:C10g0p01.png|c) ''cost = 10''<br />
File:C100g0p01.png|d) ''cost = 100''<br />
</gallery><br />
<br />
====Effect of varying gamma parameter for SVMDA using RBF kernel====<br />
Fig. 3a-f show the effect of changing the gamma parameter while cost is held fixed at 1.0. These show that gamma has a major effect on how smooth or contorted the decision boundary will be, with smaller values of gamma creating a smoother decision boundary. Fig. 3a shows the decision boundary to be nearly linear, showing that the SVM with RBF kernel tends to the linear kernel solution as gamma tends towards zero. At large gamma values, however, the decision boundary becomes more contorted and shows how the SVM can over-fit the calibration data. The SVM in Fig. 3f produces a decision boundary which would not be a good predictor of the class of new test samples.<br />
<br />
<br />
<gallery caption="Fig. 3. Effect of varying ''gamma'' parameter, with ''cost'' = 1.0" widths="400px" heights="300px" perrow="2"><br />
File:C1g0p0001.png|a) ''gamma = 0.0001''<br />
File:C1g0p001.png|b) ''gamma = 0.001''<br />
File:C1g0p01.png|c) ''gamma = 0.01''<br />
File:C1g0p1.png|d) ''gamma = 0.1''<br />
File:C1g1.png|e) ''gamma = 1.0''<br />
File:C1g10.png|f) ''gamma = 10.0''<br />
</gallery><br />
<br />
<br />
In summary, these comparisons show that the gamma parameter controls how smooth the decision boundary will be, with larger gamma producing more complicated boundaries, while the cost parameter controls the width of the separating margin, with larger values of cost making the margin narrower. They both affect the location of the decision boundary.<br />
<br />
====Effect of varying nu parameter for SVMDA using RBF kernel====<br />
Fig. 4a-d show the effect of decreasing the nu parameter from 0.5 to 0.01 while gamma is held fixed at 0.01. These figures show that decreasing nu has the same effect as was obtained by increasing the cost parameter, that is, it causes the margin width to decrease. This shows how nu is simply a different representation of the cost penalty parameter, and for any value of nu there is a corresponding value of cost which produces the same SVM. The reason for using nu is that its value can be interpreted as a lower bound on the fraction of samples which are support vectors, and also as an upper bound on the fraction of misclassification errors.<br />
<br />
<br />
<gallery caption="Fig. 4. Effect of varying ''nu'' parameter, with ''gamma'' = 0.01" widths="400px" heights="300px" perrow="2"><br />
File:N0p5g0p01.png|a) ''nu= 0.5''<br />
File:N0p1g0p01.png|b) ''nu = 0.1''<br />
File:N0p02g0p01.png|c) ''nu = 0.02''<br />
File:N0p01g0p01.png|d) ''nu = 0.01''<br />
</gallery><br />
<br />
<br />
{| class="wikitable" border="1" style="text-align:center; width:40%;"<br />
|+ Table 1. Compare nu value to SV fraction<br />
! nu value!! SV fraction !! number of SVs<br />
|-<br />
| 0.5 || 0.505 || 101<br />
|-<br />
| 0.1 || 0.105 || 24<br />
|-<br />
| 0.02 || 0.045 || 9<br />
|-<br />
| 0.01 || 0.020 || 4<br />
|}<br />
Table 1 shows how the value of nu is a lower bound on the support vector fraction (number of SV/200), and an upper bound on the fraction of training samples which are errors (misclassified) for the SVMs in Fig. 4. The upper bound on the fraction of misclassifications is easily satisfied here because the only misclassifications were three data points in Fig. 4a.<br />
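<br />
The two bounds can be verified directly from the numbers quoted above. The sketch below is plain Python, uses the nu values and SV fractions from Table 1 together with the three misclassified points of Fig. 4a (3/200), and assumes zero misclassifications for the other three models, as stated in the text. The function name <code>satisfies_nu_bounds</code> is hypothetical.<br />
<br />
```python
def satisfies_nu_bounds(nu, sv_fraction, error_fraction):
    """nu must be a lower bound on the SV fraction and an upper bound on the error fraction."""
    return error_fraction <= nu <= sv_fraction


# (nu, SV fraction from Table 1, fraction of misclassified training samples)
rows = [
    (0.5, 0.505, 3 / 200),  # Fig. 4a: three misclassified points out of 200
    (0.1, 0.105, 0.0),
    (0.02, 0.045, 0.0),
    (0.01, 0.020, 0.0),
]

for nu, sv_fraction, error_fraction in rows:
    print(nu, satisfies_nu_bounds(nu, sv_fraction, error_fraction))
```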
<br />
===Choosing the best SVM parameters===<br />
The recommended technique is to repeatedly test SVMDA using different parameter values and select the parameter combination which gives the best results. For SVMDA using c-svc/nu-svc and an RBF kernel we select ranges of the c/nu and gamma parameters, choosing equi-spaced (or equi-spaced in log) parameters over the ranges. SVMDA using c-svc uses 9 values of c between 0.001 and 100, and 9 values of gamma between 10^-6 and 10 by default, then tests each of these 81 pair combinations. Each test builds a c-svc model on the calibration data using 5-fold cross-validation to produce a mis-classification rate result for that test. These tests are compared over all 81 tests to find which cost/gamma value combination gives the best cross-validation prediction (smallest mis-classification). A similar approach is used for nu-svc where values of nu and gamma are specified.<br />
The results for the best model when using the simple data in Fig. 1 are shown here in Fig. 5 for the c-svc and nu-svc cases. These models were selected by searching over the default parameter ranges for the optimal model. Note, the nu parameter range was extended to smaller values than the default nu range, to include 0.05 and 0.1.<br />
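<br />
The search described above can be sketched as a plain grid search. This is an illustration in Python, not the PLS_Toolbox implementation. The grids below are assumed to be equi-spaced in log between the stated end points (the actual default values may differ), and <code>toy_cv_error</code> is a hypothetical stand-in for the 5-fold cross-validated misclassification rate of a c-svc model.<br />
<br />
```python
import itertools
import math


def log_grid(low, high, count):
    """Return `count` values equi-spaced in log10 between low and high (inclusive)."""
    step = (math.log10(high) - math.log10(low)) / (count - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(count)]


def grid_search(costs, gammas, cv_error):
    """Evaluate cv_error on every (cost, gamma) pair and return the pair with the smallest error."""
    return min(itertools.product(costs, gammas), key=lambda pair: cv_error(*pair))


def toy_cv_error(cost, gamma):
    # Hypothetical error surface with its minimum near cost = 1, gamma = 0.01.
    return math.log10(cost) ** 2 + (math.log10(gamma) + 2) ** 2


costs = log_grid(1e-3, 1e2, 9)    # 9 values between 0.001 and 100
gammas = log_grid(1e-6, 1e1, 9)   # 9 values between 10^-6 and 10
combinations = list(itertools.product(costs, gammas))
print(len(combinations))          # number of (cost, gamma) pairs tested

best_cost, best_gamma = grid_search(costs, gammas, toy_cv_error)
print(best_cost, best_gamma)
```
<br />
In a real search each call to the error function would build and cross-validate one SVM, so the cost of the search grows with the product of the two grid sizes.<br />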
<br />
<br />
<gallery caption="Fig. 5. Optimal SVMDA models for ''c-svc'' and ''nu-svc''" widths="400px" heights="300px" perrow="2"><br />
File:Csvc_opt.png|a) ''Optimal c-svc model. cost = 0.001, gamma = 0.03''<br />
File:Nusvc_opt.png|b) ''Optimal nu-svc model. nu = 0.05, gamma = 0.003''<br />
</gallery><br />
<br />
<br />
The c-svc case in Fig. 5a has a very small cost parameter and all data points are support vectors. The decision boundary looks appropriate but this is not a good solution because of the large support vector fraction. Using an SVM to predict the class of a new sample involves calculating a sum over as many terms as there are support vectors, so an SVM with fewer support vectors will be faster when predicting the class of a new sample. It would therefore be good to limit the lower end of the cost parameter range, perhaps to 0.1. It should also be noted that SVMDA can have problems when using a very small cost parameter (or nu very close to 1) while requesting ''probability estimates'', as this can result in bad model predictions for sample class. This problem does not arise when probability estimates are not requested. The next section discusses this problem in more detail. Note that all the models presented in Figs 1-5 were built with ''probability estimates'' disabled. Thus predictions are directly given by which side of the decision boundary the data points lie on.<br />
<br />
====SVM parameter search summary plot====<br />
When SVMDA is run in the Analysis window it is possible to view the results of the SVM parameter search by clicking on the "Variance Captured" plot icon in the toolbar. If there are two SVM parameters with ranges of values, such as cost and gamma, then a figure appears showing the performance of the model plotted against cost and gamma (Fig. 6). The measure of performance used is the misclassification rate, defined as the number of incorrectly classified samples divided by the number of classified samples, based on the cross-validation (CV) predictions for the calibration data. The lowest value of misclassification rate is marked on the plot by an "X" and this indicates the values of the SVM cost and gamma parameters which yield the best performing model. The actual SVMDA model is built using these parameter values. If you are using the command line SVMDA function to build a model then the optimal SVM parameters are shown in model.detail.svm.cvscan.best. If you are using the graphical Analysis SVMDA then the optimal parameters are reported in the summary window which is shown when you mouse-over the model icon, once the model is built.<br />
<br />
If the parameter search summary plot has the "X" marked on the edge of the plot (as in the example shown) then it is possible that re-running the analysis with additional values included for that parameter direction would lead to a more accurate optimal parameter set. For the example shown, this would suggest re-running the analysis with the Cost parameter range including values larger than 100. (However, it is unnecessary in this case since the misclassification error is already zero). Ideally the "X" should occur in the interior of the plot.<br />
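<br />
The misclassification rate and the "X on the edge" check can be sketched as follows. This is plain Python for illustration only: the function names and the example error grids are hypothetical, and the grids are simply nested lists indexed by cost (rows) and gamma (columns).<br />
<br />
```python
def misclassification_rate(actual, predicted):
    """Number of incorrectly classified samples divided by the number of classified samples."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def best_on_edge(error_grid):
    """Locate the smallest error in a 2-D grid and report whether it lies on the grid edge."""
    n_rows, n_cols = len(error_grid), len(error_grid[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    i, j = min(cells, key=lambda cell: error_grid[cell[0]][cell[1]])
    on_edge = i in (0, n_rows - 1) or j in (0, n_cols - 1)
    return (i, j), on_edge


# Hypothetical CV error grids (rows: cost values, columns: gamma values).
interior = [[0.30, 0.20, 0.30],
            [0.20, 0.05, 0.20],
            [0.30, 0.20, 0.30]]
edge = [[0.30, 0.20, 0.10],
        [0.20, 0.10, 0.02],
        [0.30, 0.20, 0.10]]

print(best_on_edge(interior))  # minimum in the interior: the ranges look adequate
print(best_on_edge(edge))      # minimum on an edge: consider extending that parameter range
```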
<br />
<gallery caption="Fig. 6. Parameter search summary" widths="450px" heights="300px" perrow="1"><br />
File:Svmda_paramsearch.png|CV misclassification rate as a function of SVM parameters.<br />
</gallery><br />
<br />
===Possible poor prediction from the optimal SVM model===<br />
In support vector classification (SVC) the LIBSVM package allows classification predictions to be derived two different ways.<br />
<br />
1. The standard method is to calculate the decision function for the new sample and simply assign the class label according to the sign of the decision function (in the case of two-class data). This is equivalent to saying the sample's class is determined by which side of the decision boundary it occurs on.<br />
<br />
2. The second method to predict the class of a new sample was developed in order to also provide probabilities of the sample belonging to each possible class ([http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf], section 8, "Probability Estimates"). In this method the new sample is assigned to the class to which it has the highest probability of belonging.<br />
<br />
These two prediction methods produce nearly identical predicted class values but in certain cases there are noticeable differences. Test samples which lie very close to the decision boundary on the +1 side, for example, can be given a predicted class by the second method which identifies them incorrectly as the -1 class.<br />
This discrepancy between the two prediction methods becomes most noticeable when the SVM margin becomes very wide and encloses most data points (which are then support vectors). For the simple two-class data used here this is illustrated in Fig. 7 below by comparing the two prediction methods using a very small cost (or large nu) parameter; the effect occurs for any gamma value. Again the color indicates the actual class of each data point and a superimposed ''x'' indicates that the predicted class is incorrect for that point. The decision boundary looks reasonable and the simpler method of identifying class by which side of the decision boundary samples occur on gives good predictions (Fig. 7a, no data points have a superimposed ''x''). The second method, Fig. 7b, completely fails, however, and predicts all samples as belonging to one class (red points are correctly predicted as red, while all blue points are marked with an ''x'' indicating they are predicted incorrectly as being red). One approach to avoid such poor SVMs is to not use SVMs where most calibration samples are support vectors (i.e. the margin is very wide relative to the calibration dataset). The support vector fraction can only be checked after building the SVM, however. The problem can be avoided up front by not using a very small cost parameter value with c-svc (or a very large nu parameter value with nu-svc) when the ''Probability Estimates'' prediction method is used. (The nu value is a lower bound on the support vector fraction and in practice the actual support vector fraction turns out to be only slightly larger than the nu bounding value. Limiting nu to be 0.9 or smaller should avoid this problem. This is equivalent to using c-svc and using larger values for cost.)<br />
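<br />
How the two prediction rules can disagree is easiest to see for two-class data, where the probability estimate is a fitted sigmoid of the decision value f, P(+1 | f) = 1/(1 + exp(A*f + B)). The sketch below is plain Python for illustration: the sigmoid coefficients A and B are hypothetical values chosen to show the effect, not values fitted by LIBSVM on the example data.<br />
<br />
```python
import math


def predict_by_sign(decision_value):
    """Method 1: class is given by the side of the decision boundary."""
    return 1 if decision_value > 0 else -1


def sigmoid_probability(decision_value, a, b):
    """Probability of the +1 class from a fitted sigmoid of the decision value."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def predict_by_probability(decision_value, a, b):
    """Method 2: class with the highest estimated probability."""
    return 1 if sigmoid_probability(decision_value, a, b) >= 0.5 else -1


a, b = -2.0, 0.5  # hypothetical sigmoid coefficients; a nonzero b shifts the 0.5 crossing away from f = 0

for f in (0.1, 1.0):
    print(f, predict_by_sign(f), predict_by_probability(f, a, b))
```
<br />
With these coefficients the probability crosses 0.5 at f = 0.25 rather than at f = 0, so a sample with decision value 0.1 is assigned to +1 by the sign rule and to -1 by the probability rule, while a sample far from the boundary gets the same class from both.<br />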
<br />
<br />
<gallery caption="Fig. 7. Effect of enabling ''probability estimates'' for ''c-svc'' SVMDA" widths="400px" heights="300px" perrow="2"><br />
File:probEstsOff.png|a) ''Good c-svc model without prob. estimates. cost = 0.001, gamma = 0.01''<br />
File:probEstsOn.png|b) ''Bad c-svc model with prob. estimates. cost = 0.001, gamma = 0.01''<br />
</gallery><br />
<br />
===See Also===<br />
<br />
[[analysis]], [[svm]], [[plsda]], [[knn]], [[simca]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Xgbda&diff=10986Xgbda2020-01-03T22:42:21Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Gradient Boosted Tree Ensemble for classification (Discriminant Analysis) using XGBoost.<br />
<br />
===Synopsis===<br />
<br />
: model = xgbda(x,options); %identifies model using classes in x<br />
: model = xgbda(x,y,options); %identifies model using y for classes<br />
: pred = xgbda(x,model,options); %makes predictions with a new X-block<br />
: valid = xgbda(x,y,model,options); %performs a "test" call with a new X-block with known y-classes <br />
<br />
Please note that the recommended way to build an XGBoost Gradient Boosted Tree Ensemble classification model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
XGB performs calibration and application of gradient boosted decision tree models for classification. These are non-linear models which predict the probability of a test sample belonging to each of the modeled classes, hence they predict the class of a test sample.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset".<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset". If omitted in a calibration call, the x-block must be a dataset object with classes in the first mode (samples). y can always be omitted in a prediction call (when a model is passed). If y is omitted in a prediction call, x will be checked for classes. If found, these classes will be assumed to be the ones corresponding to the model.<br />
* '''model''' = previously generated model (when applying model to new data)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the xgboost model (see [[Standard Model Structure]]). Feature scores are contained in model.detail.xgb.featurescores.<br />
* '''pred''' = structure array with predictions<br />
* '''valid''' = structure array with predictions<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''waitbar''': [ 'off' | {'on'} ] governs display of waitbar during optimization and predictions.<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'xgboost' ] algorithm to use. xgboost is default and currently only option.<br />
* '''classset''' : [ 1 ] indicates which class set in x to use when no y-block is provided.<br />
* '''xgbtype''' : [ 'xgbr' | {'xgbc'} ] Type of XGB to apply. Default is 'xgbc' for classification, and 'xgbr' for regression. <br />
* '''compression''' : [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the XGB model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the xgbtype). Compression can make the XGB more stable and less prone to overfitting.<br />
* '''compressncomp''' : [ 1 ] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''' : [ 'no' |{'yes'}] Use Mahalanobis Distance corrected scores from compression model.<br />
* '''cvi''' : { { 'rnd' 5 } } Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for error estimate on the final selected parameter values. Alternatively, can be a vector with the same number of elements as x has rows with integer values indicating CV subsets (see crossval).<br />
* '''eta''' : Value(s) to use for XGBoost 'eta' parameter. Eta controls the learning rate of the gradient boosting. Values in range (0,1]. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [0.1, 0.3, 0.5].<br />
* '''max_depth''' : Value(s) to use for XGBoost 'max_depth' parameter. Specifies the maximum depth allowed for the decision trees. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 6 values [1 2 3 4 5 6].<br />
* '''num_round''' : Value(s) to use for XGBoost 'num_round' parameter. Specifies how many rounds of tree creation to perform. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [100 300 500].<br />
<br />
* '''strictthreshold''' : [0.5] Probability threshold for assigning a sample to a class. Affects model.classification.inclass.<br />
* '''predictionrule''' : [ {'mostprobable'} | 'strict' ] governs which classification prediction statistics appear first in the confusion matrix and confusion table summaries.<br />
<br />
===Algorithm===<br />
Xgbda is implemented using the [https://xgboost.readthedocs.io XGBoost] package. User-specified values are used for XGBoost parameters (see ''options'' above). See [https://xgboost.readthedocs.io/en/latest/parameter.html XGBoost Parameters] for further details of these options. <br />
<br />
The default XGBDA parameters eta, max_depth and num_round have value ranges rather than single values. This xgbda function uses a search over the grid of appropriate parameters using cross-validation to select the optimal XGBoost parameter values and builds an XGBDA model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however.<br />
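<br />
The size of the default search grid follows directly from the default ranges listed under ''options''. The sketch below is plain Python for illustration, not the xgbda implementation, and the helper name <code>expand_grid</code> is hypothetical. It also shows how passing single values collapses the grid to one combination, which is how the grid search is avoided.<br />
<br />
```python
from itertools import product


def expand_grid(eta, max_depth, num_round):
    """Return every parameter combination; a scalar is treated as a one-value range."""
    def as_list(value):
        return list(value) if isinstance(value, (list, tuple)) else [value]

    return [
        {"eta": e, "max_depth": d, "num_round": r}
        for e, d, r in product(as_list(eta), as_list(max_depth), as_list(num_round))
    ]


# Default ranges quoted in the options list above.
default_grid = expand_grid([0.1, 0.3, 0.5], [1, 2, 3, 4, 5, 6], [100, 300, 500])
print(len(default_grid))   # 3 * 6 * 3 combinations

# Single values for all three parameters: one combination, so no search is needed.
single = expand_grid(0.3, 4, 300)
print(single)
```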
<br />
===Choosing the best XGBDA parameters===<br />
The recommended technique is to repeatedly test XGBDA using different parameter values and select the parameter combination which gives the best results. XGBDA searches over ranges of parameters eta, max_depth, and num_round, by default. The actual values tested can be specified by the user by setting the associated parameter option value. Each test builds an XGBDA model on the calibration data using cross-validation to produce a mis-classification rate result for that test. These tests are compared over all tested parameter combinations to find which combination gives the best cross-validation prediction (smallest mis-classification). The XGBDA model is then built using the optimal parameter setting.<br />
<br />
====XGBDA parameter search summary plot====<br />
When XGBDA is run in the Analysis window it is possible to view the results of the XGBDA parameter search by clicking on the "Variance Captured" plot icon in the toolbar. If at least two XGB parameters were initialized with parameter ranges, for example eta and max_depth, then a figure appears showing the performance of the model plotted against eta and max_depth (Fig. 1). The measure of performance used is the misclassification rate, defined as the number of incorrectly classified samples divided by the number of classified samples, based on the cross-validation (CV) predictions for the calibration data. The lowest value of misclassification rate is marked on the plot by an "X" and this indicates the values of the XGBDA eta and max_depth parameters which yield the best performing model. The actual XGBDA model is built using these parameter values. If all three parameters, eta, max_depth, and num_round, have ranges of values then you can view the classification performance over the other variables' ranges by clicking on the blue horizontal arrow toolbar icon above the plot. In Analysis XGBDA the optimal parameters are also reported in the model summary window which is shown when you mouse-over the model icon, once the model is built. If you are using the command line XGBDA function to build a model then the optimal XGBDA parameters are shown in model.detail.xgb.cvscan.best. <br />
<gallery caption="Fig. 1. Parameter search summary" widths="450px" heights="300px" perrow="1"><br />
File:Xgbda_survey.png|Misclassification as a function of XGB parameters.<br />
</gallery><br />
<br />
===Variable Importance Plot===<br />
The ease of interpreting single decision trees is lost when a sequence of boosted trees is used, as in XGBoost. One commonly used diagnostic quantity for interpreting boosted trees is the "feature importance", or "variable importance" in PLS_Toolbox terminology. This is a measure of each variable's importance to the tree ensemble construction. It is calculated for each variable by summing up the “gain” on each node where that variable was used for splitting, over all trees in the sequence. "gain" refers to the reduction in the loss function being optimized. The important variables are shown in the XGBDA Analysis window when the model is built, ranked by their importance (Fig. 2). <br />
<gallery caption="Fig. 2. Variable importance plot" widths="450px" heights="300px" perrow="1"><br />
File:Xgbda_varimp.png|XGBDA variable importance. Right-click in the plot area to copy the indices of the important variables. Clicking on the "Plot" button opens a version of the plot which can be zoomed or panned.<br />
</gallery><br />
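<br />
The gain-based importance described above amounts to a sum over all splitting nodes in all trees. The sketch below is plain Python for illustration and does not call XGBoost: each tree is represented as a hypothetical list of (variable index, gain) pairs, one per splitting node, and the gain values are made up.<br />
<br />
```python
from collections import defaultdict


def variable_importance(trees):
    """Sum the gain of every split per variable over all trees, ranked from most to least important."""
    totals = defaultdict(float)
    for tree in trees:
        for variable, gain in tree:
            totals[variable] += gain
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical boosted trees; each entry is (variable index, gain at that split).
trees = [
    [(0, 4.0), (2, 1.5)],
    [(0, 2.0), (1, 0.5), (2, 1.0)],
]
print(variable_importance(trees))
```
<br />
A variable that is never used for splitting does not appear in the ranking at all, which is why only the important variables are listed.<br />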
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[knn]], [[lwr]], [[pls]], [[plsda]], [[xgb]], [[xgbengine]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Xgb&diff=10985Xgb2020-01-03T22:40:48Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Gradient Boosted Tree Ensemble for regression using XGBoost.<br />
<br />
===Synopsis===<br />
<br />
:model = xgb(x,y,options); %identifies model (calibration step)<br />
:pred = xgb(x,model,options); %makes predictions with a new X-block<br />
:valid = xgb(x,y,model,options); %performs a "test" call with a new X-block and known y-values<br />
<br />
Please note that the recommended way to build an XGBoost Gradient Boosted Tree Ensemble regression model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
To choose between regression and classification, use the xgbtype option:<br />
:: regression : xgbtype = 'xgbr'<br />
:: classification : xgbtype = 'xgbc'<br />
It is recommended that classification be done through the xgbda function.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset",<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset",<br />
* '''model''' = previously generated model (when applying model to new data)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the xgboost model (see [[Standard Model Structure]]). Feature scores are contained in model.detail.xgb.featurescores.<br />
* '''pred''' = structure array with predictions<br />
* '''valid''' = structure array with predictions<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''waitbar''': [ 'off' | {'on'} ] governs display of waitbar during optimization and predictions.<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'xgboost' ] algorithm to use. xgboost is default and currently only option.<br />
* '''classset''' : [ 1 ] indicates which class set in x to use when no y-block is provided.<br />
* '''xgbtype''' : [ {'xgbr'} | 'xgbc' ] Type of XGB to apply: 'xgbr' (the default here) for regression, or 'xgbc' for classification. <br />
* '''compression''' : [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the XGB model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the xgbtype). Compression can make the XGB more stable and less prone to overfitting.<br />
* '''compressncomp''' : [ 1 ] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''' : [ 'no' |{'yes'}] Use Mahalanobis Distance corrected scores from compression model.<br />
* '''cvi''' : { { 'rnd' 5 } } Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for error estimate on the final selected parameter values. Alternatively, can be a vector with the same number of elements as x has rows with integer values indicating CV subsets (see crossval).<br />
* '''eta''' : Value(s) to use for XGBoost 'eta' parameter. Eta controls the learning rate of the gradient boosting. Values in range (0,1]. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [0.1, 0.3, 0.5].<br />
* '''max_depth''' : Value(s) to use for XGBoost 'max_depth' parameter. Specifies the maximum depth allowed for the decision trees. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 6 values [1 2 3 4 5 6].<br />
* '''num_round''' : Value(s) to use for XGBoost 'num_round' parameter. Specifies how many rounds of tree creation to perform. Using a single value specifies the value to use. Using a range of values specifies the parameters to search over to find the optimal value. Default is 3 values [100 300 500].<br />
<br />
===Algorithm===<br />
Xgb is implemented using the [https://xgboost.readthedocs.io XGBoost] package. User-specified values are used for XGBoost parameters (see ''options'' above). See [https://xgboost.readthedocs.io/en/latest/parameter.html XGBoost Parameters] for further details of these options. <br />
<br />
The default XGB parameters eta, max_depth and num_round have value ranges rather than single values. This xgb function uses a search over the grid of appropriate parameters using cross-validation to select the optimal XGBoost parameter values and builds an XGB model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however.<br />
<br />
===Choosing the best XGB parameters===<br />
The recommended technique is to repeatedly test XGB using different parameter values and select the parameter combination which gives the best results. By default, XGB searches over ranges of the parameters eta, max_depth, and num_round. The actual values tested can be specified by the user by setting the associated parameter option value. Each test builds an XGB model on the calibration data using cross-validation to produce a root mean square error of cross-validation (RMSECV) result for that test. These results are compared over all tested parameter combinations to find which combination gives the best cross-validation prediction (smallest RMSECV). The XGB model is then built using the optimal parameter setting.<br />
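<br />
The selection step described above can be sketched in plain MATLAB. This is only an illustration and not the toolbox's implementation: the RMSECV surface is made up, and the variable names (rmsecv, best, ie, id, ir) are hypothetical. Only the default parameter grids come from the options listed above.<br />

```matlab
% Default search grids from the options above
eta       = [0.1 0.3 0.5];
max_depth = 1:6;
num_round = [100 300 500];

% Hypothetical RMSECV surface, one value per parameter combination.
% In practice each entry would come from cross-validating an XGB model.
[E, D, R] = ndgrid(eta, max_depth, num_round);
rmsecv = (E - 0.3).^2 + 0.01*(D - 4).^2 + 1e-6*(R - 300).^2 + 0.5;

% Pick the combination with the smallest RMSECV
[best_rmsecv, idx] = min(rmsecv(:));
[ie, id, ir] = ind2sub(size(rmsecv), idx);
best = struct('eta', eta(ie), 'max_depth', max_depth(id), 'num_round', num_round(ir));
```

With the default grids the search covers 3 x 6 x 3 = 54 parameter combinations, each of which requires a full cross-validation.<br />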
<br />
====XGB parameter search summary plot====<br />
When XGB is run in the Analysis window it is possible to view the results of the XGB parameter search by clicking on the "Variance Captured" plot icon in the toolbar. If at least two XGB parameters were initialized with parameter ranges, for example eta and max_depth, then a figure appears showing the performance of the model plotted against eta and max_depth (Fig. 1). The measure of performance used is the root mean square error based on the cross-validation predictions for the calibration data (RMSECV). The lowest value of RMSECV is marked on the plot by an "X" and this indicates the values of the XGB eta and max_depth parameters which yield the best performing model. The actual XGB model is built using these parameter values. If all three parameters, eta, max_depth, and num_round, have ranges of values then you can view the prediction performance over the other variables' ranges by clicking on the blue horizontal arrow toolbar icon above the plot. In Analysis XGB the optimal parameters are also reported in the model summary window which is shown when you mouse-over the model icon, once the model is built. If you are using the command line XGB function to build a model then the optimal XGB parameters are shown in model.detail.xgb.cvscan.best. <br />
<br />
<gallery caption="Fig. 1. Parameter search summary" widths="450px" heights="300px" perrow="1"><br />
File:Xgb_survey.png|RMSECV as a function of XGB parameters.<br />
</gallery><br />
<br />
===Variable Importance Plot===<br />
The ease of interpreting single decision trees is lost when a sequence of boosted trees is used, as in XGBoost. One commonly used diagnostic quantity for interpreting boosted trees is the "feature importance", or "variable importance" in PLS_Toolbox terminology. This is a measure of each variable's importance to the tree ensemble construction. It is calculated for each variable by summing up the "gain" on each node where that variable was used for splitting, over all trees in the sequence. Here "gain" refers to the reduction in the loss function being optimized.<br />
The important variables are shown in the XGB Analysis window when the model is built, ranked by their importance (Fig. 2). <br />
<gallery caption="Fig. 2. Variable importance plot" widths="450px" heights="300px" perrow="1"><br />
File:Xgbda_varimp.png|XGB variable importance. Right-click in the plot area to copy the indices of the important variables. Clicking on the "Plot" button opens a version of the plot which can be zoomed or panned.<br />
</gallery><br />
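<br />
The gain summation described above can be sketched as follows. The split records are made-up values and the names split_var, split_gain, and importance are hypothetical; how XGBoost stores or normalizes these quantities internally is not shown here.<br />

```matlab
% Hypothetical split records collected over all trees in the sequence:
% split_var(n)  = index of the variable used at split node n
% split_gain(n) = reduction in the loss function achieved by that split
split_var  = [1 3 3 2 1 3];
split_gain = [0.5 2.0 1.0 0.25 0.5 1.5];
nvars = 4;   % variable 4 is never used for splitting

% Sum the gain of every node where each variable was used
importance = accumarray(split_var(:), split_gain(:), [nvars 1]);

% Rank variables from most to least important
[sorted_importance, ranked_vars] = sort(importance, 'descend');
```

A variable that is never selected for a split, such as variable 4 here, receives an importance of zero.<br />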
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[knn]], [[lwr]], [[pls]], [[plsda]], [[xgbda]], [[xgbengine]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Svm&diff=10984Svm2020-01-03T22:31:45Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
SVM Support Vector Machine (LIBSVM) for regression. Use SVMDA for SVM classification ([[Svmda]]). Please also look at the [[Svmda]] page since it has more detailed information much of which also applies to SVM for regression.<br />
<br />
===Synopsis===<br />
<br />
:model = svm(x,y,options); %identifies model (calibration step).<br />
:pred = svm(x,model,options); %makes predictions with a new X-block<br />
:pred = svm(x,y,model,options); %performs a "test" call with a new X-block and known y-values<br />
:svm % Launches an Analysis window with SVM as the selected method.<br />
<br />
Please note that the recommended way to build an SVM model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
The SVM function or analysis method performs calibration and application of Support Vector Machine (SVM) regression models. SVM models can be used for regression problems. The model consists of a number of support vectors (essentially samples selected from the calibration set) and non-linear model coefficients which define the non-linear mapping of variables in the input x-block. The model allows prediction of the continuous y-block variable. It is recommended that classification be done through the svmda function.<br />
<br />
Svm is implemented using the LIBSVM package which provides both epsilon-support vector regression (epsilon-SVR) and nu-support vector regression (nu-SVR). Linear and Gaussian Radial Basis Function kernel types are supported by this function.<br />
<br />
Note: Calling svm with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing numeric values,<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'SVM',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array)<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.svm.model: Matlab version of the libsvm svm_model (Java). Note that the number of support vectors used is given by model.detail.svm.model.l. It is useful to check this because it can indicate overfitting if most of the calibration samples are used as support vectors, or can indicate problems fitting a model if there are no support vectors (in which case all prediction values will equal a constant value, a weighted mean).<br />
*** model.detail.svm.cvscan: Results of CV parameter scan<br />
*** model.detail.svm.svindices: Indices of X-block samples which are support vectors.<br />
<br />
* '''pred''' = a structure, similar to '''model''', for the new data.<br />
<br />
===Options===<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''' [ 'none' | {'final'} ], governs level of plotting,<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the SVM model. 'pca' uses a simple PCA model to compress the information. 'pls' uses either a pls or plsda model (depending on the svmtype). Compression can make the SVM more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.<br />
* '''algorithm''': [ 'libsvm' ] algorithm to use. libsvm is default and currently only option.<br />
* '''kerneltype''': [ 'linear' | {'rbf'} ], SVM kernel to use. 'rbf' is default.<br />
* '''svmtype''': [ {'epsilon-svr'} | 'nu-svr' ] Type of SVM to apply. The default is 'epsilon-svr' for regression.<br />
* '''probabilityestimates''': [0| {1} ], whether to train the SVR model for probability estimates, 0 or 1 (default 1).<br />
<br />
* '''cvtimelimit''': Set a time limit (seconds) on individual cross-validation sub-calculation when searching over supplied SVM parameter ranges for optimal parameters. Only relevant if parameter ranges are used for SVM parameters such as cost, epsilon, gamma or nu. Default is 10;<br />
* '''splits''': Number of subsets to divide data into when applying n-fold cross validation. Default is 5. This option is only used when the "cvi" option is empty.<br />
* '''cvi''': {{}} Standard cross-validation cell (see crossval) defining a split method, number of splits, and number of iterations. This cross-validation is used both for parameter optimization and for error estimate on the final selected parameter values. If empty (the default), then random cross-validation is done based on the "splits" option.<br />
<br />
* '''gamma''': Value(s) to use for LIBSVM kernel gamma parameter. Default is 15 values from 10^-6 to 10, spaced uniformly in log.<br />
* '''cost''': Value(s) to use for LIBSVM 'c' parameter. Default is 11 values from 10^-3 to 100, spaced uniformly in log.<br />
* '''epsilon''': Value(s) to use for LIBSVM 'p' parameter (epsilon in loss function). Default is the set of values [1.0, 0.1, 0.01].<br />
* '''nu''': Value(s) to use for LIBSVM 'n' parameter (nu of nu-SVC, and nu-SVR). Default is the set of values [0.2, 0.5, 0.8].<br />
<br />
===Algorithm===<br />
Svm uses the LIBSVM implementation using the user-specified values for the LIBSVM parameters (see ''options'' above). See [http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf] for further details of these options. <br />
<br />
The default SVM parameters cost, epsilon, nu and gamma have value ranges rather than single values. This svm function uses a search over the grid of appropriate parameters using cross-validation to select the optimal SVM parameter values and builds an SVM model using those values. This is the recommended usage. The user can avoid this grid-search by passing in single values for these parameters, however. If you are using the command line SVM function to build a model then the optimal SVM parameters are shown in model.detail.svm.cvscan.best. If you are using the graphical Analysis SVM then the optimal parameters are reported in the summary window which is shown when you mouse-over the model icon, once the model is built.<br />
<br />
====Model building performance====<br />
Building a single SVM model can sometimes be slow, especially if the calibration dataset is large. Using ranges for the SVM parameters to search for the optimal parameter combination increases the final model building time significantly. If cross-validation is used the calculation time increases again, possibly dramatically if the number of CV subsets is large. Some suggestions for faster SVM building include: <br />
:1) Turning CV off ("none") during preliminary analyses. This is MUCH faster and cross-validation is still performed using a default "Random Subsets" with 5 data splits and 1 iteration,<br />
:2) Using a coarse grid of SVM parameter values to search over for optimal values, <br />
:3) Choosing the CV method carefully, at least initially. For example, use "Random Subsets" with a small number of data splits (e.g. 5) and a small "Number of Iterations" (e.g. 1).<br />
:4) Using the compression option if the number of variables is large.<br />
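<br />
To get a feel for why a coarse grid helps, the following sketch counts parameter combinations for the default ranges listed under Options. It assumes, as a simplification, one SVM fit per parameter combination per cross-validation split; the toolbox's actual bookkeeping may differ. The coarse grids are a hypothetical choice for illustration.<br />

```matlab
% Default parameter grids as described under Options
gamma   = logspace(-6, 1, 15);    % 15 values from 10^-6 to 10
cost    = logspace(-3, 2, 11);    % 11 values from 10^-3 to 100
epsilon = [1.0 0.1 0.01];

% Number of parameter combinations for epsilon-SVR with an RBF kernel
ncombos = numel(gamma) * numel(cost) * numel(epsilon);

% Assumed cost model: one fit per combination per CV split and iteration
splits = 5; iterations = 1;
nfits = ncombos * splits * iterations;

% A coarser grid (hypothetical choice) reduces the work considerably
gamma_coarse = logspace(-6, 1, 4);
cost_coarse  = logspace(-3, 2, 3);
nfits_coarse = numel(gamma_coarse) * numel(cost_coarse) * numel(epsilon) * splits * iterations;
```

Under these assumptions the default grid needs 2475 fits, while the coarse grid needs 180.<br />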
<br />
====epsilon-SVR and nu-SVR====<br />
There are two commonly used versions of SVM regression, 'epsilon-SVR' and 'nu-SVR'. The original SVM formulation for regression (SVR) used parameters C [0, inf) and epsilon [0, inf) to apply a penalty to the optimization for points which were not correctly predicted. An alternative version of SVM regression was later developed where the epsilon penalty parameter was replaced by an alternative parameter, nu [0,1], which applies a slightly different penalty. The main motivation for the nu version of SVM is that nu has a more meaningful interpretation. This is because nu represents an upper bound on the fraction of training samples which are errors (badly predicted) and a lower bound on the fraction of samples which are support vectors. Some users feel nu is more intuitive to use than C or epsilon.<br />
Epsilon or nu are just different versions of the penalty parameter. The same optimization problem is solved in either case. Thus it should not matter which form of SVM you use, epsilon or nu. PLS_Toolbox uses epsilon since this was the original formulation and is the most commonly used form. For more details on 'nu' SVM regression see [http://www.csie.ntu.edu.tw/~cjlin/papers/newsvr.pdf]<br />
<br />
The user must provide parameters (or parameter ranges) for SVM regression as:<br />
:*'epsilon-SVR':<br />
::'''epsilon''','''C''', (using linear kernel), or<br />
::'''epsilon''','''C''', '''gamma''' (using radial basis function kernel),<br />
<br />
:*'nu-SVR':<br />
::'''nu''', '''C''', (using linear kernel), or<br />
::'''nu''', '''C''', '''gamma''' (using radial basis function kernel),<br />
<br />
====SVM Parameters====<br />
<br />
* '''cost''': Cost [0 ->inf] represents the penalty associated with errors larger than epsilon. Increasing cost value causes closer fitting to the calibration/training data.<br />
* '''gamma''': Kernel ''gamma'' parameter controls the shape of the separating hyperplane. Increasing gamma usually increases number of support vectors.<br />
* '''epsilon''': In training the regression function there is no penalty associated with points which are predicted within distance epsilon from the actual value. Decreasing epsilon forces closer fitting to the calibration/training data.<br />
* '''nu''': Nu (0 -> 1] indicates a lower bound on the number of support vectors to use, given as a fraction of total calibration samples, and an upper bound on the fraction of training samples which are errors (poorly predicted).<br />
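<br />
The role of epsilon can be illustrated with the standard epsilon-insensitive loss used in epsilon-SVR. This sketch is independent of LIBSVM and PLS_Toolbox; the function handle eps_loss and the data values are made up for illustration.<br />

```matlab
% Epsilon-insensitive loss (standard definition for epsilon-SVR):
% errors smaller than epsilon cost nothing, larger errors cost the excess
eps_loss = @(y, ypred, epsilon) max(0, abs(y - ypred) - epsilon);

y     = [1.0 2.0 3.0 4.0];   % measured values (made-up example)
ypred = [1.05 2.5 2.0 4.0];  % predicted values (made-up example)

loss_wide   = eps_loss(y, ypred, 0.1);   % first sample lies inside the tube
loss_narrow = eps_loss(y, ypred, 0.01);  % first sample is now penalized
```

The first sample has an error of 0.05, so it carries no penalty when epsilon is 0.1 but is penalized when epsilon is reduced to 0.01. This is why decreasing epsilon forces a closer fit to the calibration data.<br />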
<br />
===See Also===<br />
<br />
[[analysis]], [[ann]], [[mlr]], [[lwr]], [[pls]], [[pcr]], [[svmda]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Plsda&diff=10983Plsda2020-01-03T22:27:53Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Partial least squares discriminant analysis.<br />
<br />
===Synopsis===<br />
:plsda - Launches an Analysis window with the PLSDA method selected<br />
:model = plsda(x,y,ncomp,''options'')<br />
:model = plsda(x,ncomp,''options'')<br />
:pred = plsda(x,model,''options'')<br />
:valid = plsda(x,y,model,''options'')<br />
<br />
Please note that the recommended way to build a PLSDA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PLSDA is a multivariate inverse least squares discrimination method used to classify samples. The y-block in a PLSDA model indicates which samples are in the class(es) of interest through either:<br />
<br />
*(A) a column vector of class numbers indicating class assignments:<br />
<br />
y = [1 1 3 2]';<br />
<br />
:'''NOTE:''' if classes are assigned in the input (x), y can be omitted and this option will be assumed using the first class set of the x-block rows (or other set if the option "classset" is used). For information on assigning classes to the X-block, see [[Assigning Sample Classes]].<br />
<br />
*(B) a matrix of one or more columns containing a logical zero (= not in class) or one (= in class) for each sample (row):<br />
<pre><br />
y = [1 0 0;<br />
1 0 0;<br />
0 0 1;<br />
0 1 0]<br />
</pre><br />
<br />
:'''NOTE''': When a vector of class numbers is used (case A, above), class zero (0) is reserved for "unknown" samples and, thus, samples of class zero are never used when calibrating a PLSDA model. The model will include predictions for these samples.<br />
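<br />
The relationship between the two y-block forms can be sketched in plain MATLAB as follows. This is an illustration only; the toolbox function [[class2logical]] listed under See Also is presumably the supported way to do this, and its interface is not shown here. The variable names known, classes, and ylogical are hypothetical.<br />

```matlab
y = [1 1 3 2 0]';            % class numbers; 0 marks an unknown sample

known   = y ~= 0;            % class zero is excluded from calibration
classes = unique(y(known))'; % modeled classes as a row vector, here [1 2 3]

% One column per class: true where the sample belongs to that class
ylogical = bsxfun(@eq, y, classes);
```

The first four rows of ylogical reproduce the matrix shown in case (B) above, and the unknown sample in the last row is all zeros.<br />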
<br />
====Probability-based Predictions====<br />
The raw predictions from a PLSDA model is a value of nominally zero or one. A value closer to zero indicates the new sample is NOT in the modeled class; a value of one indicates a sample is in the modeled class. In practice a threshold between zero and one is determined above which a sample is in the class and below which a sample is not in the class (See, for example, [[plsdthres]]). Similarly, a probability of a sample being inside or outside the class can be calculated using [[discrimprob]]. The predicted probability of each class as well as class assignments made with various rules can be found in the field:<br />
<br />
:model.classification<br />
<br />
For more details, see [[Sample Classification Predictions]], and the description of the model's classification field in the [[Standard Model Structure]].<br />
<br />
====Threshold-based Predictions====<br />
It is possible to see the classification results based on the sample prediction relative to the threshold for that class. These can differ slightly from the predictions based on probabilities. The probability-based predictions are likely to be more accurate in situations where one class is narrowly distributed in y-prediction range but other classes are broadly distributed and so are more probable for y-prediction values far from the narrow class probable y range (see [http://www.eigenvector.com/faq/index.php?id=38]). <br />
* In the PLSDA Analysis window the threshold-based classification results can be viewed by using the menu: "Tools"->"Show Details"->"Model", or by mousing over the model icon. This reports the Sensitivity, Specificity, and Class Error for each modeled class. The "Class Err." is defined as the mean of the false positive and false negative rates (see [https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Definitions definitions]).<br />
* For command line usage these are found in the model object as model.detail.misclassed, a cell array containing a matrix for each class, and model.detail.classerrc. For class j:<br />
<br />
: False positive rate (1 - specificity): model.detail.misclassedc{j}(1, ncomp)<br />
: False negative rate (1 - sensitivity): model.detail.misclassedc{j}(2, ncomp), where ncomp = number of latent variables used in model.<br />
: Class Error: model.detail.classerrc(j, ncomp)<br />
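<br />
As a worked example of these quantities, the following sketch computes them from a made-up set of true and predicted class memberships for a single class. It does not read from a model object; all variable names are hypothetical.<br />

```matlab
% Made-up example for one class: true membership and thresholded prediction
in_class   = logical([1 1 1 1 0 0 0 0 0 0]);
pred_class = logical([1 1 1 0 1 0 0 0 0 0]);

tp = sum( in_class &  pred_class);   % true positives
fn = sum( in_class & ~pred_class);   % false negatives
fp = sum(~in_class &  pred_class);   % false positives
tn = sum(~in_class & ~pred_class);   % true negatives

sensitivity = tp / (tp + fn);
specificity = tn / (tn + fp);

fp_rate   = 1 - specificity;         % false positive rate
fn_rate   = 1 - sensitivity;         % false negative rate
class_err = (fp_rate + fn_rate) / 2; % mean of the two error rates
```

Here sensitivity is 0.75 and specificity is 5/6, giving a class error of 5/24, or about 0.208.<br />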
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block), class "double" or "dataset",<br />
* '''y''' = Y-block <br />
** OPTIONAL if '''x''' is a dataset containing classes for sample mode (mode 1)<br />
** otherwise, '''y''' is one of the following:<br />
***(A) column vector of sample classes for each sample in '''x''' <br />
***(B) a logical array with '1' indicating class membership for each sample (rows) in one or more classes (columns), or <br />
***(C) a cell array of class groupings of classes from the x-block data. For example: <tt> {[1 2] [3]} </tt> would model classes 1 and 2 as a single group against class 3.<br />
* '''ncomp''' = the number of latent variables to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' = an optional input options structure (see below)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the PLSDA model (See [[Standard Model Structure]]).<br />
* '''pred''' = structure array with predictions<br />
* '''valid''' = structure array with predictions, includes known class information (Y block data) of test samples<br />
<br />
Note: Calling '''plsda''' with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
<br />
For more information on class predictions, see [[Sample Classification Predictions]].<br />
<br />
===Options===<br />
<br />
''options'' = a structure that can contain the following fields:<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''orthogonalize''': [ {'off'} | 'on' ] Orthogonalize model to condense y-block variance into first latent variable; 'on' = produce orthogonalized model. Regression vector and predictions are NOT changed by this option, only the loadings, weights, and scores. See [[orthogonalizepls]] for more information.<br />
* '''priorprob''': [ ] Vector of prior probabilities of observing each class. If any class prior is "Inf", the frequency of observation of that class in the calibration is used as its prior probability. If all priors are Inf, this has the effect of providing the fewest incorrect predictions assuming that the probability of observing a given class in future samples is similar to the frequency of that class in the calibration set. The default [] uses all ones, i.e. equal priors. '''NOTE:''' the "prior" option from older versions of the software had a bug which caused inverted behavior for this feature. The field name was changed to avoid confusion after the bug was fixed.<br />
* '''classset''': [ 1 ] indicates which class set in x to use when no y-block is provided.<br />
* '''algorithm''': [ 'nip' | {'sim'} | 'dspls' | 'robustpls' ] PLS algorithm to use: NIPALS, SIMPLS, DSPLS, or robust PLS.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* 'Standard' = keep predictions and raw residuals for the Y-block only (Y-block included).<br />
:* 'Compact' = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
* '''strictthreshold''': Probability threshold value to use in strict class assignment, see [[Sample_Classification_Predictions#Class_Pred_Strict]]. Default = 0.5.<br />
*'''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits, a value of zero (0) disables calculation of confidence limits,<br />
* '''roptions''': structure of options to pass to rsimpls (robust PLS engine from the Libra Toolbox).<br />
**: '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpls'.<br />
<br />
*'''weights''': [ {'none'} | 'hist' | 'custom' ] governs sample weighting. 'none' does no weighting. 'hist' performs histogram weighting in which large numbers of samples at individual y-values are down-weighted relative to small numbers of samples at other values. 'custom' uses the weighting specified in the weightsvect option.<br />
*'''weightsvect''': [ ] Used only with custom weights. The vector specified must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[class2logical]], [[compressmodel]], [[crossval]], [[discrimprob]], [[knn]], [[modelselector]], [[pls]], [[plsdaroc]], [[plsdthres]], [[preprocess]], [[simca]], [[svmda]], [[vip]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Parafac&diff=10982Parafac2020-01-03T22:27:00Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
PARAFAC (PARAllel FACtor analysis) for multi-way arrays<br />
<br />
===Synopsis===<br />
<br />
:model = parafac(X,ncomp,''initval,options'')<br />
:pred = parafac(Xnew,model)<br />
:parafac % Launches an analysis window with Parafac as the selected method<br />
<br />
Please note that the recommended way to build a PARAFAC model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PARAFAC will decompose an array of order ''N'' (where ''N'' >= 3) into the summation over the outer product of ''N'' vectors (a low-rank model). E.g. if ''N''=3 then the array is size ''I'' by ''J'' by ''K''. An example of three-way fluorescence data is shown below.<br />
<br />
For example, twenty-seven samples containing different amounts of dissolved hydroquinone, tryptophan, phenylalanine, and dopa are measured spectrofluorometrically using 233 emission wavelengths (250-482 nm) and 24 excitation wavelengths (200-315 nm in 5 nm steps). A typical sample is also shown.<br />
<br />
[[Image:Parafacdata.gif]]<br />
<br />
A four-component PARAFAC model of these data will give four factors, each corresponding to one of the chemical analytes. This is illustrated graphically below. The first mode scores (loadings in mode 1) in the matrix '''A''' (27x4) contain estimated relative concentrations of the four analytes in the 27 samples. The second mode loadings '''B''' (233x4) are estimated emission loadings and the third mode loadings '''C''' (24x4) are estimated excitation loadings.<br />
<br />
[[Image:Parafacresults.gif]]<br />
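<br />
The trilinear structure of the model can be sketched in plain MATLAB. The loadings below are random placeholders with the dimensions of the fluorescence example; this shows how the three loading matrices combine into a model approximation of the data, not how PARAFAC estimates them. The names M and m_check are hypothetical.<br />

```matlab
I = 27; J = 233; K = 24; F = 4;   % dimensions from the example above
A = rand(I, F);   % mode 1 scores (placeholder values)
B = rand(J, F);   % mode 2 (emission) loadings (placeholder values)
C = rand(K, F);   % mode 3 (excitation) loadings (placeholder values)

% Model approximation: sum over components of the outer product
% of the f-th columns of A, B and C
M = zeros(I, J, K);
for f = 1:F
    M = M + reshape(kron(C(:,f), kron(B(:,f), A(:,f))), [I J K]);
end

% Element-wise check of one entry against the trilinear definition
% M(i,j,k) = sum over f of A(i,f)*B(j,f)*C(k,f)
m_check = sum(A(2,:) .* B(3,:) .* C(5,:));
```

Each element of the model is a sum of F products, one per component, which is what makes each factor interpretable as a single analyte in the example above.<br />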
<br />
For more information about how to use PARAFAC, see the [http://www.youtube.com/user/QualityAndTechnology/videos?view=1&flow=grid University of Copenhagen's Multi-Way Analysis Videos].<br />
<br />
In the PARAFAC algorithm, any missing values must be set to NaN or Inf and are then automatically handled by expectation maximization. This routine employs an alternating least squares (ALS) algorithm in combination with a line search. For 3-way data, the initial estimate of the loadings is usually obtained from the tri-linear decomposition (TLD).<br />
<br />
For assistance in preparing batch data for use in PARAFAC please see [[bspcgui]].<br />
<br />
====Inputs====<br />
<br />
* '''x''' = the multiway array to be decomposed, and<br />
<br />
* '''ncomp''' = <br />
:* the number of factors (components) to use, OR<br />
:* a cell array of parameters such as {a,b,c} which will then be used as the starting point for the model. The cell array must be the same length as the number of modes and element j contains the scores/loadings for that mode. If one cell element is empty, this mode is guessed based on the remaining modes.<br />
<br />
====Optional Inputs====<br />
<br />
* '''''initval''''' = <br />
:* If a parafac model is input, the data are fit to this model where the loadings for the first mode (scores) are estimated. <br />
:* If the loadings are input (e.g. model.loads) these are used as starting values.<br />
<br />
*'''''options''''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
The output model is a structure array with the following fields:<br />
<br />
* '''modeltype''': 'PARAFAC',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': 1 by ''K'' cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for each input data block,<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
Note that the sum-squared captured table contains various statistics on the information captured by each component. Please see [[MCR and PARAFAC Variance Captured]] for details.<br />
The output pred is a structure array that contains the approximation of the data if the options field blockdetails is set to 'all' (see next).<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ {'on'} | 'off' ], governs level of display,<br />
<br />
* '''plots''': [ {'final'} | 'all' | 'none' ], governs level of plotting,<br />
<br />
* '''weights''': [], used for fitting a weighted loss function (discussed below),<br />
<br />
* '''stopcriteria''': Structure defining when to stop iterations based on any one of four criteria<br />
<br />
:* '''relativechange''': Default is 1e-6. When the relative change in fit gets below the threshold, the algorithm stops.<br />
:* '''absolutechange''': Default is 1e-6. When the absolute change in fit gets below the threshold, the algorithm stops.<br />
:* '''iterations''': Default is 10000. When the number of iterations exceeds the threshold, the algorithm stops.<br />
:* '''seconds''': Default is 3600 (seconds). When the time spent exceeds the threshold, the algorithm stops.<br />
<br />
* '''init''': [ 0 ], defines how parameters are initialized (discussed below),<br />
<br />
* '''line''': [ 0 | {1}] defines whether to use the line search {default uses it},<br />
<br />
* '''algo''': [ {'ALS'} | 'tld' | 'swatld' ] governs algorithm used. Only ALS allows more than three-way and allows constraints,<br />
<br />
* '''iterative''': settings for iterative reweighted least squares fitting (see help on weights below),<br />
<br />
* '''validation.splithalf''': [ 'on' | {'off'} ], Allows doing [[splithalf]] analysis. See the help of SPLITHALF for more information,<br />
<br />
* '''auto_outlier.perform''': [ 'on' | {'off'} ], Will automatically remove detected outliers in an iterative fashion. See auto_outlier.help for more information,<br />
<br />
* '''scaletype''': Defines how loadings are scaled. See options.scaletype.text for help,<br />
<br />
* '''blockdetails''': [ {'standard'} | 'compact' | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* 'Standard' = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* 'Compact' = like 'Standard', except that residual limits from the old model are used and the core consistency field in the model structure is left empty ('model.detail.reslim', 'model.detail.coreconsistency.consistency').<br />
:* 'All' = keep predictions, raw residuals for x-block as well as the X-block dataset itself.<br />
<br />
* '''preprocessing''': {[]}, one element cell array containing preprocessing structure (see PREPROCESS) defining preprocessing to use on the x-block <br />
<br />
* '''samplemode''': [1], defines which mode should be considered the sample or object mode,<br />
<br />
* '''constraints''': {3x1 cell}, defines constraints on parameters (discussed below),<br />
<br />
* '''coreconsist''': [ {'on'} | 'off' ], governs calculation of core consistency (turning off may save time with large data sets and many components), and<br />
<br />
* '''waitbar''': [ {'on'} | 'off' ], display waitbar. <br />
<br />
The default options can be retrieved using: options = parafac('options');.<br />
<br />
=====Weights=====<br />
<br />
Through the use of the ''options'' field weights it is possible to fit a PARAFAC model in a weighted least squares sense. The input is an array of the same size as the input data X holding individual weights for each element. Instead of minimizing the Frobenius norm ||x-M||<sup>2</sup>, where M is the PARAFAC model, the norm ||(x-M).*weights||<sup>2</sup> is minimized. The algorithm used for weighted regression is based on a majorization step according to Kiers, ''Psychometrika'', '''62''', 251-266, 1997, which has the advantage of being computationally inexpensive.<br />
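<br />
The difference between the two loss functions can be seen in a small numeric sketch. The data, model, and weight values are made up, and a two-way array is used only to keep the example short; the same element-wise expression applies to multi-way arrays.<br />

```matlab
X = [1 2; 3 4];          % data (made-up values)
M = [1.5 2; 3 3];        % model approximation (made-up values)
W = [1 1; 1 0];          % weights; a zero weight ignores that element

% Ordinary least squares loss, ||X-M||^2
loss_unweighted = sum(sum((X - M).^2));

% Weighted least squares loss, ||(X-M).*W||^2
loss_weighted = sum(sum(((X - M) .* W).^2));
```

The large residual in the last element dominates the unweighted loss (1.25) but is excluded from the weighted loss (0.25) because its weight is zero.<br />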
<br />
=====Init=====<br />
<br />
The ''options'' field init is used to govern how the initial guess for the loadings is obtained. If the optional input ''initval'' is supplied, options.init is not used. The following choices for init are available.<br />
<br />
Generally, options.init = 0 will do for well-behaved data, whereas options.init = 10 will be suitable for difficult models. Difficult models are typically those with many components, with very correlated loadings, or models where there are indications that local minima are present.<br />
<br />
* '''init''' = 0, PARAFAC chooses initialization {default},<br />
<br />
* '''init''' = 1, uses TLD (unless the data is more than three-way, in which case ATLD is used),<br />
<br />
* '''init''' = 2, based on singular value decomposition (good alternative to 1), <br />
<br />
* '''init''' = 3, based on orthogonalization of random values (good for checking local minima),<br />
<br />
* '''init''' = 4, based on approximate (sequentially fitted) PARAFAC model, <br />
<br />
* '''init''' = 5, based on compression which may be useful for large data, and<br />
<br />
* '''init''' > 5, based on best fit of many (the value options.init) small runs.<br />
<br />
=====Constraints=====<br />
<br />
The ''options'' field constraints is used to employ constraints on the parameters. It is a cell array with number of elements equal to the number of modes of the input data X. Each cell contains a structure array that defines the constraints in that particular mode. Hence, options.constraints{2} defines constraints on the second mode loadings. For help on setting constraints see [[constrainfit]]. Note that if your dataset is, for example, a five-way array, the default constraint field in options only defines the first three modes. You will have to create the constraint field for the remaining modes yourself. This can be done by copying from the other modes. For example, options.constraints{4} = options.constraints{1}; options.constraints{5} = options.constraints{1};<br />
<br />
===Examples===<br />
<br />
parafac demo gives a demonstration of the use of the PARAFAC algorithm.<br />
<br />
model = parafac(X,5) fits a five-component PARAFAC model to the array X using default settings.<br />
<br />
pred = parafac(Z,model) fits a PARAFAC model to new data Z. The scores will be taken to be in the first mode, but you can change this by setting options.samplemodex to the mode which is the sample mode. Note that the sample-mode dimension may differ between the old model and the new data, but all other dimensions must be the same.<br />
<br />
options = parafac('options'); generates a set of default settings for PARAFAC. options.plots = 0; sets the plotting off.<br />
<br />
options.init = 3; sets the initialization of PARAFAC to orthogonalized random numbers.<br />
<br />
options.samplemodex = 2; defines the second mode to be the sample mode. This is useful, for example, when fitting an existing model to new data must provide the scores in the second mode.<br />
<br />
model = parafac(X,2,options); fits a two-component PARAFAC model with the settings defined in options. <br />
<br />
parafac io shows the I/O of the algorithm.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[bspcgui]], [[datahat]], [[eemoutlier]], [[explode]], [[gram]], [[mpca]], [[npls]], [[outerm]], [[parafac2]], [[pca]], [[preprocess]], [[splithalf]], [[tld]], [[tucker]], [[unfoldm]], [[modelviewer]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Npls&diff=10981Npls2020-01-03T22:26:12Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Multilinear-PLS (N-PLS) for true multi-way regression.<br />
<br />
===Synopsis===<br />
<br />
:model = npls(x,y,ncomp,''options'')<br />
:pred = npls(x,ncomp,model,''options'')<br />
<br />
Please note that the recommended way to build an N-PLS model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
NPLS fits a multilinear PLS1 or PLS2 regression model to x and y [R. Bro, J. Chemom., 1996, 10(1), 47-62]. The NPLS function can be used for both calibration and prediction.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block,<br />
<br />
* '''y''' = Y-block, and<br />
<br />
* '''ncomp''' = the number of factors to compute, or<br />
<br />
* '''model''' = in prediction mode, this is a structure containing an NPLS model.<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure (see: [[Standard Model Structure]]) with the following fields:<br />
<br />
* '''modeltype''': 'NPLS',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''reg''': cell array with regression coefficients,<br />
<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
<br />
* '''core''': cell array with the NPLS core,<br />
<br />
* '''pred''': cell array with model predictions for each input data block,<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
* '''''options''''' = options structure containing the fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
<br />
* '''outputregrescoef''': if this is set to 0, no regression coefficients associated with the X-block directly are calculated (relevant for large arrays), and<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is like 'standard' but the residual limits in the model structure are also left empty (model.detail.reslim.lim95, model.detail.reslim.lim99).<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[conload]], [[datahat]], [[explode]], [[gram]], [[modlrder]], [[mpca]], [[crossval]], [[outerm]], [[parafac]], [[parafac2]], [[pls]], [[tld]], [[unfoldm]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mpca&diff=10980Mpca2020-01-03T22:22:46Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Multi-way (unfold) principal components analysis.<br />
<br />
===Synopsis===<br />
<br />
:model = mpca(mwa,ncomp,''options'')<br />
:model = mpca(mwa,ncomp,preprostring)<br />
:pred = mpca(mwa,model,''options'')<br />
:mpca - Launches an analysis window with MPCA as the selected method.<br />
<br />
Please note that the recommended way to build an MPCA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Principal Components Analysis of multi-way data using unfolding to a 2-way matrix followed by conventional PCA.<br />
<br />
Inputs to MPCA are the multi-way array mwa (class "double" or "dataset") and the number of components to use in the model ncomp. To make predictions with new data the inputs are the multi-way array mwa and the MPCA model model. Optional input ''options'' is discussed below.<br />
<br />
For assistance in preparing batch data for use in MPCA please see [[bspcgui]].<br />
<br />
The output model is a structure array with the following fields:<br />
<br />
* '''modeltype''': 'MPCA',<br />
<br />
* '''datasource''': structure array with information about the x-block,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': 1 by 2 cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for each input data block (this is empty if options.blockdetails = 'standard'),<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
'''options''' = a structure array with the following fields.<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting,<br />
<br />
* '''outputversion''': [ 2 | {3} ] governs output format,<br />
<br />
* '''preprocessing''': { [] } preprocessing structure, {default is mean centering i.e. options.preprocessing = preprocess('default', 'mean center')} (see PREPROCESS),<br />
<br />
* '''algorithm''': [ {'svd'} | 'maf' | 'robustpca' ], algorithm for decomposition, Algorithm 'maf' requires Eigenvector's MIA_Toolbox.<br />
<br />
* '''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits.<br />
<br />
* '''roptions''': structure of options to pass to robpca (robust PCA engine from the Libra Toolbox).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''samplemode''': [ {3} ] mode (dimension) to use as the sample mode e.g. if it is 3 then it is assumed that mode 3 is the sample/object dimension i.e. if mwa is 7x9x10 then the scores model.loads{1} will have 10 rows (it will be 10xncomp).<br />
<br />
The default options can be retrieved using: options = mpca('options');.<br />
<br />
It is also possible to input just the preprocessing option as an ordinary string in place of ''options'' and have the remainder of options filled in with the defaults from above. The following strings are valid:<br />
<br />
: ''''none'''': no scaling,<br />
<br />
: ''''auto'''': unfolds array then applies autoscaling,<br />
<br />
: ''''mncn'''': unfolds array then applies mean centering, or<br />
<br />
: ''''grps'''': {default} unfolds array then group/block scales each variable, i.e. the same variance scaling is used for each variable along its time trajectory (see GSCALE).<br />
<br />
MPCA will work with arrays of order 3 and higher. For higher order arrays, the last order is assumed to be the sample order, ''i.e.'' for an array of order ''n'' with the dimension of order ''n'' being ''m'', the unfolded matrix will have ''m'' samples. For arrays of higher order, the group scaling option groups together all data with the same order 2 index, ''i.e.'' for a multiway array mwa, each mwa(:,j,:, ... ,:) is scaled as a group.<br />
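<br />
The unfolding and group scaling described above can be sketched in plain MATLAB as follows. This is an illustration under assumptions, not the toolbox implementation: the column ordering of the unfolded matrix and the exact scaling divisor used by mpca and GSCALE may differ, and the array mwa is made-up data.<br />
<br />
<pre>% Illustration only: unfold a 7 x 9 x 10 array with mode 3 as the sample mode.<br />
mwa = reshape(1:7*9*10, [7 9 10]);<br />
[I, J, K] = size(mwa);<br />
<br />
% Bring the sample mode to the front, then flatten the remaining modes.<br />
% Column (j-1)*I + i of Xunf holds mwa(i,j,:).<br />
Xunf = reshape(permute(mwa, [3 1 2]), K, I*J);    % 10 x 63<br />
<br />
% Mean centering: remove the column means of the unfolded matrix.<br />
Xmncn = Xunf - repmat(mean(Xunf, 1), K, 1);<br />
<br />
% Group scaling (assumed form): one scale factor per order-2 index j,<br />
% shared by all columns that come from mwa(:,j,:).<br />
Xgrp = Xmncn;<br />
for j = 1:J<br />
    cols = (j-1)*I + (1:I);           % columns belonging to mwa(:,j,:)<br />
    block = Xmncn(:, cols);<br />
    Xgrp(:, cols) = block / std(block(:));<br />
end</pre><br />
<br />
With this layout the unfolded matrix has one row per sample, which matches the statement above that a 7x9x10 array with sample mode 3 gives scores with 10 rows.<br />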
<br />
===See Also===<br />
<br />
[[analysis]], [[bspcgui]], [[evolvfa]], [[ewfa]], [[explode]], [[npls]], [[parafac]], [[parafac2]], [[pca]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Modelselector&diff=10979Modelselector2020-01-03T22:21:31Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Create or apply a Model Selector model.<br />
<br />
===Synopsis===<br />
<br />
:model = modelselector(triggermodel,target_1,target_2,...,target_default);<br />
:[target_model,applymodel] = modelselector(data,model)<br />
:modelselector % Launches the modelselector tool<br />
<br />
Please note that the recommended way to build a Model Selector model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
A Model Selector Model is a special model type which, when applied to new data, selects between two or more "target" models or outputs (a.k.a. end-point nodes) based on a "trigger" model (a.k.a. rule or decision node). These models are used to implement discrete local models when a single global model is not sufficient for all possible scenarios. <br />
<br />
For example, if a single PCA or PLS model does not perform sufficiently for all operating conditions but the operating conditions can be split into two or more easier-to-model subsets, a selector model can be used to choose between these subset models when applying the models to new data.<br />
<br />
Selector models consist of a trigger model (trigger) which can be a classification model or a set of one or more logical test strings to test the raw data or output of a regression or decomposition model and a set of two or more target models (target_1, target_2, etc) which can be any type of standard model structure, any standard object, or an empty array [ ] to indicate a null model.<br />
<br />
Model Selector models can also be created using the graphical [[Hierarchical Model Builder]] interface.<br />
<br />
====Trigger Models====<br />
<br />
Trigger models define the decision/rule nodes. Guidelines and rules for trigger models:<br />
<br />
* (A) A classification trigger model can be created using the standard classification modeling functions (PLSDA, KNN, SIMCA, SVMDA.) The model should be built with data representative of the sample types to which each target model can be applied. The number of classes separated by the model dictates the number of target models which can be selected from. The target models should be in the same order as the numerical class numbers used when building the model (e.g. if classes 1, 2 and 3 are used in PLSDA, the target models should be ordered so that target_1 is appropriate if the PLSDA model finds that a sample is class 1, target_2 is for class 2, and target_3 is for class 3.) A special case for PLSDA is that an "otherwise" model can be included which will be selected if the Q and T^2 values of the prediction are not within the limits specified by the qtestlimit and t2testlimit options.<br />
<br />
* (B) Logical test strings are specified as a trigger model by passing a cell containing one or more strings which perform a logical test on a variable from the data set. Variables are specified using either a label in double quotes (e.g. "flowrate"), or an axisscale value in quotes and square brackets (e.g. "[1530]"). The variable can be used in any interpretable Matlab expression (including function calls) that returns a logical result. The simplest test could involve one of the Matlab logical comparison operators ( < > <= >= == and ~= ) and a value to which the given variable should be compared. For example, the trigger model:<br />
<br />
:<pre>{'"Fe">1100' '"Fe"<500'}</pre><br />
<br />
:tests if the variable named "Fe" is greater than 1100. If true, the target_1 model is applied, if not true, "Fe" is tested for being less than 500, and if so, target_2 is selected. If neither test is true, the "default" target model (i.e. target_3) is selected. <br />
<br />
:Example 2:<br />
<br />
:<pre>{'"[1745.3]"<=500'}</pre><br />
<br />
:tests if variable 1745.3 (on the variable axisscale) is less than or equal to 500. If true, target_1 is selected; if not true, the default target model is selected. If variable 1745.3 does not exist, it is interpolated from the provided data.<br />
<br />
* (C) Logical test strings to be applied to the prediction from a simple regression (e.g. PLS, MLR, etc) or decomposition (e.g. PCA) model. When passed in this format, the first item in the cell array is a model to be applied to the data followed by a sequence of one or more logical tests to be performed on the predictions from the model application. The format of the logical tests is the same as in case (B) above EXCEPT any axisscale reference will be used to index into either the predicted values (for regression models) or the scores (for a decomposition model). For example, a test string of '"[2]">0.3' for a PLS trigger model would indicate that the second predicted y-value should be tested for being > 0.3. The tests will be performed on the predictions from the model.<br />
:Example 1:<br />
:<pre>{ regmodel '"[1]"<10' '"[1]"<100' }</pre><br />
:applies "regmodel" to the data then tests the predicted values against the tests <10 and <100 for a total of three classes. If the predicted value is <10, the first target model is selected. If it is <100, the second target model is selected. Otherwise, the third (default) model is selected. Note that this method can not generally be used on classification models. Instead, the (A) form of trigger models should be used.<br />
:Normally, the "label" form of the test strings is not valid for prediction and decomposition trigger models. However, two special cases are "Q" and "T2" which, if used in a test string, will test the Q and Hotelling's T^2 values output by the model. Note that, if the model type does not have Q or T^2 values (SVM model, for example) a test including Q or T2 will throw an error.<br />
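<br />
The first-match-wins logic of the "Fe" example in (B) can be mimicked in plain MATLAB. This is only a hypothetical stand-in: the anonymous functions below replace the test strings, and nothing here calls modelselector itself.<br />
<br />
<pre>% Illustration only: function handles standing in for the test strings<br />
% {'"Fe">1100' '"Fe"<500'}; the Fe values are made up.<br />
tests = {@(fe) fe > 1100, @(fe) fe < 500};<br />
fe_values = [1500 300 800];<br />
<br />
selected = zeros(size(fe_values));<br />
for n = 1:numel(fe_values)<br />
    selected(n) = numel(tests) + 1;   % default target if no test is true<br />
    for t = 1:numel(tests)<br />
        testfn = tests{t};<br />
        if testfn(fe_values(n))<br />
            selected(n) = t;          % first true test decides the target<br />
            break<br />
        end<br />
    end<br />
end<br />
% selected is [1 2 3]: target_1, target_2, then the default target</pre><br />
<br />
Because the tests are evaluated in order, the order of the strings in the trigger cell determines which target is chosen when more than one test would be true.<br />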
<br />
====Target Models and Objects====<br />
<br />
Target models and objects define end-point nodes for each branch of the modelselector tree. When creating a selector model, there must be at least as many targets passed as there are classes (when trigger is a classification model) or strings (when trigger is a cell of logical test strings). There may also be an additional target (i.e. the "default") which is used if none of the classes or tests were positive.<br />
<br />
Note that target objects may be any standard model including another selector model (thus allowing multi-layer selector trees) or standard object including strings, numerical arrays, DataSet objects or even structures or cells.<br />
<br />
If a target is a model, the output of modelselector will be the result of applying the model to the input data (see also the 'applytarget' option which disables this behavior). If the target is not a model, the target object itself will be returned.<br />
<br />
One special target object is the "error" object. This object consists of a simple structure with one field named "error" which contains a string which should be used to throw an error. If the given target is selected, modelselector will throw a runtime error with the given string. This is useful in conjunction with on-line systems which should cease execution and send an alarm to a control system when the given situation occurs. This behavior can be changed by modifying the "errors" option (see below.)<br />
<br />
====Applying Models====<br />
<br />
When a single row of new data is passed as a dataset along with the selector model itself, the output is the result of applying the selected target model (target_model) or the target object, along with a unique description of the branch(es) taken to select the target model as a vector of branch numbers (applymodel). For example, given a multi-layer selector model containing:<br />
<br />
<pre>selector_model -> target_1 = PCA_model_A1<br />
target_2 = Selector_model -> target_1 = PCA_model_B1<br />
target_2 = PCA_model_B2 <br />
target_3 = PCA_model_A2</pre><br />
<br />
a returned value for applymodel of [2 1] implies that the second target model was selected from the first layer of target models, and this model was another selector model. From that second selector model, the first target model (PCA_model_B1) was selected and that is what was returned.<br />
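<br />
Reading a branch vector can be sketched with a nested cell array standing in for the tree above. The variable tree is a hypothetical stand-in, not a real selector model:<br />
<br />
<pre>% Illustration only: nested cells mimic the two-layer selector tree.<br />
tree = {'PCA_model_A1', {'PCA_model_B1', 'PCA_model_B2'}, 'PCA_model_A2'};<br />
applymodel = [2 1];<br />
<br />
node = tree;<br />
for b = applymodel<br />
    node = node{b};     % descend one layer per branch number<br />
end<br />
% node is 'PCA_model_B1'</pre><br />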
<br />
If more than one row of new data is passed to modelselector, each row will be predicted and the output will be either a cell array of objects (prediction structures, strings) or a dataset with one row for each input row (when numerical values are output or numerical prediction results are extracted using the outputfilters option).<br />
<br />
Note that if there are multiple "branches" (trigger models) the data passed to modelselector must contain all the data necessary for all trigger models within the selector model. If some of those variables are not used by a given model, modelselector will automatically discard unneeded variables before applying each trigger model.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
<br />
* '''multiple''' : [{'otherwise'} | 'mostprobable' | 'fail' ] Governs behavior when more than one class of a classification model is assigned. 'fail' will throw an error. 'mostprobable' will choose the target that corresponds to the most probable class. 'otherwise' will use the last target (otherwise.)<br />
* '''outputfilters''' : {} Provides information on how to filter output results after selection of a target. The value of this option is stored in the model indicating how the output of each corresponding target should be indexed. It is a cell array equal in length to the number of targets. Each cell element can contain another cell array with one or more of the following:<br />
::(A) a standard subscript indexing as defined by the Matlab "substruct" command<br />
::(B) a 2-element cell array containing a string label in the first cell and a standard substruct indexing structure in the second cell (as with A above)<br />
::(C) a string or numerical constant to be included verbatim. The contents of each cell are concatenated row-wise. If any top-level cell elements are missing or the corresponding element is empty, no filtering is done.<br />
<pre> EXAMPLE:<br />
{<br />
{ {'my prediction' substruct('.','prediction')} {'Q residuals' substruct('.','Q')} }<br />
      { {'my prediction' substruct('.','prediction')} {'Q residuals' [0]} }<br />
{ }<br />
}<br />
</pre><br />
::This would grab the "prediction" and "Q" outputs from a model for the first target, and would grab only the "prediction" field for the second target and add a 0 (zero) to that. No filtering would be done on the third target. The odd use of [0] in row 2 is useful because some models may not output Q residuals, but you might still want to always output a second value so that predictions from the first model and the second model are extracted to form an identical output (two columns).<br />
<br />
* '''applytarget''' : [ 'off' |{'on'}] When 'on' any target that is a model is automatically applied to the data. Note that modelselector models are ALWAYS automatically applied to the data.<br />
* '''errors''' : [{'throw'}| 'struct' | 'string' ] Governs handling of error targets (structure with field "error" and a string). If 'throw', the content of the error field is thrown as an error. If 'string' the string content of the error field is returned with a prefix of "ERROR: ". If 'struct' the entire error structure is returned.<br />
* '''qtestlimit''' : [3] Governs Q limit testing for PLSDA models (over this reduced value = otherwise branch is used)<br />
* '''t2testlimit''' : [3] Governs T2 limit testing for PLSDA models (over this reduced value = otherwise branch is used)<br />
* '''addtrigmodels''' : [{'trendtool'}] Specifies which models (other than standard) are allowed as trigger models. This option is mostly for future use as new methods are added which may be usable as trigger models.<br />
<br />
* '''waitbar''' : [{'off'} | 'on'] Governs display of waitbar while processing.<br />
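<br />
The substruct indexing used by the outputfilters option is standard MATLAB and can be tried on its own. In the sketch below, pred is a made-up structure standing in for a target's output; it only illustrates how a label and substruct pair picks out a field, not how modelselector applies the filters internally.<br />
<br />
<pre>% Illustration only: apply {label, substruct} pairs to a stand-in structure.<br />
pred = struct('prediction', 4.2, 'Q', 0.7);<br />
filters = { {'my prediction' substruct('.','prediction')}, ...<br />
            {'Q residuals'   substruct('.','Q')} };<br />
<br />
labels = cell(1, numel(filters));<br />
values = zeros(1, numel(filters));<br />
for n = 1:numel(filters)<br />
    labels{n} = filters{n}{1};                   % column label<br />
    values(n) = subsref(pred, filters{n}{2});    % indexed value<br />
end<br />
% values is [4.2 0.7]</pre><br />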
<br />
===See Also===<br />
<br />
[[Hierarchical Model Builder]], [[knn]], [[lwrpred]], [[plsda]], [[simca]], [[svmda]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mlr&diff=10978Mlr2020-01-03T22:17:53Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Multiple Linear Regression for multivariate Y.<br />
<br />
===Synopsis===<br />
<br />
:model = mlr(x,y,options)<br />
:pred = mlr(x,model,options)<br />
:valid = mlr(x,y,model,options)<br />
:mlr % Launches analysis window with MLR as the selected method.<br />
<br />
Please note that the recommended way to build a MLR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
MLR identifies models of the form Xb = y + e.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block: predictor block (2-way array or DataSet Object)<br />
<br />
* '''y''' = Y-block: predicted block (2-way array or DataSet Object)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = standard model structure containing the MLR model (see [[Standard Model Structure]])<br />
<br />
* '''pred''' = structure array with predictions<br />
<br />
* '''valid''' = structure array with predictions<br />
<br />
===Options ===<br />
<br />
'''options''' = a structure array with the following fields.<br />
<br />
* '''display''': [ {'off'} | 'on'] Governs screen display to command line.<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
<br />
* '''ridge''': [ 0 ] ridge parameter to use in regularizing the inverse.<br />
<br />
* '''preprocessing''': { [] [] } preprocessing structure (see PREPROCESS).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
====Studentized Residuals====<br />
From version 8.8 onwards, the Studentized Residuals shown for MLR Scores Plot are now calculated for calibration samples as:<br />
MSE = sum((res).^2)./(m-1);<br />
syres = res./sqrt(MSE.*(1-L));<br />
where res = y residual, m = number of samples, and L = sample leverage.<br />
This represents a constant multiplier change from how Studentized Residuals were previously calculated.<br />
For test datasets, where pres = predicted y residual, the semi-Studentized residuals are calculated as:<br />
MSE = sum((res).^2)./(m-1);<br />
syres = pres./sqrt(MSE);<br />
This represents a constant multiplier change from how the semi-Studentized Residuals were previously calculated.<br />
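<br />
The two formulas above can be checked on a small example in plain MATLAB. This is a sketch under assumptions: the data are invented, the regression includes an explicit intercept column, and the leverage L is taken as the diagonal of the ordinary least squares hat matrix, which may not match how the toolbox computes sample leverage.<br />
<br />
<pre>% Illustration only: Studentized residuals for a small least squares fit.<br />
x = [1 2 3 4 5 6]';<br />
y = [1.1 1.9 3.2 3.9 5.1 5.8]';<br />
X = [ones(size(x)) x];               % assumed design: intercept + one predictor<br />
b = X \ y;                           % least squares coefficients<br />
<br />
res = y - X*b;                       % calibration y residuals<br />
m = numel(y);                        % number of calibration samples<br />
L = sum((X / (X'*X)) .* X, 2);       % leverage: diagonal of X*inv(X'*X)*X'<br />
<br />
MSE = sum((res).^2)./(m-1);<br />
syres = res./sqrt(MSE.*(1-L));       % Studentized residuals (calibration)<br />
<br />
% Test samples: semi-Studentized residuals use the calibration MSE.<br />
xt = [7 8]';<br />
yt = [7.2 7.9]';<br />
pres = yt - [ones(size(xt)) xt]*b;   % predicted y residuals<br />
sytest = pres./sqrt(MSE);</pre><br />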
<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[ils_esterror]], [[modelstruct]], [[pcr]], [[pls]], [[preprocess]], [[ridge]], [[testrobustness]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Lwr&diff=10977Lwr2020-01-03T22:16:36Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Locally weighted regression (LWR) for univariate Y.<br />
<br />
===Synopsis===<br />
<br />
:model = lwr(x,y,ncomp,''npts'',''options''); %identifies model (calibration step)<br />
:pred = lwr(x,model,''options''); %makes predictions with a new X-block<br />
:valid = lwr(x,y,model,''options''); %makes predictions with new X- & Y-block<br />
:lwr % Launches an Analysis window with LWR as the selected method.<br />
<br />
Please note that the recommended way to build a LWR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]].<br />
<br />
===Description===<br />
<br />
LWR calculates a single locally weighted regression model using the given number of principal components <tt>ncomp</tt> to predict a dependent variable <tt>y</tt> from a set of independent variables <tt>x</tt>. <br />
<br />
LWR models are useful for performing predictions when the dependent variable, <tt>y</tt>, has a non-linear relationship with the measured independent variables, <tt>x</tt>. Because such responses can often be approximated by a linear function on a small (local) scale, LWR models work by choosing a subset of the calibration data (the "local" calibration samples) to create a "local" model for a given new sample. The local calibration samples are identified as the samples closest to a new sample in the score space of a PCA model (the "selector model"), using the Mahalanobis distance measure. Models are defined by the number of principal components used for the selector model (<tt>ncomp</tt>) and the number of points (samples) selected as local (<tt>npts</tt>).<br />
<br />
Once the samples are selected, one of three algorithms is used to calculate the local model:<br />
:* '''globalpcr''' = the scores from the PCA selector model (for the selected samples) are used to calculate a PCR model. This model is more stable when there are fewer samples being selected, but may not perform as well with high degrees of non-linearity.<br />
:* '''pcr''' / '''pls''' = the raw data of the selected samples are used to create a weighted PCR or PLS model. These models are more adaptable to highly varying non-linearity but may also be less stable when fewer samples are being selected. <br />
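<br />
The sample-selection step described above can be sketched in plain MATLAB. This is an illustration under assumptions rather than the lwr implementation: the data are invented, only mean centering is applied, and the Mahalanobis distance is computed by dividing each score difference by the calibration score variance.<br />
<br />
<pre>% Illustration only: pick the npts calibration samples nearest to a new<br />
% sample in the score space of a PCA "selector model".<br />
Xcal = sin((1:20)' * (1:5) / 3);     % made-up calibration data, 20 x 5<br />
xnew = Xcal(7,:) + 0.01;             % new sample close to calibration sample 7<br />
ncomp = 2;                           % PCs in the selector model<br />
npts = 5;                            % number of local samples<br />
<br />
mx = mean(Xcal, 1);<br />
Xc = Xcal - repmat(mx, size(Xcal,1), 1);<br />
[U, S, V] = svd(Xc, 'econ');<br />
P = V(:, 1:ncomp);                   % PCA loadings<br />
T = Xc * P;                          % calibration scores<br />
tnew = (xnew - mx) * P;              % scores of the new sample<br />
<br />
% PCA scores are uncorrelated, so the Mahalanobis distance reduces to a<br />
% variance-scaled Euclidean distance in score space.<br />
tvar = var(T, 0, 1);<br />
dT = T - repmat(tnew, size(T,1), 1);<br />
d2 = sum(dT.^2 ./ repmat(tvar, size(T,1), 1), 2);<br />
<br />
[~, order] = sort(d2);<br />
localidx = order(1:npts);            % indices of the local calibration samples</pre><br />
<br />
The rows Xcal(localidx,:) and the matching y values would then be used to fit the local regression by one of the algorithms listed above.<br />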
<br />
The LWR function can be used in 'prediction mode' to apply a previously built LWR model, <tt>model</tt>, to a new set of data in <tt>x</tt>, in order to generate y-values for these data. <br />
<br />
Furthermore, if matching x-block and y-block measurements are available for an external test set, then LWR can be used in 'validation mode' to predict the y-values of the test data from the model <tt>model</tt> and <tt>x</tt>, and allow comparison of these predicted y-values to the known y-values <tt>y</tt>.<br />
<br />
For more information on the basic LWR algorithm, see <tt>T. Naes, T. Isaksson, B. Kowalski, Anal Chem 62 (1990) 664-673.</tt><br />
For details on the use of y distance when selecting nearest points (option alpha), see <tt>Z. Wang, T. Isaksson, B. R. Kowalski, Anal Chem 66 (1994) 249–260.</tt><br />
<br />
Note: Calling lwr with no inputs starts the graphical user interface (GUI) for this analysis method. There is a<br />
[[Image:Movie.png|link=http://www.eigenvector.com/eigenguide.php?m=Nonlinear_methods_3]]<br />
[http://www.eigenvector.com/eigenguide.php?m=Nonlinear_methods_3 video using the LWR interface] on the Eigenvector Research web page.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset"<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset"<br />
* '''ncomp''' = the number of latent variables to be calculated (positive integer scalar)<br />
* '''npts''' = the number of points to use in local regression (positive integer scalar)<br />
* '''model''' = previously generated lwr model<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model (see [[Standard Model Structure]])<br />
* '''pred''' = a structure, similar to '''model''', that contains scores, predictions, etc. for the new data.<br />
* '''valid''' = a structure, similar to '''model''', that contains scores, predictions, and additional y-block statistics, etc. for the new data.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
* '''waitbar''': [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'globalpcr' | {'pcr'} | 'pls' ] LWR algorithm to use. Method of regression after samples are selected. 'globalpcr' performs PCR based on the PCs calculated from the entire calibration data set but a regression vector calculated from only the selected samples. 'pcr' and 'pls' calculate a local PCR or PLS model based only on the selected samples.<br />
* '''blockdetails''': [ {'standard'} | 'all' ], extent of predictions and residuals included in model, 'standard' = only y-block, 'all' x- and y-blocks.<br />
* '''confidencelimit''': [ {0.95} ], confidence level for Q and T2 limits, a value of zero (0) disables calculation of confidence limits,<br />
* '''reglvs''': [] Used only when algorithm is 'pcr' or 'pls', this is the number of latent variables/principal components to use in the local regression model, if different from the number selected in the SSQ Table. The number of components selected in the SSQ table is used to generate the global PCA model which is used to select the local calibration samples. [] (Empty) implies LWRPRED should use the same number of latent variables in the local regression as were used in the global PCA model. NOTE: This option is NOT used when algorithm is 'globalpcr'.<br />
* '''iter''': [{5}] Iterations in determining local points. Used only when alpha > 0 (i.e. when using y-distance scaling).<br />
* '''alpha''': [ {0} ], has value in range [0-1]. Weighting of y-distances in selection of local points. 0 = do not consider y-distances {default}, 1 = consider ONLY y-distances. With any positive alpha, the algorithm will tend to select samples which are close in the PC space and which also have similar y-values. This is accomplished by repeating the prediction multiple times. In the first iteration, the selection of samples is done only on the PC space. Subsequent iterations take into account the comparison between the predicted y-value of the new sample and the measured y-values of the calibration samples.<br />
The default options can be retrieved using: options = lwr('options');<br />
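<br />
The effect of '''alpha''' on sample selection can be illustrated with a small plain-MATLAB sketch. This is not the LWR implementation: the linear blend of the two distances, the scaling of each distance by its maximum, and all variable names are assumptions made for the example.<br />

```matlab
% Hypothetical illustration of the alpha option; not the toolbox code.
scores = [0 0; 1 0; 0 1; 3 3; 4 4];   % calibration scores in PC space (5 samples)
ycal   = [1.0; 5.0; 1.2; 1.1; 9.0];   % measured y for the calibration samples
tnew   = [0.2 0.1];                   % scores of the new sample
ypred  = 1.0;                         % y predicted for the new sample in a previous iteration
npts   = 2;                           % number of local points to select
alpha  = 0.5;                         % weight given to y-distance (0 = ignore y, 1 = y only)

% Distances in PC space and in y, each scaled to [0,1] (assumed normalization).
dx = sqrt(sum((scores - repmat(tnew, size(scores,1), 1)).^2, 2));
dy = abs(ycal - ypred);
dx = dx / max(dx);
dy = dy / max(dy);

% Blend the two distances and keep the npts closest samples.
d = (1 - alpha) * dx + alpha * dy;
[~, order] = sort(d);
localidx = sort(order(1:npts))';

% With alpha = 0 only the PC-space distance is used.
[~, order0] = sort(dx);
localidx0 = sort(order0(1:npts))';
```

In this toy data set, sample 2 is close in PC space but has a very different y-value, so a positive alpha replaces it with sample 3.<br />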
<br />
===See Also===<br />
<br />
[[analysis]], [[ann]], [[lwrpred]], [[modelstruct]], [[pls]], [[pcr]], [[preprocess]], [[svm]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Knn&diff=10976Knn2020-01-03T22:15:31Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
K-nearest neighbor classifier.<br />
<br />
===Synopsis===<br />
<br />
:pclass = knn(xref,xtest,k,options); %make prediction without model<br />
:pclass = knn(xref,xtest,options); %use default k<br />
:model = knn(xref,k,options) %create model<br />
:modelp = knn(xref,model,k,options) %apply model to xtest<br />
:modelp = knn(xtest,model,options) %apply model to xtest; predictions (equivalent to pclass) in modelp.classification.mostprobable.<br />
:[pclass,closest,votes] = knn(xref,xtest,k,options); %make prediction without model<br />
:[pclass,closest,votes] = knn(xref,xtest,options); %use default k<br />
:[pclass,closest,votes] = knn(xref,k,options); %self-prediction without model<br />
: knn % Launches an Analysis window with KNN as the selected method.<br />
<br />
Please note that the recommended way to build a K-nearest neighbor model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Performs kNN classification where the "k" closest samples in a reference set vote on the class of an unknown sample based on distance to the reference samples. If no majority is found, the unknown is assigned the class of the closest sample (see input options for other no-majority behaviors).<br />
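<br />
The voting rule described above can be sketched in plain MATLAB for a single unknown sample. This is only an illustration and does not call <tt>knn</tt>: the Euclidean distance, the interpretation of "majority" as more than half of the k votes, and the variable names (<tt>classref</tt> in particular) are assumptions.<br />

```matlab
% Minimal kNN vote for one unknown sample; the strict-majority rule is an assumption.
xref     = [0 0; 0 1; 1 0; 5 5; 5 6; 7 5];  % reference samples (rows)
classref = [1; 1; 1; 2; 2; 2];              % class of each reference sample
xtest    = [4 4];                           % unknown sample
k        = 3;                               % number of neighbors that vote

% Euclidean distance from the unknown to every reference sample.
d = sqrt(sum((xref - repmat(xtest, size(xref,1), 1)).^2, 2));
[~, order] = sort(d);
closest = order(1:k)';            % indices of the k nearest reference samples
votes   = classref(closest)';     % class voted for by each neighbor

% Count votes per class and look for a class with more than half of them.
classes = unique(votes);
counts  = arrayfun(@(c) sum(votes == c), classes);
[nbest, ibest] = max(counts);
if nbest > k/2
    pclass = classes(ibest);      % majority found
else
    pclass = votes(1);            % no majority: class of the closest sample
end
```

The row vectors <tt>closest</tt> and <tt>votes</tt> correspond to one row of the outputs of the same name described below.<br />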
<br />
====Inputs====<br />
<br />
* '''xref''' = a DataSet object of reference data,<br />
<br />
* '''xtest''' = a DataSet object or Double containing the unknown test data.<br />
<br />
====Optional Inputs====<br />
<br />
* '''''model''' '' = an optional standard KNN model structure which can be passed instead of xref (note order of inputs: (xtest,model) ) to apply model to test data.<br />
<br />
* '''k''' = number of nearest neighbors in the reference set that vote on the class of each sample (a default value is used when k is omitted).<br />
<br />
====Outputs====<br />
<br />
* '''pclass''' = the voted closest class, if a majority of nearest neighbors were of the same class, or the class of the closest sample, if no majority was found (Only returned if xtest is supplied).<br />
<br />
* '''closest''' = matrix of samples (rows) by closest neighbor index (columns). Will always have k columns indicating which samples were the closest to the given sample (row).<br />
* '''votes''' = matrix of samples (rows) by class numbers voted for (columns). Will always have k columns indicating which classes were voted for by each nearest neighbor corresponding to closest matrix.<br />
<br />
* '''model''' = if no test data (xtest) is supplied, a standard model structure is returned which can be used with test data in the future to perform a prediction. Note that information about the classification of X-block samples is available in the '''classification''' field, described at [[Standard_Model_Structure#model|Standard Model]]. <br />
<br />
For more information on class predictions, see [[Sample Classification Predictions]].<br />
<br />
===Options===<br />
<br />
'''options''' = structure array with the following fields :<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to screen.<br />
<br />
* '''waitbar''' : [ 'off' | 'on' |{'auto'}] governs display of a waitbar when classifying. 'on' always shows a waitbar, 'off' never shows a waitbar, 'auto' shows a waitbar only when the data is particularly large.<br />
<br />
* '''preprocessing''': { [ ] } A cell containing a preprocessing structure or keyword (see PREPROCESS). Use {'autoscale'} to perform autoscaling on reference and test data.<br />
<br />
* '''classset''' : [ 1 ] indicates which class set in xref to use.<br />
<br />
* '''nomajority''': [ 'error' | {'closest'} | class_number ] Behavior when no majority is found in the votes. 'closest' = return class of closest sample. 'error' = give error message. class_number (i.e. any numerical value) = return this value for no-majority votes (e.g. use 0 to return zero for all no-majority votes)<br />
<br />
* '''strictthreshold''': Probability threshold value to use in strict class assignment, see [[Sample_Classification_Predictions#Class_Pred_Strict]]. Default = 0.5.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[cluster]], [[dbscan]], [[knnscoredistance]], [[modelselector]], [[plsda]], [[simca]], [[svmda]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Frpcr&diff=10975Frpcr2020-01-03T22:14:09Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Full-ratio PCR calibration and prediction.<br />
<br />
===Synopsis===<br />
<br />
:model = frpcr(x,y,ncomp,''options'') %calibration<br />
:pred = frpcr(x,model,''options'') %prediction<br />
:valid = frpcr(x,y,model,''options'') %validation<br />
<br />
Please note that the recommended way to build a Full-ratio PCR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
FRPCR calculates a single full-ratio PCR model using the given number of components <tt>ncomp</tt> to predict <tt>y</tt> from measurements <tt>x</tt>. Random multiplicative scaling of each sample can be used to aid model stability. Full-Ratio PCR models are based on the simultaneous regression for both y-block prediction and scaling variations (such as those due to pathlength and collection efficiency variations). The resulting PCR model is insensitive to absolute scaling errors.<br />
<br />
NOTE: For best results, the x-block should not be mean-centered.<br />
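<br />
The reason for avoiding mean-centering can be seen from a small numerical sketch. It assumes that a full-ratio prediction is the ratio of two linear combinations of the measurements; that form and the regression vectors below are illustrative assumptions, not values or code from FRPCR.<br />

```matlab
% Illustrative only: ratio-type prediction with made-up regression vectors.
x    = [2 4 6 8];            % one measured sample (row of the x-block)
bnum = [0.5; 0.1; 0; 0.2];   % hypothetical numerator vector
bden = [0.1; 0.1; 0.1; 0.1]; % hypothetical denominator vector
c    = 1.7;                  % multiplicative scaling error (e.g. a pathlength change)

yhat        = (x * bnum) / (x * bden);
yhat_scaled = ((c * x) * bnum) / ((c * x) * bden);   % c cancels in the ratio

% Subtracting a calibration mean from the scaled sample breaks the cancellation.
xmean        = [1 1 1 1];
yhat_mc      = ((x - xmean) * bnum) / ((x - xmean) * bden);
yhat_mc_scal = ((c * x - xmean) * bnum) / ((c * x - xmean) * bden);
```

Without centering, the scaled and unscaled samples give the same prediction; after centering they differ, which is consistent with the <tt>useoffset</tt> option being suggested for mean-centered data.<br />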
<br />
Inputs are <tt>x</tt> the predictor block (2-way array or dataset object), <tt>y</tt> the predicted block (2-way array or dataset object), <tt>ncomp</tt> the number of components to be calculated (positive integer scalar) and the optional options structure, ''options''.<br />
<br />
The output of the function is a standard model structure <tt>model</tt>. In prediction and validation modes, the same model structure is used but predictions are provided in the <tt>model.detail.pred</tt> field.<br />
<br />
Although the full-ratio method uses a different method for determination of the regression vector, the fundamental idea is very similar to the optimized scaling 2 method as described in:<br />
<br />
T.V. Karstang and R. Manne, "Optimized scaling: A novel approach to linear calibration with closed data sets", Chemom. Intell. Lab. Syst., '''14''', 165-173 (1992).<br />
<br />
====Inputs====<br />
<br />
* '''x''' = input x-block (should not be mean-centered), 2-way double array or dataset object.<br />
* '''y''' = input y-block, 2-way double array or dataset object, calibration and validation modes.<br />
* '''ncomp''' = number of components, calibration mode.<br />
* '''model''' = existing model, prediction and validation modes.<br />
<br />
====Outputs====<br />
<br />
* '''model''' = model generated in calibration mode.<br />
* '''pred''' = prediction results, prediction mode.<br />
* '''valid''' = validation results, validation mode.<br />
<br />
===Options===<br />
<br />
''options'' = a structure with the following fields:<br />
<br />
* '''pathvar''': [ {0.2} ] standard deviation for random multiplicative scaling. A value of zero will disable the random sample scaling but may increase model sensitivity to scaling errors,<br />
<br />
* '''useoffset''': [ {'off'} | 'on' ] flag determining use of offset term in regression equations (may be necessary for mean-centered x-block),<br />
<br />
* '''display''': [ {'off'} | 'on' ] governs level of display to command window,<br />
<br />
* '''plots''': [ {'none'} | 'intermediate' | 'final' ] governs level of plotting,<br />
<br />
* '''preprocessing''': {[ ] [ ]} cell of two preprocessing structures (see PREPROCESS) defining preprocessing for the x- and y-blocks.<br />
<br />
* '''algorithm''': [ {'direct'} | 'empirical' ] governs solution algorithm. Direct solution is fastest and most stable. Only empirical will work on single-factor models when useoffset is 'on'.<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* 'Standard' = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* 'Compact' = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
* '''confidencelimit''': [ {0.95} ] Confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits.<br />
<br />
In addition, there are several options relating to the algorithm. See FRPCRENGINE.<br />
<br />
===See Also===<br />
<br />
[[frpcrengine]], [[mscorr]], [[pcr]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Dspls&diff=10974Dspls2020-01-03T22:12:25Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Partial Least Squares computational engine using Direct Scores algorithm.<br />
<br />
===Synopsis===<br />
<br />
:[reg,ssq,xlds,ylds,wts,xscrs,yscrs,basis] = dspls(x,y,ncomp,options)<br />
<br />
Please note that the recommended way to build a PLS model using the Direct Scores algorithm from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Performs PLS regression using Direct Scores PLS algorithm as described in Andersson, "A comparison of nine PLS1 algorithms", J. Chemometrics, (www.interscience.wiley.com) DOI: 10.1002/cem.1248<br />
<br />
This modified SIMPLS algorithm provides improved numerical stability for high numbers of latent variables.<br />
<br />
Note: The regression matrices are ordered in '''reg''' such that each '''ny''' (number of Y-block variables) rows correspond to the regression matrix for that particular number of latent variables.<br />
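<br />
The row ordering described in the note can be sketched as follows. The matrix below is filled with placeholder numbers rather than a real DSPLS result, and the assumption that '''reg''' has one column per X-block variable is made for the example.<br />

```matlab
% Illustrative indexing only; reg holds placeholder numbers, not a real DSPLS result.
nx = 4; ny = 2; ncomp = 3;
reg = reshape(1:(ncomp*ny*nx), ncomp*ny, nx);   % stand-in for reg: (ncomp*ny) rows by nx columns

k    = 2;                       % number of latent variables of interest
rows = (k-1)*ny + (1:ny);       % rows of reg belonging to the k-LV model
regk = reg(rows, :);            % ny-by-nx regression matrix for k latent variables

% Predictions for new (already preprocessed) data would then be xnew*regk'.
xnew  = ones(1, nx);
ypred = xnew * regk';
```

With <tt>ny = 2</tt>, rows 1-2 belong to the one-LV model, rows 3-4 to the two-LV model, and so on.<br />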
<br />
====Inputs====<br />
* '''x''' = X-block (predictor block) class "double".<br />
* '''y''' = Y-block (predicted block) class "double".<br />
<br />
====Optional Inputs====<br />
* '''ncomp''' = the number of latent variables to be calculated (positive integer scalar) {default = rank of X-block}.<br />
<br />
====Outputs====<br />
* '''reg''' = matrix of regression vectors.<br />
* '''ssq''' = the sum of squares captured.<br />
* '''xlds''' = X-block loadings.<br />
* '''ylds''' = Y-block loadings.<br />
* '''wts''' = X-block weights, currently returns empty.<br />
* '''xscrs''' = X-block scores.<br />
* '''yscrs''' = Y-block scores, currently returns empty.<br />
* '''basis''' = the basis of X-block loadings.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
<br />
* '''display''' : [ 'off' |{'on'}] governs display to command window<br />
* '''ranktest''' : [ 'none' | 'data' | 'scores' | {'auto'} ] governs type of rank test to perform.<br />
: 'data' = single test on X-block (faster with smaller data blocks and more components).<br />
: 'scores' = test during regression on scores matrix (faster with larger data matrices).<br />
: 'auto' = auto selection, or 'none' = assume sufficient rank.<br />
<br />
===See Also===<br />
<br />
[[nippls]], [[pcr]], [[pls]], [[plsnipal]], [[simpls]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Cls&diff=10973Cls2020-01-03T22:11:04Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Classical Least Squares regression for multivariate Y.<br />
<br />
===Synopsis===<br />
<br />
: model = cls(x,options); %identifies model (calibration step)<br />
: model = cls(x,y,options); %identifies model (calibration step)<br />
: pred = cls(x,model,options); %makes predictions with a new X-block<br />
: valid = cls(x,y,model,options); %makes predictions with new X- & Y-block<br />
: cls % Launches the Analysis window with CLS as the selected method.<br />
<br />
Please note that the recommended way to build a CLS model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
CLS identifies models of the form '''y = Xb + e'''.<br />
<br />
====Inputs====<br />
* '''x''' = X-block: predictor block (2-way array or DataSet Object).<br />
<br />
====Optional Inputs====<br />
<br />
* '''y''' = Y-block: predicted block (2-way array or DataSet Object). The number of columns of y indicates the number of components in the model (each row specifies the mixture present in the given sample). If y is omitted, x is assumed to be a set of pure component responses (e.g. spectra) defining the model itself.<br />
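<br />
The two roles of x and y can be sketched with ordinary least squares in plain MATLAB. This does not call <tt>cls</tt>; it assumes the usual CLS mixture model in which each measured response is a concentration-weighted sum of pure-component responses, and the names <tt>S</tt>, <tt>C</tt> and <tt>Sest</tt> are chosen for the example.<br />

```matlab
% Plain least-squares sketch of CLS; not the cls function itself.
S = [1 0.5 0 0;                      % pure-component responses: 2 components x 4 variables
     0 0.5 1 0.2];
C = [1 0; 0 1; 0.5 0.5; 0.2 0.8];    % known concentrations: 4 samples x 2 components
X = C * S;                           % noise-free mixture responses

% Calibration: estimate the pure-component responses from X and C.
Sest = C \ X;

% Prediction: estimate the concentrations of a new mixture from its response.
cnew = [0.3 0.7];
xnew = cnew * S;
cest = xnew / Sest;                  % least-squares solution of cest*Sest = xnew
```

When y is omitted, the supplied x plays the role of <tt>S</tt> directly and only the prediction step is needed.<br />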
<br />
====Outputs====<br />
* '''model''' = standard model structure containing the CLS model (See [[Standard Model Structure]]).<br />
* '''pred''' = structure array with predictions.<br />
* '''valid''' = structure array with predictions.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
<br />
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
* '''preprocessing''': { [] [] } preprocessing structure (see PREPROCESS).<br />
* '''algorithm''': [ {'ls'} | 'nnls' | 'snnls' | 'cnnls' | 'stepwise' | 'stepwisennls' ] Specifies the regression algorithm.<br />
:Options are: <br />
: ls = a standard least-squares fit.<br />
: snnls = non-negative least squares on spectra (S) only.<br />
: cnnls = non-negative least squares on concentrations (C) only.<br />
: nnls = non-negative least squares fit on both C and S.<br />
: stepwise = stepwise least squares<br />
: stepwisennls = stepwise non-negative least squares<br />
<br />
* '''confidencelimit''': [{0.95}] Confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
===See Also===<br />
<br />
[[analysis]], [[pcr]], [[pls]], [[preprocess]], [[stepwise regrcls]], [[testrobustness]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Annda&diff=10972Annda2020-01-03T22:09:44Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Predictions based on Artificial Neural Network (ANNDA) classification models.<br />
ANNDA is the Artificial Neural Network method for classification. For Artificial Neural Network regression use ANN (see [[Ann]]).<br />
<br />
===Synopsis===<br />
: annda - Launches an Analysis window with ANNDA as the selected method. <br />
: [model] = annda(x, opts); <br />
: [model] = annda(x,y,options);<br />
: [model] = annda(x,y, nhid, options);<br />
: [pred] = annda(x,model,options);<br />
: [valid] = annda(x,model,options);<br />
: [valid] = annda(x,y,model,options); <br />
<br />
Please note that the recommended way to build an ANNDA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Build an ANNDA model from input dataset X, or input X and Y if classes are in Y, using the specified number of layers and layer nodes. <br />
Alternatively, if a model is passed in ANNDA makes a prediction for an input test X block. The ANNDA model <br />
contains quantities (weights etc) calculated from the calibration data. When a model structure is passed in <br />
to ANNDA then these weights do not need to be calculated. <br />
<br />
There are two implementations of ANNDA available referred to as 'BPN' and 'Encog'. <br />
:BPN is a feedforward ANN using backpropagation training and is implemented in Matlab.<br />
:Encog is a feedforward ANN using Resilient Backpropagation training. See [http://en.wikipedia.org/wiki/Rprop Rprop] for further details. <br />
Encog is implemented using the Encog framework [http://www.heatonresearch.com/encog Encog] provided by <br />
Heaton Research, Inc, under the Apache 2.0 license. Further details of Encog Neural Network features are <br />
available at [http://www.heatonresearch.com/wiki/Main_Page#Encog_Documentation Encog Documentation]. <br />
BPN is the ANN version used by default but the user can specify the option 'algorithm' = 'encog' to use Encog instead. <br />
Both implementations should give similar results but one may be faster than the other for different datasets. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (optional) class "double" sample class values,<br />
* '''nhid''' = number of nodes in a single hidden layer ANN, or vector of two numbers, indicating a two hidden layer ANN, representing the number of nodes in the two hidden layers. (this takes precedence over options nhid1 and nhid2),<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'ANNDA',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array)<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.ann.W: Structure containing details of the ANN, including the ANN type, number of hidden layers and the weights.<br />
<br />
* '''pred''' a structure, similar to '''model''' for the new data.<br />
<br />
====Training Termination====<br />
The ANN is trained on a calibration dataset to minimize prediction error, RMSEC. It is important not to overtrain, however, so some criteria for ending training are needed.<br />
<br />
BPN determines the optimal number of learning iteration cycles by selecting the minimum RMSECV based on the calibration data over a range of learning iteration values (1 to options.learncycles). The cross-validation used is determined by option cvi, or else by cvmethod. If neither of these is specified then the minimum RMSEP using a single subset of samples from a 5-fold random split of the calibration data is used. This RMSECV value is based on pre-processed, scaled values and so it is not saved in the model.rmsecv field. Apply cross-validation (see below) to add this information to the model.<br />
<br />
Encog training terminates whenever either a) RMSE becomes smaller than the option 'terminalrmse' value, or b) the rate of improvement of RMSE per 100 training iterations 
becomes smaller than the option 'terminalrmserate' value, or c) time exceeds the option 'maxseconds' value (though results are not optimal if training is stopped prematurely by this time limit). 
Note these RMSE values refer to the internal preprocessed and scaled y values.<br />
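<br />
The three Encog stopping rules can be pictured with a simulated training loop. Nothing here calls Encog: the RMSE curve, the assumed time per block of 100 iterations, and the variable names other than the three option names are invented for the illustration.<br />

```matlab
% Simulated training loop showing the three stopping rules; RMSE curve and timing are invented.
terminalrmse     = 0.05;
terminalrmserate = 1e-9;
maxseconds       = 20;
secondsperblock  = 0.5;       % pretend each block of 100 iterations takes 0.5 s

rmse = 1.0; elapsed = 0; reason = '';
while isempty(reason)
    newrmse = 0.06 + 0.5 * (rmse - 0.06);   % toy update: RMSE decays towards 0.06
    elapsed = elapsed + secondsperblock;
    improvement = rmse - newrmse;           % improvement over the last 100 iterations
    rmse = newrmse;
    if rmse < terminalrmse
        reason = 'terminalrmse';            % rule a) RMSE small enough
    elseif improvement < terminalrmserate
        reason = 'terminalrmserate';        % rule b) improvement has stalled
    elseif elapsed > maxseconds
        reason = 'maxseconds';              % rule c) time limit reached
    end
end
```

In this toy case the RMSE levels off above <tt>terminalrmse</tt>, so training stops on the rate criterion before the time limit is reached.<br />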
<br />
====Cross-validation====<br />
Cross-validation can be applied to ANN when using either the ANN Analysis window or the command line. From the Analysis window specify the cross-validation method in the usual way (clicking on the model icon's red check-mark, or the "Choose Cross-Validation" link in the flowchart). In the cross-validation window the "Maximum Number of Nodes" specifies how many hidden-layer 1 nodes to test over. Viewing RMSECV versus number of hidden-layer 1 nodes (toolbar icon to left of Scores Plot) is useful for choosing the number of layer 1 nodes. From the command line use the crossval method to add crossvalidation information to an existing model.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
* '''display''' : [ 'off' |{'on'}] Governs display<br />
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.<br />
* '''blockdetails''' : [ {'standard'} | 'all' ] extent of detail included in model. 'standard' keeps only y-block, 'all' keeps both x- and y- blocks.<br />
* '''waitbar''' : [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.<br />
* '''algorithm''' : [{'bpn'} | 'encog'] ANN implementation to use.<br />
* '''nhid1''' : [{2}] Number of nodes in first hidden layer.<br />
* '''nhid2''' : [{0}] Number of nodes in second hidden layer.<br />
* '''learnrate''' : [0.125] ANN backpropagation learning rate (bpn only).<br />
* '''learncycles''' : [20] Number of ANN learning iterations (bpn only).<br />
* '''terminalrmse''' : [0.05] Termination RMSE value (of scaled y) for ANN iterations (encog only).<br />
* '''terminalrmserate''' : [1.e-9] Termination rate of change of RMSE per 100 iterations (encog only).<br />
* '''maxseconds''' : [{20}] Maximum duration of ANN training in seconds (encog only).<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the ANNDA model. 'pca' uses a simple PCA model to compress the information. 'pls' uses a pls model. Compression can make the ANNDA more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''': [{'yes'} | 'no'] Use Mahalanobis Distance corrected scores from the compression model.<br />
* '''cvmethod''' : [{'con'} | 'vet' | 'loo' | 'rnd'] CV method, OR [] for Kennard-Stone single split.<br />
* '''cvsplits''' : [{5}] Number of CV subsets.<br />
* '''cvi''' : ''M'' element vector with integer elements allowing user defined subsets. (cvi) is a vector with the same number of elements as x has rows i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:<br />
::cvi(i) = -2 the sample is always in the test set,<br />
::cvi(i) = -1 the sample is always in the calibration set,<br />
::cvi(i) = 0 the sample is never used, and<br />
::cvi(i) = 1,2,3... defines each test subset.<br />
* '''activationfunction''' : For the default algorithm, 'bpn', this option uses a 'sigmoid' activation function, f(x) = 1/(1+exp(-x)). For the 'encog' algorithm this activationfunction option has two choices, 'tanh' as default, or 'sigmoid'.<br />
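<br />
A user-defined '''cvi''' vector can be built and decoded as in the following plain-MATLAB sketch. It only applies the value definitions listed above; how the toolbox processes the vector internally is not shown, and the names <tt>testidx</tt> and <tt>calidx</tt> are chosen for the example.<br />

```matlab
% Build a cvi vector for 8 samples and list the samples used for one test subset.
nsamp = 8;
cvi = zeros(1, nsamp);
cvi([1 2]) = -1;          % samples 1-2: always in the calibration set
cvi(3)     = -2;          % sample 3: always in the test set
cvi(4)     = 0;           % sample 4: never used
cvi([5 6]) = 1;           % samples 5-6: test subset 1
cvi([7 8]) = 2;           % samples 7-8: test subset 2

j = 1;                                              % subset currently held out
testidx = find(cvi == j | cvi == -2);               % held-out samples for this split
calidx  = find(cvi == -1 | (cvi > 0 & cvi ~= j));   % samples used to fit this split
```

The vector would then be supplied as <tt>options.cvi</tt>, with one element per row of x.<br />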
<br />
===Additional information on the ‘BPN’ ANNDA implementation===<br />
The “BPN” implementation of ANNDA is a conventional feedforward back-propagation neural network where the weights are updated, or ‘trained’, so as to reduce the magnitude of the prediction error, except that the gradient-descent method of updating the weights is different from the usual “delta rule” approach. In the traditional delta-rule method the weights are changed at each increment of training time by a constant fraction of the contributing error gradient terms, leading to a reduced prediction error. In this “BPN” implementation the search for optimal weights by gradient-descent is treated as a continuous system, rather than incremental. The evolution of the weights with respect to training time is solved as a set of differential equations using a solver appropriate for systems where the solution (weights) may involve very different timescales. Most weights evolve slowly towards their final values but some weights may have periods of faster change. A reference paper for the BPN implementation is:<br />
<br />
Owens A J and Filkin D L 1989 Efficient training of the back propagation network by solving a system of stiff<br />
ordinary differential equations Proc. Int. Joint Conf. on Neural Networks vol II (IEEE Press) pp 381–6.<br />
<br />
====Algorithm parameters: learncycles and learnrate====<br />
This BPN technique results in much faster training than with the traditional delta-rule approach. The training is governed by two parameters, ‘learncycles’ and ‘learnrate’. The learnrate parameter specifies the training time duration of the first learncycle. Each subsequent learncycle’s time duration is twice the previous learncycle’s duration. The performance of the ANN is evaluated at the end of each learncycle interval by calculating the cross-validation prediction error, RMSECV. The RMSECV initially decreases rapidly with training time but eventually starts to increase again as the ANN begins to overfit the data. The number of training cycles which yields the minimum RMSECV therefore provides an estimate of the optimal ANN training duration, for the given learnrate value. The ANN model contains these RMSECV values in model.detail.ann.rmsecviter, and the optimal, minimum RMSECV occurs at index model.detail.ann.niter, which will be smaller than or equal to the learncycles value. It is useful to check rmsecviter to see if a minimum RMSECV has been attained, but also to see if you are using too many learn cycles. Reducing the number of learncycles can significantly speed up ANN training.<br />
Note, the model.detail.ann.rmsecviter values are only used to pick the optimal number of learncycles. These rmsecviter values are calculated using scaled y and should not be compared to the reported RMSEC, RMSECV or RMSEP.<br />
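<br />
The doubling schedule and the choice of the optimal cycle can be written out in a few lines of plain MATLAB. The RMSECV values below are invented to show the typical fall-then-rise shape; in a real model they would be read from model.detail.ann.rmsecviter.<br />

```matlab
% Training-time schedule implied by learnrate and learncycles; RMSECV values are invented.
learnrate   = 0.125;
learncycles = 6;

durations = learnrate * 2.^(0:learncycles-1);   % each cycle lasts twice as long as the previous one
elapsed   = cumsum(durations);                  % total training time at the end of each cycle

% Made-up RMSECV per cycle: falls quickly, then rises as the network overfits.
rmsecviter = [0.90 0.55 0.40 0.35 0.38 0.45];
[rmin, niter] = min(rmsecviter);                % niter plays the role of model.detail.ann.niter
```

Here the minimum falls at cycle 4, so the last two cycles, which account for most of the total training time, add nothing and learncycles could be reduced.<br />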
<br />
====Usage from ANNDA Analysis window====<br />
<br />
The command line function “annda” has input parameter “nhid” specifying the number of nodes in the hidden layer(s) and builds the optimal model for that network. When using the ANNDA Analysis window, however, it is possible to specify a scan over a range of hidden layer nodes to use. This is enabled by setting the “Maximum number of Nodes” value in the cross-validation window. This only works for BPN ANNDAs having a single hidden layer. This causes ANNDA models to be built for the range of hidden layer nodes up to the specified number and the resulting RMSECV plotted versus the number of nodes is shown by clicking on the “Plot cross-validation results” plot icon in the ANNDA Analysis window’s toolbar. This can be useful for deciding how many nodes to use. Note that this plot is only advisory. The resulting model is built with the input parameter number of nodes, ‘nhid’, and its model.detail.rmsecv value relates to this number of nodes. It is important to check for the optimal number of nodes to use in the ANN but this feature can greatly lengthen the time taken to build the ANNDA model and should be set = 1 once the number of hidden nodes is decided.<br />
<br />
====Summary of model building speed-up settings====<br />
<br />
=====From the Analysis window:=====<br />
ANNDA in PLS_Toolbox or Solo version 8.2 can be very slow if you use cross-validation (CV). This is mostly due to the CV settings window also specifying a test to find the optimal number of hidden layer 1 nodes, testing ANN models with 1, 2, …,20 nodes, each with CV. This is set by the top slider field “Maximum Number of Nodes L1”. For example, if you want to build an ANN model with 4 layer 1 nodes (using the “ANNDA Settings” field) but leave the CV settings window’s top slider set = 20, then you will actually build 20 models, each with CV, and save the RMSECV from each. This can be very slow, especially for the models with many nodes.<br />
<br />
To make ANNDA perform faster it is recommended that you drag this CV window’s “Maximum Number of Nodes L1” slider to the left, setting = 1, unless you really want to see the results of such a parameter search over the range specified by this slider. This is the default in PLS_Toolbox and Solo versions after version 8.2. The RMSECV versus number of Layer 1 Nodes can be seen by clicking on the “Plot cross-validation results” icon (next to the Scores Plot icon).<br />
<br />
Summary: To make ANNDA perform faster:<br />
<br />
1. Move the top CV slider to the left, setting value = 1.<br />
<br />
2. Turn CV off, or use a small number of CV splits.<br />
<br />
3. Choose to use a small number of L1 nodes in the ANNDA settings window.<br />
<br />
4. Don't use 2 hidden layers. This is very slow.<br />
<br />
=====From the command line=====<br />
1. Initially build ANNDA without cross-validation so as to decide on values for learnrate and learncycles by examining where the minimum value of model.detail.ann.rmsecviter occurs versus learncycles. Note this uses a single-split CV to estimate rmsecv when the ANNDA cross-validation is set as "None". It is inefficient to use a larger than necessary value for option "learncycles".<br />
<br />
2. Determine the number of hidden layer nodes to use by building a range of models with different numbers of nodes, nhid1, nhid2. If using the ANNDA Analysis window and the ANN has a single hidden layer then this can be done conveniently by using the “Maximum number of Nodes L1” setting in the cross-validation settings window. At this survey stage it is best to use a simple cross-validation, with a small number of splits and iterations.<br />
<br />
===See Also===<br />
<br />
[[annda]], [[analysis]], [[crossval]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Ann&diff=10971Ann2020-01-03T22:08:41Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Predictions based on Artificial Neural Network (ANN) regression models.<br />
<br />
===Synopsis===<br />
: ann - Launches an Analysis window with ANN as the selected method. <br />
: [model] = ann(x,y,options);<br />
: [model] = ann(x,y, nhid, options);<br />
: [pred] = ann(x,model,options);<br />
: [valid] = ann(x,y,model,options);<br />
<br />
Please note that the recommended way to build an ANN model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]].<br />
<br />
===Description===<br />
<br />
Build an ANN model from input X and Y block data using the specified number of layers and layer nodes. <br />
Alternatively, if a model is passed in ANN makes a Y prediction for an input test X block. The ANN model <br />
contains quantities (weights etc) calculated from the calibration data. When a model structure is passed in <br />
to ANN then these weights do not need to be calculated. <br />
<br />
There are two implementations of ANN available referred to as 'BPN' and 'Encog'. <br />
:BPN is a feedforward ANN using backpropagation training and is implemented in Matlab.<br />
:Encog is a feedforward ANN using Resilient Backpropagation training. See [http://en.wikipedia.org/wiki/Rprop Rprop] for further details. <br />
Encog is implemented using the Encog framework [http://www.heatonresearch.com/encog Encog] provided by <br />
Heaton Research, Inc, under the Apache 2.0 license. Further details of Encog Neural Network features are <br />
available at [http://www.heatonresearch.com/wiki/Main_Page#Encog_Documentation Encog Documentation]. <br />
BPN is the ANN version used by default but the user can specify the option 'algorithm' = 'encog' to use Encog instead. <br />
Both implementations should give similar results but one may be faster than the other for different datasets. <br />
BPN is currently the only version which calculates RMSECV.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (predictor block) class "double" or "dataset", containing numeric values,<br />
* '''y''' = Y-block (predicted block) class "double" or "dataset", containing numeric values,<br />
* '''nhid''' = number of nodes in a single hidden layer ANN, or a vector of two numbers, indicating a two hidden layer ANN, giving the number of nodes in each of the two hidden layers (this takes precedence over options nhid1 and nhid2),<br />
* '''model''' = previously generated model (when applying model to new data).<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'ANN',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array)<br />
** '''detail''': sub-structure with additional model details and results, including:<br />
*** model.detail.ann.W: Structure containing details of the ANN, including the ANN type, number of hidden layers and the weights.<br />
<br />
* '''pred''' a structure, similar to '''model''' for the new data.<br />
<br />
====Training Termination====<br />
The ANN is trained on a calibration dataset to minimize prediction error, RMSEC. It is important not to overtrain, however, so some criteria for ending training are needed.<br />
<br />
BPN determines the optimal number of learning iteration cycles by selecting the minimum RMSECV based on the calibration data over a range of learning iteration values (1 to options.learncycles). The cross-validation used is determined by option cvi, or else by cvmethod. If neither of these is specified then the minimum RMSEP using a single subset of samples from a 5-fold random split of the calibration data is used. This RMSECV value is based on pre-processed, scaled values and so it is not saved in the model.rmsecv field. Apply cross-validation (see below) to add this information to the model.<br />
<br />
Encog training terminates whenever either a) RMSE becomes smaller than the option 'terminalrmse' value, or b) the rate of improvement of RMSE per 100 training iterations <br />
becomes smaller than the option 'terminalrmserate' value, or c) time exceeds the option 'maxseconds' value (though results are not optimal if training is stopped prematurely by this time limit). <br />
Note these RMSE values refer to the internal preprocessed and scaled y values.<br />
<br />
====Cross-validation====<br />
Cross-validation can be applied to ANN when using either the ANN Analysis window or the command line. From the Analysis window specify the cross-validation method in the usual way (clicking on the model icon's red check-mark, or the "Choose Cross-Validation" link in the flowchart). In the cross-validation window the "Maximum Number of Nodes" specifies how many hidden-layer 1 nodes to test over. Viewing RMSECV versus number of hidden-layer 1 nodes (toolbar icon to left of Scores Plot) is useful for choosing the number of layer 1 nodes. From the command line use the crossval method to add crossvalidation information to an existing model.<br />
<br />
===Options===<br />
<br />
options = a structure array with the following fields:<br />
* '''display''' : [ 'off' |{'on'}] Governs display<br />
* '''plots''': [ {'none'} | 'final' ] governs plotting of results.<br />
* '''blockdetails''' : [ {'standard'} | 'all' ] extent of detail included in model. 'standard' keeps only y-block, 'all' keeps both x- and y- blocks.<br />
* '''waitbar''' : [ 'off' |{'auto'}| 'on' ] governs use of waitbar during analysis. 'auto' shows waitbar if delay will likely be longer than a reasonable waiting period.<br />
* '''algorithm''' : [{'bpn'} | 'encog'] ANN implementation to use.<br />
* '''nhid1''' : [{2}] Number of nodes in first hidden layer.<br />
* '''nhid2''' : [{0}] Number of nodes in second hidden layer.<br />
* '''learnrate''' : [0.125] ANN backpropagation learning rate (bpn only).<br />
* '''learncycles''' : [20] Number of ANN learning iterations (bpn only).<br />
* '''terminalrmse''' : [0.05] Termination RMSE value (of scaled y) for ANN iterations (encog only).<br />
* '''terminalrmserate''' : [1.e-9] Termination rate of change of RMSE per 100 iterations (encog only).<br />
* '''maxseconds''' : [{20}] Maximum duration of ANN training in seconds (encog only).<br />
* '''preprocessing''': {[] []} preprocessing structures for x and y blocks (see PREPROCESS).<br />
* '''compression''': [{'none'}| 'pca' | 'pls' ] type of data compression to perform on the x-block prior to calculating or applying the ANN model. 'pca' uses a simple PCA model to compress the information. 'pls' uses a pls model. Compression can make the ANN more stable and less prone to overfitting.<br />
* '''compressncomp''': [1] Number of latent variables (or principal components) to include in the compression model.<br />
* '''compressmd''': [{'yes'} | 'no'] Use Mahalanobis Distance corrected.<br />
* '''cvmethod''' : [{'con'} | 'vet' | 'loo' | 'rnd'] CV method, OR [] for Kennard-Stone single split.<br />
* '''cvsplits''' : [{5}] Number of CV subsets.<br />
* '''cvi''' : ''M'' element vector with integer elements allowing user defined subsets. (cvi) is a vector with the same number of elements as x has rows i.e., length(cvi) = size(x,1). Each cvi(i) is defined as:<br />
::cvi(i) = -2 the sample is always in the test set,<br />
::cvi(i) = -1 the sample is always in the calibration set,<br />
::cvi(i) = 0 the sample is never used, and<br />
::cvi(i) = 1,2,3... defines each test subset.<br />
* '''activationfunction''' : For the default algorithm, 'bpn', this option uses a 'sigmoid' activation function, f(x) = 1/(1+exp(-x)). For the 'encog' algorithm this activationfunction option has two choices, 'tanh' as default, or 'sigmoid'.<br />
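<br />
As an illustration of the '''cvi''' coding described above, the following sketch assembles a user-defined vector in plain MATLAB. The sample count and the subset layout are hypothetical and chosen only to show each code value; the vector would normally be assigned to the cvi field of the ANN options structure.<br />

```matlab
% Hypothetical calibration set with 10 samples (rows of x).
nsamples = 10;

cvi = zeros(nsamples,1);           % 0: sample is never used
cvi(1:2)  = -1;                    % samples 1-2: always in the calibration set
cvi(3)    = -2;                    % sample 3: always in the test set
cvi(5:10) = repmat((1:3)',2,1);    % samples 5-10: test subsets 1,2,3,1,2,3
% Sample 4 keeps the value 0 and is therefore left out entirely.

% cvi needs one element per row of x, i.e. length(cvi) == size(x,1).
options = struct('cvi', cvi);      % illustrative stand-in for the ann options structure
disp(cvi')
```

Each positive integer labels one test subset, so this layout defines three subsets of two samples each.<br />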
<br />
===Additional information on the ‘BPN’ ANN implementation===<br />
The “BPN” implementation of ANN is a conventional feedforward back-propagation neural network where the weights are updated, or ‘trained’, so as to reduce the magnitude of the prediction error, except that the gradient-descent method of updating the weights is different from the usual “delta rule” approach. In the traditional delta-rule method the weights are changed at each increment of training time by a constant fraction of the contributing error gradient terms, leading to a reduced prediction error. In this “BPN” implementation the search for optimal weights by gradient-descent is treated as a continuous system, rather than incremental. The evolution of the weights with respect to training time is solved as a set of differential equations using a solver appropriate for systems where the solution (weights) may involve very different timescales. Most weights evolve slowly towards their final values but some weights may have periods of faster change. A reference paper for the BPN implementation is:<br />
<br />
Owens A J and Filkin D L 1989 Efficient training of the back propagation network by solving a system of stiff<br />
ordinary differential equations Proc. Int. Joint Conf. on Neural Networks vol II (IEEE Press) pp 381–6.<br />
<br />
====Algorithm parameters: learncycles and learnrate====<br />
This BPN technique results in much faster training than with the traditional delta-rule approach. The training is governed by two parameters, ‘learncycles’ and ‘learnrate’. The learnrate parameter specifies the training time duration of the first learncycle. Each subsequent learncycle’s time duration is twice the previous learncycle’s duration. The performance of the ANN is evaluated at the end of each learncycle interval by calculating the cross-validation prediction error, RMSECV. The RMSECV initially decreases rapidly with training time but eventually starts to increase again as the ANN begins to overfit the data. The number of training cycles which yields the minimum RMSECV therefore provides an estimate of the optimal ANN training duration, for the given learnrate value. The ANN model contains these RMSECV values in model.detail.ann.rmsecviter, and the optimal, minimum RMSECV occurs at index model.detail.ann.niter, which will be smaller than or equal to the learncycles value. It is useful to check rmsecviter to see if a minimum RMSECV has been attained, but also to see if you are using too many learn cycles. Reducing the number of learncycles can significantly speed up ANN training.<br />
Note, the model.detail.ann.rmsecviter values are only used to pick the optimal number of learncycles. These rmsecviter values are calculated using scaled y and should not be compared to the reported RMSEC, RMSECV or RMSEP.<br />
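<br />
The doubling schedule and the choice of the optimal cycle can be sketched in plain MATLAB as follows. The learnrate and learncycles values are the defaults listed under Options; the rmsecviter vector is hypothetical and only stands in for the values stored in model.detail.ann.rmsecviter.<br />

```matlab
learnrate   = 0.125;   % training-time duration of the first learn cycle
learncycles = 20;      % number of learn cycles

% Each learn cycle lasts twice as long as the previous one.
durations = learnrate * 2.^(0:learncycles-1);
cumtime   = cumsum(durations);   % total training time at the end of each cycle

% Hypothetical RMSECV values, one per learn cycle evaluated so far.
rmsecviter = [0.90 0.61 0.47 0.41 0.38 0.39 0.43 0.50];

% The cycle with the smallest RMSECV estimates the optimal training duration.
[minrmsecv, niter] = min(rmsecviter);
fprintf('Minimum RMSECV %.2f at learn cycle %d\n', minrmsecv, niter);
```

Because the durations double, the total training time after ''n'' cycles is learnrate*(2^''n'' - 1), so a minimum that occurs well before the last cycle suggests that learncycles can be reduced considerably.<br />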
<br />
====Usage from ANN Analysis window====<br />
<br />
The command line function “ann” has input parameter “nhid” specifying the number of nodes in the hidden layer(s) and builds the optimal model for that network. When using the ANN Analysis window, however, it is possible to specify a scan over a range of hidden layer nodes to use. This is enabled by setting the “Maximum number of Nodes” value in the cross-validation window, and only works for BPN ANNs having a single hidden layer. ANN models are then built for the range of hidden layer nodes up to the specified number, and the resulting RMSECV versus the number of nodes is shown by clicking on the “Plot cross-validation results” plot icon in the ANN Analysis window’s toolbar. This can be useful for deciding how many nodes to use. Note that this plot is only advisory. The resulting model is built with the input parameter number of nodes, ‘nhid’, and its model.detail.rmsecv value relates to this number of nodes. It is important to check for the optimal number of nodes to use in the ANN, but this feature can greatly lengthen the time taken to build the ANN model, so “Maximum number of Nodes” should be set = 1 once the number of hidden nodes is decided.<br />
<br />
====Summary of model building speed-up settings====<br />
<br />
=====From the Analysis window:=====<br />
ANN in PLS_Toolbox or Solo version 8.2 can be very slow if you use cross-validation (CV). This is mostly due to the CV settings window also specifying a test to find the optimal number of hidden layer 1 nodes, testing ANN models with 1, 2, …,20 nodes, each with CV. This is set by the top slider field “Maximum Number of Nodes L1”. For example, if you want to build an ANN model with 4 layer 1 nodes (using the “ANN Settings” field) but leave the CV settings window’s top slider set = 20, then you will actually build 20 models, each with CV, and save the RMSECV from each. This can be very slow, especially for the models with many nodes.<br />
<br />
To make ANN perform faster it is recommended that you drag this CV window’s “Maximum Number of Nodes L1” slider to the left, setting = 1, unless you really want to see the results of such a parameter search over the range specified by this slider. This is the default in PLS_Toolbox and Solo versions after version 8.2. The RMSECV versus number of Layer 1 Nodes can be seen by clicking on the “Plot cross-validation results” icon (next to the Scores Plot icon).<br />
<br />
Summary: To make ANN perform faster:<br />
<br />
1. Move the top CV slider to the left, setting value = 1.<br />
<br />
2. Turn CV off, or use a small number of CV splits.<br />
<br />
3. Choose to use a small number of L1 nodes in the ANN settings window.<br />
<br />
4. Don't use 2 hidden layers. This is very slow.<br />
<br />
=====From the command line=====<br />
1. Initially build ANN without cross-validation so as to decide on values for learnrate and learncycles by examining where the minimum value of model.detail.ann.rmsecviter occurs versus learncycles. Note this uses a single-split CV to estimate rmsecv when the ANN cross-validation is set as "None". It is inefficient to use a larger than necessary value for option "learncycles".<br />
<br />
2. Determine the number of hidden layer nodes to use by building a range of models with different numbers of nodes, nhid1, nhid2. If using the ANN Analysis window and the ANN has a single hidden layer then this can be done conveniently by using the “Maximum number of Nodes L1” setting in the cross-validation settings window. At this survey stage it is best to use a simple cross-validation, with a small number of splits and iterations.<br />
<br />
===See Also===<br />
<br />
[[annda]], [[analysis]], [[crossval]], [[lwr]], [[modelselector]], [[pls]], [[pcr]], [[preprocess]], [[svm]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mcr&diff=10970Mcr2020-01-03T21:46:24Z<p>Lyle: </p>
<hr />
<div><br />
===Purpose===<br />
<br />
Multivariate curve resolution with constraints.<br />
<br />
===Synopsis===<br />
<br />
:model = mcr(x,ncomp,''options'') %calibrate <br />
:model = mcr(x,c0,''options'') %calibrate with explicit initial guess<br />
:pred = mcr(x,model,''options'') %predict<br />
:mcr % Launches an Analysis window with mcr as the selected method.<br />
<br />
Please note that the recommended way to build a MCR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
MCR decomposes a matrix '''X''' as '''CS''' such that '''X''' = '''CS''' + '''E''' where '''E''' is minimized in a least squares sense. By default, this is done using the alternating least squares (ALS) algorithm. For details on the ALS algorithm and constraints available in MCR, see the [[als]] reference page.<br />
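<br />
The decomposition '''X''' = '''CS''' + '''E''' can be illustrated with a bare alternating least squares loop in plain MATLAB. This is only a sketch under simplifying assumptions: the data are synthetic and noise-free, and no constraints are applied, so it is not the toolbox's ALS implementation and the recovered factors are not unique.<br />

```matlab
% Synthetic two-component data (hypothetical): X = Ctrue*Strue, no noise.
m = 20;  n = 15;
t = (1:m)';  v = 1:n;
Ctrue = [t/m, (1 - t/m).^2];                       % m-by-2 contributions
Strue = [exp(-(v-5).^2/8); exp(-(v-10).^2/8)];     % 2-by-n spectra
X = Ctrue*Strue;

C = Ctrue + 0.1*[sin(t), cos(t)];   % perturbed initial guess for C
for iter = 1:20
    S = C\X;    % least-squares estimate of S for the current C
    C = X/S;    % least-squares estimate of C for the current S
end

E = X - C*S;    % residual matrix minimized in the least squares sense
fprintf('Residual norm: %g\n', norm(E,'fro'));
```

The product C*S reproduces X here even though C and S generally differ from Ctrue and Strue; constraints such as those described on the [[als]] page are what narrow the solution down to physically meaningful factors.<br />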
<br />
When called with new data and a model structure, MCR performs a prediction (applies the model to the new data) returning the projection of the new data onto the previously recovered loadings (i.e. estimated spectra).<br />
<br />
In addition to the constraints and options listed in [[als]], other pages which may be of interest include [[MCR Constraints]] which describes setting constraints in the [Analysis] interface, and [[MCR Contrast Constraint]] which discusses the contrast constraint option.<br />
<br />
====Inputs====<br />
* '''x''' = the matrix to be decomposed (size ''m'' by ''n'')<br />
* '''ncomp''' or '''c0''' or '''model''' :<br />
** '''ncomp''' = the number of components to extract<br />
** '''c0''' = the explicit initial guess. If c0 is size ''m'' by ''k'', where ''k'' is the number of factors, then it is assumed to be the initial guess for '''C'''. If c0 is size ''k'' by ''n'' then it is assumed to be the initial guess for '''S'''. If ''m''=''n'' then c0 is assumed to be the initial guess for '''C'''. Optional input ''options'' is described below.<br />
** '''model''' = a previously calculated MCR model structure to apply to the data in input '''x'''.<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure containing the results of the analysis. The estimated contributions '''C '''are stored in model.loads{2} and the estimated spectra '''S '''in model.loads{1}. Sum-squared residuals for samples and variables can be found in model.ssqresiduals{1} and model.ssqresiduals{2}, respectively. See the chemometrics tutorial for more information on the MCR method and models. Note that the sum-squared captured table contains various statistics on the information captured by each component. Please see [[MCR and PARAFAC Variance Captured]] for details.<br />
<br />
===Options===<br />
<br />
* '''''options''''' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
<br />
* '''waitbar''': [ 'off' | 'on' | {'auto'} ] governs use of waitbar,<br />
<br />
* '''preprocessing''': { [] } preprocessing to apply to x-block (see PREPROCESS).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''initmode''': [1 | 2] Mode of x for automatic initialization.<br />
<br />
* '''confidencelimit''': [{0.95}] Confidence level for Q limits. <br />
<br />
* '''alsoptions''': ['options'] options passed to ALS subroutine (see ALS).<br />
<br />
The default options can be retrieved using: options = mcr('options');.<br />
<br />
===See Also===<br />
<br />
[[als]], [[analysis]], [[evolvfa]], [[ewfa]], [[fasternnls]], [[fastnnls]], [[fastnnls_sel]], [[mlpca]], [[parafac]], [[parafac2]], [[plotloads]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pcr&diff=10969Pcr2020-01-03T21:45:37Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Principal Components Regression: multivariate inverse least squares regression.<br />
<br />
===Synopsis===<br />
<br />
:model = pcr(x,y,ncomp,''options'') %identifies model (calibration step)<br />
:pred = pcr(x,model,''options'') %applies model to a new X-block<br />
:valid = pcr(x,y,model,''options'') %applies model to a new X-block, with corresponding new Y values<br />
:pcr % Launches an Analysis window with PCR as the selected method.<br />
<br />
Please note that the recommended way to build a PCR model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PCR calculates a single principal components regression model using the given number of components <tt>ncomp</tt> to predict <tt>y</tt> from measurements <tt>x</tt>, OR applies an existing PCR model to a new set of data <tt>x</tt>.<br />
<br />
To make predictions, the inputs are <tt>x</tt> the new predictor x-block (2-way array class "double" or "dataset"), and <tt>model</tt> the PCR model. The output <tt>pred</tt> is a structure, similar to <tt>model</tt>, that contains scores, predictions, etc. for the new data.<br />
<br />
If new y-block measurements are also available for the new data, then the inputs are <tt>x</tt> the new x-block (2-way array class "double" or "dataset"), <tt>y</tt> the new y-block (2-way array class "double" or "dataset"), and <tt>model</tt> the PCR model to apply. The output <tt>valid</tt> is a structure, similar to <tt>model</tt>, that contains scores, predictions, and additional y-block statistics etc. for the new data.<br />
<br />
In prediction and validation modes, the same model structure is used but predictions are provided in the <tt>model.detail.pred</tt> field.<br />
<br />
Note: Calling '''pcr''' with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
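<br />
The three calling modes described above can be sketched as follows. The calls follow the Synopsis; the data are random and purely hypothetical, and PLS_Toolbox must be on the MATLAB path for this to run.<br />

```matlab
% Hypothetical data: 30 calibration and 10 new samples, 8 x-variables, 1 y-variable.
xcal = rand(30,8);  ycal = xcal*(1:8)' + 0.01*randn(30,1);
xnew = rand(10,8);  ynew = xnew*(1:8)';

options = pcr('options');    % default options structure
options.display = 'off';     % no command window output
options.plots   = 'none';    % no plots

model = pcr(xcal, ycal, 3, options);       % calibration with ncomp = 3
pred  = pcr(xnew, model, options);         % prediction on a new X-block
valid = pcr(xnew, ynew, model, options);   % validation with known new Y values
```

Turning display and plots off is a choice made here to keep the example quiet; the defaults are 'on' and 'final'.<br />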
<br />
====Inputs====<br />
<br />
* '''x''' = X-block data (2-way array or DataSet Object)<br />
* '''y''' = Y-block data (2-way array or DataSet Object)<br />
* '''ncomp''' = number of components to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' discussed below<br />
<br />
====Outputs====<br />
<br />
The output is a standard model structure with the following fields (see [[Standard Model Structure]]):<br />
<br />
* '''modeltype''': 'PCR',<br />
* '''datasource''': structure array with information about input data,<br />
* '''date''': date of creation,<br />
* '''time''': time of creation,<br />
* '''info''': additional model information,<br />
* '''reg''': regression vector,<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
* '''pred''': 2 element cell array containing model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array), and the y-block predictions.<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
* '''description''': cell array with text description of model, and<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
<br />
* '''outputversion''': [ 2 | {3} ], governs output format (discussed below),<br />
<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively),<br />
<br />
* '''algorithm''': [ {'svd'} | 'robustpcr' | 'correlationpcr' | 'frpcr' ], governs which algorithm to use.<br />
** 'svd' = standard singular value decomposition algorithm. <br />
** 'robustpcr' = robust algorithm with automatic outlier detection. <br />
** 'correlationpcr' = standard PCR with re-ordering of factors in order of y-variance captured.<br />
** 'frpcr' = full-ratio PCR (a.k.a. optimized scaling) with automatic sample scale correction. Note that with FRPCR, models generally perform better without mean-centering on the x-block.<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
* '''confidencelimit''': [ {'0.95'} ], confidence level for Q and T2 limits. A value of zero (0) disables calculation of confidence limits,<br />
<br />
* '''roptions''': structure of options to pass to '''rpcr''' (robust PCR engine from the Libra Toolbox). Only used when algorithm is 'robustpcr',<br />
<br />
* '''alpha''' : [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpcr'.<br />
<br />
* '''intadjust''' : [ {0} ], if equal to one, the intercept adjustment for the LTS-regression will be calculated. See '''ltsregres''' for details (Libra Toolbox).<br />
<br />
The default options can be retrieved using: options = pcr('options');.<br />
<br />
====OUTPUTVERSION====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[b,ssq,t,p] = pcr(x,y,ncomp,''options'')<br />
<br />
where the outputs are<br />
<br />
* '''b''' = matrix of regression vectors or matrices for each number of principal components up to ncomp,<br />
<br />
* '''ssq''' = the sum of squares information, <br />
<br />
* '''t''' = x-block scores, and<br />
<br />
* '''p''' = x-block loadings.<br />
<br />
Note: The regression matrices are ordered in '''b''' such that each ''Ny'' (number of y-block variables) rows correspond to the regression matrix for that particular number of principal components.<br />
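<br />
The row ordering of '''b''' can be illustrated with a small indexing sketch in plain MATLAB. The dimensions and the placeholder contents of b are hypothetical, and it is assumed here that the columns of b correspond to the x-block variables.<br />

```matlab
Ny    = 2;   % number of y-block variables
nx    = 4;   % number of x-block variables (assumed to be the columns of b)
ncomp = 3;   % number of principal components requested

% Placeholder for the b output: ncomp stacked blocks of Ny rows each.
b = reshape(1:ncomp*Ny*nx, ncomp*Ny, nx);

% Rows holding the regression matrix for a model with k principal components.
k    = 2;
rows = (k-1)*Ny + (1:Ny);
bk   = b(rows,:);
disp(rows)
```

With these sizes the regression matrix for the two-component model occupies rows 3 and 4 of b.<br />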
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[frpcr]], [[mlr]], [[modelstruct]], [[pca]], [[pls]], [[preprocess]], [[ridge]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pls&diff=10968Pls2020-01-03T21:45:05Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Partial least squares regression for univariate or multivariate y-block.<br />
<br />
===Synopsis===<br />
<br />
:model = pls(x,y,ncomp,''options'') %identifies model (calibration step)<br />
:pred = pls(x,model,''options'') %makes predictions with a new X-block<br />
:valid = pls(x,y,model,''options'') %makes predictions with new X- & Y-block<br />
:pls % launches analysis window with PLS selected<br />
<br />
Please note that the recommended way to build a PLS model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PLS calculates a single partial least squares regression model using the given number of components <tt>ncomp</tt> to predict a dependent variable <tt>y</tt> from a set of independent variables <tt>x</tt>.<br />
<br />
Alternatively, PLS can be used in 'prediction mode' to apply a previously built PLS model in <tt>model</tt> to an external set of test data in <tt>x</tt> (2-way array class "double" or "dataset"), in order to generate y-values for these data. <br />
<br />
Furthermore, if matching x-block and y-block measurements are available for an external test set, then PLS can be used in 'validation mode' to predict the y-values of the test data from the model <tt>model</tt> and <tt>x</tt>, and allow comparison of these predicted y-values to the known y-values <tt>y</tt>.<br />
<br />
Note: Calling pls with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = the independent variable (X-block) data (2-way array class "double" or class "dataset")<br />
* '''y''' = the dependent variable (Y-block) data (2-way array class "double" or class "dataset")<br />
* '''ncomp''' = the number of components to be calculated (positive integer scalar)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'PLS',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''reg''': regression vector,<br />
** '''loads''': cell array with model loadings for each mode/dimension,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array), and<br />
*** the y-block predictions.<br />
** '''wts''': double array with X-block weights,<br />
** '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
** '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
** '''description''': cell array with text description of model, and<br />
** '''detail''': sub-structure with additional model details and results.<br />
<br />
* '''pred''' a structure, similar to '''model''', that contains scores, predictions, etc. for the new data.<br />
<br />
* '''valid''' a structure, similar to '''model''', that contains scores, predictions, and additional y-block statistics, etc. for the new data.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''' [ 'none' | {'final'} ], governs level of plotting,<br />
* '''outputversion''': [ 2 | {3} ], governs output format (see below),<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'nip' | {'sim'} | 'dspls' | 'robustpls' ], PLS algorithm to use: NIPALS, SIMPLS {default}, Direct Scores, or robust pls (with automatic outlier detection).<br />
* '''orthogonalize''': [ {'off'} | 'on' ] Orthogonalize model to condense y-block variance into first latent variable; 'on' = produce orthogonalized model. Regression vector and predictions are NOT changed by this option, only the loadings, weights, and scores. See [[orthogonalizepls]] for more information.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
*'''confidencelimit''': [ {0.95} ], confidence level for Q and T<sup>2</sup> limits. A value of zero (0) disables calculation of confidence limits,<br />
*'''weights''': [ {'none'} | 'hist' | 'custom' ] governs sample weighting. 'none' does no weighting. 'hist' performs histogram weighting in which large numbers of samples at individual y-values are down-weighted relative to small numbers of samples at other values. 'custom' uses the weighting specified in the weightsvect option.<br />
*'''weightsvect''': [ ] Used only with custom weights. The vector specified must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.<br />
* '''roptions''': structure of options to pass to rsimpls (robust PLS engine from the Libra Toolbox).<br />
** '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpls'.<br />
<br />
The default options can be retrieved using: options = pls('options');.<br />
<br />
====Outputversion====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[b,ssq,p,q,w,t,u,bin] = pls(x,y,ncomp,''options'')<br />
<br />
where the outputs are as defined for the [[nippls]] function. This is provided for backwards compatibility. It is recommended that users call the [[simpls]] or [[nippls]] functions directly.<br />
<br />
There is also a difference in the scores and loadings returned by the old version and the new (default) version. The old version (outputversion=2) keeps the variance in the loadings and normalizes the scores. The new version (outputversion=3) keeps the variance in the scores and normalizes the loadings. The older format follows the usage in the original algorithm publications. The newer format is used in order to maintain a standardized format across all PLS algorithms (including robust PLS and DSPLS).<br />
<br />
===Algorithm===<br />
<br />
Note that unlike previous versions of the PLS function, the default algorithm (see Options, above) is the faster SIMPLS algorithm. If the alternate NIPALS algorithm is to be used, the options.algorithm field should be set to 'nip'.<br />
<br />
Option 'robustpls' enables a robust method for Partial Least Squares Regression based on the SIMPLS algorithm. This uses the function 'rsimpls' from the well-known LIBRA Toolbox, developed by Mia Hubert's research group at the Katholieke Universiteit Leuven (kuleuven.be). The RSIMPLS method is described in: Hubert, M., and Vanden Branden, K. (2003), "Robust Methods for Partial Least Squares Regression", Journal of Chemometrics, 17, 537-549.<br />
<br />
====Studentized Residuals====<br />
From version 8.8 onwards, the Studentized Residuals shown for PLS Scores Plot are now calculated for calibration samples as:<br />
MSE = sum((res).^2)./(m-ncomp);<br />
syres = res./sqrt(MSE.*(1-L));<br />
where res = y residual, m = number of samples, ncomp = number of LV components and L = sample leverage.<br />
This represents a constant multiplier change from how Studentized Residuals were previously calculated.<br />
For test datasets the semi-Studentized residuals are calculated as:<br />
MSE = sum((res).^2)./(m-ncomp);<br />
syres = pres./sqrt(MSE);<br />
This represents a constant multiplier change from how the semi-Studentized Residuals were previously calculated.<br />
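<br />
The two calculations above can be sketched in base MATLAB as follows. This is only an illustration: the residuals and leverages are made-up numbers rather than toolbox output, and <tt>pres</tt> is taken here to be the y residuals of the test samples, which is an assumption because the text does not define it.<br />
<pre>
% Illustrative values only; in practice these come from a PLS model.
res   = [0.2; -0.1; 0.3; -0.4; 0.1; -0.1];     % calibration y residuals
L     = [0.30; 0.20; 0.40; 0.50; 0.25; 0.35];  % calibration sample leverages
m     = numel(res);                            % number of calibration samples
ncomp = 2;                                     % number of LV components

MSE   = sum((res).^2)./(m-ncomp);              % mean squared error of calibration
syres = res./sqrt(MSE.*(1-L));                 % Studentized residuals (calibration)

pres       = [0.15; -0.25];                    % test-sample y residuals (assumed meaning of pres)
syres_test = pres./sqrt(MSE);                  % semi-Studentized residuals (test)
</pre>
Note that the test-set calculation reuses the calibration MSE and omits the leverage term, which is why those residuals are called semi-Studentized.<br />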
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[mlr]], [[modelstruct]], [[nippls]], [[pcr]], [[plsda]], [[preprocess]], [[ridge]], [[simpls]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pca&diff=10967Pca2020-01-03T21:44:00Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Perform principal components analysis.<br />
<br />
===Synopsis===<br />
<br />
<br />
:model = pca(x,ncomp,options); %identifies model (calibration step)<br />
:pred = pca(x,model,options); %projects a new X-block onto existing model<br />
:pca % Launches Analysis window with PCA selected<br />
<br />
Please note that the recommended way to build a PCA model from the command line is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Performs a principal component analysis decomposition of the input array data returning ncomp principal components. E.g. for an ''M'' by ''N'' matrix <tt>X</tt> the PCA model is <math>X = TP^T + E</math>, where the scores matrix '''T''' is ''M'' by ''K'', the loadings matrix '''P''' is ''N'' by ''K'', the residuals matrix '''E''' is ''M'' by ''N'', and ''K'' is the number of factors or principal components <tt>ncomp</tt>. The output <tt>model</tt> is a PCA model structure. This model can be applied to new data by passing the model structure to PCA along with new data <tt>x</tt> or by using [[pcapro]].<br />
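<br />
As a rough illustration of the decomposition described above, the following base-MATLAB sketch builds '''T''', '''P''' and '''E''' from a singular value decomposition. It is not the toolbox implementation; the random data, the mean-centering step and the variable names are assumptions made for the example.<br />
<pre>
% Sketch of X = T*P' + E using the SVD (not the pca function itself).
M = 20; N = 6; K = 2;                 % samples, variables, components
X = randn(M,N);                       % placeholder data
X = X - repmat(mean(X,1),M,1);        % mean-center (assumed preprocessing)

[U,S,V] = svd(X,'econ');
T = U(:,1:K)*S(1:K,1:K);              % scores,    M by K (carry the variance)
P = V(:,1:K);                         % loadings,  N by K (orthonormal columns)
E = X - T*P';                         % residuals, M by N
Q = sum(E.^2,2);                      % sum of squared residuals per sample
</pre>
Projecting new, identically preprocessed data <tt>Xnew</tt> onto the model amounts to <tt>Tnew = Xnew*P</tt>.<br />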
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (2-way array class "double" or "dataset"), and<br />
<br />
* '''ncomp''' = number of components to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''model''' = existing PCA model, onto which new data '''x''' is to be applied.<br />
<br />
* '''''options''''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
The output of PCA is a model structure with the following fields (see [[Standard Model Structure]] for additional information):<br />
<br />
* '''modeltype''': 'PCA',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for the input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array),<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
If the inputs are an ''M''<sub>new</sub> by ''N'' matrix newdata and a PCA model model, then PCA applies the model to the new data. Preprocessing included in model will be applied to newdata. The output pred is a structure, similar to model, that contains the new scores and other predictions for newdata.<br />
<br />
Note: Calling pca with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting.<br />
<br />
* '''outputversion''': [ 2 | {3} ], governs output format (discussed below),<br />
<br />
* '''algorithm''': [ {'svd'} | 'maf' | 'robustpca' ], algorithm for decomposition. Note that algorithm 'maf' ([[maxautofactors | Maximum Autocorrelation Factors]] for hyperspectral images) requires Eigenvector's MIA_Toolbox,<br />
<br />
* '''preprocessing''': {[]}, cell array containing a preprocessing structure (see PREPROCESS) defining preprocessing to use on the data (discussed below),<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''confidencelimit''': [ {0.95} ], confidence level for Q and T<sup>2</sup> limits. A value of zero (0) disables calculation of confidence limits.<br />
<br />
* '''roptions''': structure of options to pass to robpca (robust PCA engine from the Libra Toolbox).<br />
** '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpca'.<br />
** '''cutoff''': [] Similar to confidencelimit, this confidence level is used by the robust algorithm to indicate which sample(s) are considered outside the limits and, therefore, likely outliers. It does NOT indicate which samples were actually left out (see alpha above), but only those samples which appear to be more unusual. Default value is the same value as confidencelimit (if non-zero) or alpha (if confidencelimit is zero).<br />
<br />
The default options can be retrieved using: options = pca('options');.<br />
<br />
====OUTPUTVERSION====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[scores,loads,ssq,res,reslm,tsqlm,tsq] = pca(xblock1,2,options);<br />
<br />
where the outputs are<br />
<br />
* '''scores''' = x-block scores,<br />
<br />
* '''loads''' = x-block loadings<br />
<br />
* '''ssq''' = the sum of squares information, <br />
<br />
* '''res''' = the Q residuals,<br />
<br />
* '''reslm''' = the estimated 95% confidence limit line for Q residuals,<br />
<br />
* '''tsqlm''' = the estimated 95% confidence limit line for T<sup>2</sup>, and<br />
<br />
* '''tsq''' = the Hotelling's T<sup>2</sup> values.<br />
<br />
====PREPROCESSING====<br />
<br />
The preprocessing field can be empty [] (indicating that no preprocessing of the data should be used), or it can contain a preprocessing structure output from the PREPROCESS function. For example options.preprocessing = {preprocess('default', 'autoscale')}. This information is echoed in the output model in the model.detail.preprocessing field and is used when applying the PCA model to new data.<br />
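<br />
A minimal usage sketch, assuming PLS_Toolbox is on the MATLAB path and using random placeholder data, shows how the preprocessing option fits into the calibrate-then-apply workflow. Only calls that appear on this page are used; the data sizes and the choice of three components are arbitrary.<br />
<pre>
% Requires PLS_Toolbox; x and xnew are placeholder data.
x    = randn(30,8);                   % calibration X-block
xnew = randn(5,8);                    % new X-block with the same variables

options = pca('options');             % start from the default options
options.preprocessing = {preprocess('default','autoscale')};
options.plots   = 'none';             % suppress plotting
options.display = 'off';              % suppress command-window output

model = pca(x,3,options);             % calibrate a 3-component model
pred  = pca(xnew,model,options);      % apply the model, including its preprocessing
</pre>
Because the preprocessing is stored with the model, the new data are passed in raw and are autoscaled with the calibration statistics when the model is applied.<br />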
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[evolvfa]], [[ewfa]], [[explode]], [[parafac]], [[plotloads]], [[plotscores]], [[preprocess]], [[ssqtable]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Mcr&diff=10966Mcr2020-01-03T20:32:50Z<p>Lyle: </p>
<hr />
<div><br />
===Purpose===<br />
<br />
Multivariate curve resolution with constraints.<br />
<br />
===Synopsis===<br />
<br />
:model = mcr(x,ncomp,''options'') %calibrate <br />
:model = mcr(x,c0,''options'') %calibrate with explicit initial guess<br />
:pred = mcr(x,model,''options'') %predict<br />
:mcr % Launches an Analysis window with mcr as the selected method.<br />
<br />
Please note that this is not the recommended way to build a MCR model from the command line. The recommended way is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
MCR decomposes a matrix '''X''' as '''CS''' such that '''X''' = '''CS''' + '''E''' where '''E''' is minimized in a least squares sense. By default, this is done using the alternating least squares (ALS) algorithm. For details on the ALS algorithm and constraints available in MCR, see the [[als]] reference page.<br />
<br />
When called with new data and a model structure, MCR performs a prediction (applies the model to the new data) returning the projection of the new data onto the previously recovered loadings (i.e. estimated spectra).<br />
<br />
In addition to the constraints and options listed in [[als]], other pages which may be of interest include [[MCR Constraints]] which describes setting constraints in the [Analysis] interface, and [[MCR Contrast Constraint]] which discusses the contrast constraint option.<br />
<br />
====Inputs====<br />
* '''x''' = the matrix to be decomposed (size ''m'' by ''n'')<br />
* '''ncomp''' or '''c0''' or '''model''' :<br />
** '''ncomp''' = the number of components to extract<br />
** '''c0''' = the explicit initial guess. If c0 is size ''m'' by ''k'', where ''k'' is the number of factors, it is assumed to be the initial guess for '''C'''. If c0 is size ''k'' by ''n'', it is assumed to be the initial guess for '''S'''. If ''m''=''n'', c0 is assumed to be the initial guess for '''C'''. The optional input ''options'' is described below.<br />
** '''model''' = a previously calculated MCR model structure to apply to the data in input '''x'''.<br />
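<br />
The three calling forms for the second input can be sketched as follows, assuming PLS_Toolbox is available. The simulated non-negative data and all variable names are placeholders chosen for the illustration.<br />
<pre>
% Requires PLS_Toolbox; the simulated data are placeholders.
m = 50; n = 100; k = 3;
C = rand(m,k);  S = rand(k,n);        % non-negative factors used to simulate x
x = C*S + 0.01*rand(m,n);             % m by n matrix to be decomposed

options = mcr('options');
options.plots = 'none';

model1 = mcr(x,k,options);            % number of components, automatic initialization
c0     = rand(m,k);                   % m by k, so treated as an initial guess for C
model2 = mcr(x,c0,options);
s0     = rand(k,n);                   % k by n, so treated as an initial guess for S
model3 = mcr(x,s0,options);
pred   = mcr(x,model1,options);       % apply an existing model to data
</pre>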
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure containing the results of the analysis. The estimated contributions '''C''' are stored in model.loads{2} and the estimated spectra '''S''' in model.loads{1}. Sum-squared residuals for samples and variables can be found in model.ssqresiduals{1} and model.ssqresiduals{2}, respectively. See the chemometrics tutorial for more information on the MCR method and models. Note that the sum-squared captured table contains various statistics on the information captured by each component. Please see [[MCR and PARAFAC Variance Captured]] for details.<br />
<br />
===Options===<br />
<br />
* '''''options''''' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ] governs level of display to command window.<br />
<br />
* '''plots''': [ 'none' | {'final'} ] governs level of plotting.<br />
<br />
* '''waitbar''': [ 'off' | 'on' | {'auto'} ] governs use of waitbar,<br />
<br />
* '''preprocessing''': { [] } preprocessing to apply to x-block (see PREPROCESS).<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for X-blocks as well as the X-blocks dataset itself.<br />
<br />
* '''initmode''': [1 | 2] Mode of x for automatic initialization.<br />
<br />
* '''confidencelimit''': [{0.95}] Confidence level for Q limits. <br />
<br />
* '''alsoptions''': ['options'] options passed to ALS subroutine (see ALS).<br />
<br />
The default options can be retrieved using: options = mcr('options');.<br />
<br />
===See Also===<br />
<br />
[[als]], [[analysis]], [[evolvfa]], [[ewfa]], [[fasternnls]], [[fastnnls]], [[fastnnls_sel]], [[mlpca]], [[parafac]], [[parafac2]], [[plotloads]], [[preprocess]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pcr&diff=10965Pcr2020-01-03T20:21:15Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Principal Components Regression: multivariate inverse least squares regression.<br />
<br />
===Synopsis===<br />
<br />
:model = pcr(x,y,ncomp,''options'') %identifies model (calibration step)<br />
:pred = pcr(x,model,''options'') %applies model to a new X-block<br />
:valid = pcr(x,y,model,''options'') %applies model to a new X-block, with corresponding new Y values<br />
:pcr % Launches an Analysis window with PCR as the selected method.<br />
<br />
Please note that this is not the recommended way to build a PCR model from the command line. The recommended way is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PCR calculates a single principal components regression model using the given number of components <tt>ncomp</tt> to predict <tt>y</tt> from measurements <tt>x</tt>, OR applies an existing PCR model to a new set of data <tt>x</tt>.<br />
<br />
To make predictions, the inputs are <tt>x</tt> the new predictor x-block (2-way array class "double" or "dataset"), and <tt>model</tt> the PCR model. The output <tt>pred</tt> is a structure, similar to <tt>model</tt>, that contains scores, predictions, etc. for the new data.<br />
<br />
If new y-block measurements are also available for the new data, then the inputs are <tt>x</tt> the new x-block (2-way array class "double" or "dataset"), <tt>y</tt> the new y-block (2-way array class "double" or "dataset"), and <tt>model</tt> the PCR model to apply. The output <tt>valid</tt> is a structure, similar to <tt>model</tt>, that contains scores, predictions, and additional y-block statistics etc. for the new data.<br />
<br />
In prediction and validation modes, the same model structure is used but predictions are provided in the <tt>model.detail.pred</tt> field.<br />
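<br />
The calibration, prediction and validation modes described above correspond to the three calls in this sketch. It assumes PLS_Toolbox is installed; the random data, the three-component model and the variable names <tt>x2</tt> and <tt>y2</tt> are placeholders.<br />
<pre>
% Requires PLS_Toolbox; data are placeholders.
x  = randn(40,10);  y  = x(:,1:3)*[1;2;3]  + 0.1*randn(40,1);   % calibration set
x2 = randn(10,10);  y2 = x2(:,1:3)*[1;2;3] + 0.1*randn(10,1);   % new data

options = pcr('options');
options.plots = 'none';

model = pcr(x,y,3,options);           % calibration step
pred  = pcr(x2,model,options);        % prediction: new X-block only
valid = pcr(x2,y2,model,options);     % validation: new X-block with known Y values
</pre>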
<br />
Note: Calling '''pcr''' with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
<br />
====Inputs====<br />
<br />
* '''x''' = X-block data (2-way array or DataSet Object)<br />
* '''y''' = Y-block data (2-way array or DataSet Object)<br />
* '''ncomp''' = number of components to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''options''' discussed below<br />
<br />
====Outputs====<br />
<br />
The output is a standard model structure with the following fields (see [[Standard Model Structure]]):<br />
<br />
* '''modeltype''': 'PCR',<br />
* '''datasource''': structure array with information about input data,<br />
* '''date''': date of creation,<br />
* '''time''': time of creation,<br />
* '''info''': additional model information,<br />
* '''reg''': regression vector,<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
* '''pred''': 2 element cell array containing model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array), and the y-block predictions.<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
* '''description''': cell array with text description of model, and<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
<br />
* '''outputversion''': [ 2 | {3} ], governs output format (discussed below),<br />
<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively),<br />
<br />
* '''algorithm''': [ {'svd'} | 'robustpcr' | 'correlationpcr' | 'frpcr' ], governs which algorithm to use.<br />
** 'svd' = standard singular value decomposition algorithm. <br />
** 'robustpcr' = robust algorithm with automatic outlier detection. <br />
** 'correlationpcr' = standard PCR with re-ordering of factors in order of y-variance captured.<br />
** 'frpcr' = full-ratio PCR (a.k.a. optimized scaling) with automatic sample scale correction. Note that with FRPCR, models generally perform better without mean-centering on the x-block.<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
<br />
* '''confidencelimit''': [ {0.95} ], confidence level for Q and T<sup>2</sup> limits. A value of zero (0) disables calculation of confidence limits,<br />
<br />
* '''roptions''': structure of options to pass to '''rpcr''' (robust PCR engine from the Libra Toolbox). Only used when algorithm is 'robustpcr',<br />
** '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpcr'.<br />
** '''intadjust''': [ {0} ], if equal to one, the intercept adjustment for the LTS-regression will be calculated. See '''ltsregres''' for details (Libra Toolbox).<br />
<br />
The default options can be retrieved using: options = pcr('options');.<br />
<br />
====OUTPUTVERSION====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[b,ssq,t,p] = pcr(x,y,ncomp,''options'')<br />
<br />
where the outputs are<br />
<br />
* '''b''' = matrix of regression vectors or matrices for each number of principal components up to ncomp,<br />
<br />
* '''ssq''' = the sum of squares information, <br />
<br />
* '''t''' = x-block scores, and<br />
<br />
* '''p''' = x-block loadings.<br />
<br />
Note: The regression matrices are ordered in '''b''' such that each ''Ny'' (number of y-block variables) rows correspond to the regression matrix for that particular number of principal components.<br />
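<br />
One reading of the note above is that '''b''' has ''ncomp'' × ''Ny'' rows, stacked in blocks of ''Ny'' rows per number of principal components. Under that assumption, the regression matrix for a given number of components can be pulled out with plain indexing. The placeholder <tt>b</tt> below stands in for the real output, and the names <tt>Nx</tt>, <tt>rows</tt> and <tt>bk</tt> are hypothetical.<br />
<pre>
% Assumes b is (ncomp*Ny) by Nx, as returned with options.outputversion = 2.
Ny = 2; Nx = 5; ncomp = 3;
b  = randn(ncomp*Ny,Nx);              % placeholder standing in for the pcr output
k  = 2;                               % number of principal components of interest
rows = (k-1)*Ny + (1:Ny);             % rows 3 and 4 when k = 2 and Ny = 2
bk = b(rows,:);                       % Ny by Nx regression matrix for k components
</pre>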
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[frpcr]], [[mlr]], [[modelstruct]], [[pca]], [[pls]], [[preprocess]], [[ridge]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pls&diff=10964Pls2020-01-03T20:09:20Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Partial least squares regression for univariate or multivariate y-block.<br />
<br />
===Synopsis===<br />
<br />
:model = pls(x,y,ncomp,''options'') %identifies model (calibration step)<br />
:pred = pls(x,model,''options'') %makes predictions with a new X-block<br />
:valid = pls(x,y,model,''options'') %makes predictions with new X- & Y-block<br />
:pls % launches analysis window with PLS selected<br />
<br />
Please note that this is not the recommended way to build a PLS model from the command line. The recommended way is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
PLS calculates a single partial least squares regression model using the given number of components <tt>ncomp</tt> to predict a dependent variable <tt>y</tt> from a set of independent variables <tt>x</tt>.<br />
<br />
Alternatively, PLS can be used in 'prediction mode' to apply a previously built PLS model in <tt>model</tt> to an external set of test data in <tt>x</tt> (2-way array class "double" or "dataset"), in order to generate y-values for these data. <br />
<br />
Furthermore, if matching x-block and y-block measurements are available for an external test set, then PLS can be used in 'validation mode' to predict the y-values of the test data from the model <tt>model</tt> and <tt>x</tt>, and allow comparison of these predicted y-values to the known y-values <tt>y</tt>.<br />
<br />
Note: Calling pls with no inputs starts the graphical user interface (GUI) for this analysis method. <br />
<br />
====Inputs====<br />
<br />
* '''x''' = the independent variable (X-block) data (2-way array class "double" or class "dataset")<br />
* '''y''' = the dependent variable (Y-block) data (2-way array class "double" or class "dataset")<br />
* '''ncomp''' = the number of components to be calculated (positive integer scalar)<br />
<br />
====Outputs====<br />
<br />
* '''model''' = a standard model structure model with the following fields (see [[Standard Model Structure]]):<br />
** '''modeltype''': 'PLS',<br />
** '''datasource''': structure array with information about input data,<br />
** '''date''': date of creation,<br />
** '''time''': time of creation,<br />
** '''info''': additional model information,<br />
** '''reg''': regression vector,<br />
** '''loads''': cell array with model loadings for each mode/dimension,<br />
** '''pred''': 2 element cell array with<br />
*** model predictions for each input block (when options.blockdetails='standard' x-block predictions are not saved and this will be an empty array), and<br />
*** the y-block predictions.<br />
** '''wts''': double array with X-block weights,<br />
** '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
** '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
** '''description''': cell array with text description of model, and<br />
** '''detail''': sub-structure with additional model details and results.<br />
<br />
* '''pred''' a structure, similar to '''model''', that contains scores, predictions, etc. for the new data.<br />
<br />
* '''valid''' a structure, similar to '''model''', that contains scores, predictions, and additional y-block statistics, etc. for the new data.<br />
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting,<br />
* '''outputversion''': [ 2 | {3} ], governs output format (see below),<br />
* '''preprocessing''': {[] []}, two element cell array containing preprocessing structures (see PREPROCESS) defining preprocessing to use on the x- and y-blocks (first and second elements respectively)<br />
* '''algorithm''': [ 'nip' | {'sim'} | 'dspls' | 'robustpls' ], PLS algorithm to use: NIPALS, SIMPLS {default}, Direct Scores, or robust pls (with automatic outlier detection).<br />
* '''orthogonalize''': [ {'off'} | 'on' ] Orthogonalize model to condense y-block variance into first latent variable; 'on' = produce orthogonalized model. Regression vector and predictions are NOT changed by this option, only the loadings, weights, and scores. See [[orthogonalizepls]] for more information.<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ] level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* ‘Standard’ = the predictions and raw residuals for the X-block as well as the X-block itself are not stored in the model to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* ‘Compact’ = for this function, 'compact' is identical to 'standard'.<br />
:* 'All' = keep predictions, raw residuals for both X- & Y-blocks as well as the X- & Y-blocks themselves.<br />
*'''confidencelimit''': [ {0.95} ], confidence level for Q and T<sup>2</sup> limits. A value of zero (0) disables calculation of confidence limits,<br />
*'''weights''': [ {'none'} | 'hist' | 'custom' ] governs sample weighting. 'none' does no weighting. 'hist' performs histogram weighting in which large numbers of samples at individual y-values are down-weighted relative to small numbers of samples at other values. 'custom' uses the weighting specified in the weightsvect option.<br />
*'''weightsvect''': [ ] Used only with custom weights. The vector specified must be equal in length to the number of samples in the y block and each element is used as a weight for the corresponding sample. If empty, no sample weighting is done.<br />
* '''roptions''': structure of options to pass to rsimpls (robust PLS engine from the Libra Toolbox).<br />
** '''alpha''': [ {0.75} ], (1-alpha) measures the number of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified. These options are only used when algorithm is 'robustpls'.<br />
<br />
The default options can be retrieved using: options = pls('options');.<br />
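<br />
A short sketch of the usual pattern, assuming PLS_Toolbox is installed: retrieve the defaults, change only the fields of interest, and pass the structure to <tt>pls</tt>. The random data, the choice of NIPALS and autoscaling, and the 0.99 confidence level are arbitrary choices for illustration.<br />
<pre>
% Requires PLS_Toolbox; data are placeholders.
x = randn(40,10);
y = x(:,1:3)*[1;2;3] + 0.1*randn(40,1);

options = pls('options');             % default options structure
options.algorithm       = 'nip';      % NIPALS instead of the default SIMPLS
options.preprocessing   = {preprocess('default','autoscale') preprocess('default','autoscale')};
options.confidencelimit = 0.99;       % confidence level for the Q and T^2 limits
options.plots           = 'none';     % suppress plotting

model = pls(x,y,3,options);           % calibrate a 3-component model
</pre>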
<br />
====Outputversion====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[b,ssq,p,q,w,t,u,bin] = pls(x,y,ncomp,''options'')<br />
<br />
where the outputs are as defined for the [[nippls]] function. This is provided for backwards compatibility. It is recommended that users call the [[simpls]] or [[nippls]] functions directly.<br />
<br />
There is also a difference in the scores and loadings returned by the old version and the new (default) version. The old version (outputversion=2) keeps the variance in the loadings and normalizes the scores. The new version (outputversion=3) keeps the variance in the scores and normalizes the loadings. The older format follows the usage in the original algorithm publications. The newer format is used in order to maintain a standardized format across all PLS algorithms (including robust PLS and DSPLS).<br />
<br />
===Algorithm===<br />
<br />
Note that unlike previous versions of the PLS function, the default algorithm (see Options, above) is the faster SIMPLS algorithm. If the alternate NIPALS algorithm is to be used, the options.algorithm field should be set to 'nip'.<br />
<br />
Option 'robustpls' enables a robust method for Partial Least Squares Regression based on the SIMPLS algorithm. This uses the function 'rsimpls' from the well-known LIBRA Toolbox, developed by Mia Hubert's research group at the Katholieke Universiteit Leuven (kuleuven.be). The RSIMPLS method is described in: Hubert, M., and Vanden Branden, K. (2003), "Robust Methods for Partial Least Squares Regression", Journal of Chemometrics, 17, 537-549.<br />
<br />
====Studentized Residuals====<br />
From version 8.8 onwards, the Studentized Residuals shown in the PLS Scores Plot are calculated for calibration samples as:<br />
:MSE = sum((res).^2)./(m-ncomp);<br />
:syres = res./sqrt(MSE.*(1-L));<br />
where res = y residual, m = number of samples, ncomp = number of LV components, and L = sample leverage.<br />
This differs by a constant multiplier from how Studentized Residuals were previously calculated.<br />
For test datasets, the semi-Studentized residuals are calculated as:<br />
:MSE = sum((res).^2)./(m-ncomp);<br />
:syres = pres./sqrt(MSE);<br />
This differs by a constant multiplier from how the semi-Studentized Residuals were previously calculated.<br />
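<br />
The two formulas can be tried out in base MATLAB with a few made-up numbers. In this sketch the residual and leverage values are hypothetical, the name <tt>syres_test</tt> is introduced only for illustration, and the MSE from the calibration samples is assumed to be reused for the test samples, which the text above does not state explicitly.<br />
<br />
```matlab
% Hypothetical calibration results (illustrative values, not toolbox output)
res   = [0.12; -0.08; 0.05; -0.15; 0.06; 0.02];  % calibration y residuals
L     = [0.30;  0.25; 0.40;  0.35; 0.20; 0.50];  % calibration sample leverages
m     = numel(res);                              % number of samples
ncomp = 2;                                       % number of LV components

% Studentized residuals for the calibration samples
MSE   = sum((res).^2)./(m-ncomp);
syres = res./sqrt(MSE.*(1-L));

% Semi-Studentized residuals for test samples (no leverage term)
pres       = [0.10; -0.20];                      % hypothetical test-set y residuals
syres_test = pres./sqrt(MSE);                    % assumption: calibration MSE is reused

disp(syres');
disp(syres_test');
```
<br />
The leverage term (1-L) inflates the Studentized residual of high-leverage calibration samples; test samples are scaled by the MSE alone.<br />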
<br />
===See Also===<br />
<br />
[[analysis]], [[crossval]], [[mlr]], [[modelstruct]], [[nippls]], [[pcr]], [[plsda]], [[preprocess]], [[ridge]], [[simpls]], [[EVRIModel_Objects]]</div>Lylehttps://www.wiki.eigenvector.com/index.php?title=Pca&diff=10963Pca2020-01-03T20:08:00Z<p>Lyle: </p>
<hr />
<div>===Purpose===<br />
<br />
Perform principal components analysis.<br />
<br />
===Synopsis===<br />
<br />
<br />
:model = pca(x,ncomp,options); % identifies model (calibration step)<br />
:pred = pca(x,model,options); % projects a new X-block onto an existing model<br />
:pca % launches the Analysis window with PCA selected<br />
<br />
Please note that this is not the recommended way to build a PCA model from the command line. The recommended way is to use the Model Object. Please see [[EVRIModel_Objects | this wiki page on building models using the Model Object]]. <br />
<br />
===Description===<br />
<br />
Performs a principal component analysis decomposition of the input array <tt>x</tt>, returning <tt>ncomp</tt> principal components. For an ''M'' by ''N'' matrix <tt>X</tt>, the PCA model is <math>X = TP^T + E</math>, where the scores matrix '''T''' is ''M'' by ''K'', the loadings matrix '''P''' is ''N'' by ''K'', the residuals matrix '''E''' is ''M'' by ''N'', and ''K'' is the number of factors or principal components <tt>ncomp</tt>. The output <tt>model</tt> is a PCA model structure. This model can be applied to new data by passing the model structure to PCA along with new data <tt>x</tt> or by using [[pcapro]].<br />
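<br />
The decomposition itself can be illustrated in base MATLAB with a singular value decomposition. This is only a sketch of the model equation and not the toolbox implementation: the data are random, and mean-centering is an assumption made for this example (the <tt>pca</tt> function applies whatever preprocessing is set in its options).<br />
<br />
```matlab
rng(0);                            % reproducible random data
M = 20; N = 5; K = 2;              % samples, variables, components
X  = randn(M,N);
Xc = X - repmat(mean(X,1),M,1);    % mean-center each column (assumption of this sketch)

[U,S,V] = svd(Xc,'econ');          % economy-size SVD
T = U(:,1:K)*S(1:K,1:K);           % scores, M by K (carry the variance)
P = V(:,1:K);                      % loadings, N by K (orthonormal columns)
E = Xc - T*P';                     % residuals, M by N

disp(size(T)); disp(size(P)); disp(size(E));
disp(norm(P'*P - eye(K)));         % close to zero: loadings are orthonormal
```
<br />
With this convention the variance is kept in the scores and the loadings are normalized.<br />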
<br />
====Inputs====<br />
<br />
* '''x''' = X-block (2-way array of class "double" or "dataset"), and<br />
<br />
* '''ncomp''' = number of components to be calculated (positive integer scalar).<br />
<br />
====Optional Inputs====<br />
<br />
* '''model''' = existing PCA model, onto which new data '''x''' is to be projected.<br />
<br />
* '''''options''''' = discussed below.<br />
<br />
====Outputs====<br />
<br />
The output of PCA is a model structure with the following fields (see [[Standard Model Structure]] for additional information):<br />
<br />
* '''modeltype''': 'PCA',<br />
<br />
* '''datasource''': structure array with information about input data,<br />
<br />
* '''date''': date of creation,<br />
<br />
* '''time''': time of creation,<br />
<br />
* '''info''': additional model information,<br />
<br />
* '''loads''': cell array with model loadings for each mode/dimension,<br />
<br />
* '''pred''': cell array with model predictions for the input block (when options.blockdetails is 'standard' or 'compact', x-block predictions are not saved and this will be an empty array),<br />
<br />
* '''tsqs''': cell array with T<sup>2</sup> values for each mode,<br />
<br />
* '''ssqresiduals''': cell array with sum of squares residuals for each mode,<br />
<br />
* '''description''': cell array with text description of model, and<br />
<br />
* '''detail''': sub-structure with additional model details and results.<br />
<br />
If the inputs are an ''M''<sub>new</sub> by ''N'' matrix <tt>newdata</tt> and a PCA model <tt>model</tt>, then PCA applies the model to the new data. Preprocessing included in <tt>model</tt> will be applied to <tt>newdata</tt>. The output <tt>pred</tt> is a structure, similar to <tt>model</tt>, that contains the new scores and other predictions for <tt>newdata</tt>.<br />
<br />
Note: Calling pca with no inputs starts the graphical user interface (GUI) for this analysis method.<br />
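<br />
The calibration and projection steps described above might look as follows from the command line. This is a sketch with random, purely illustrative data; it requires PLS_Toolbox on the MATLAB path, and the assumption that index 1 of the <tt>tsqs</tt> and <tt>ssqresiduals</tt> cell arrays corresponds to the sample mode is not confirmed by the text above.<br />
<br />
```matlab
% Illustrative random data (hypothetical; replace with real X-blocks)
x       = randn(30,8);               % calibration data, 30 samples by 8 variables
newdata = randn(5,8);                % new data with the same 8 variables

options         = pca('options');    % default options structure
options.display = 'off';             % suppress command-window output
options.plots   = 'none';            % suppress plots

model = pca(x,3,options);            % calibrate a 3-component model
pred  = pca(newdata,model,options);  % apply the model to the new data

t2_new = pred.tsqs{1};               % assumed: T^2 values for the new samples
q_new  = pred.ssqresiduals{1};       % assumed: Q residuals for the new samples
```
<br />
Because the second input is a model rather than a number of components, the second call applies the existing model instead of building a new one.<br />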
<br />
===Options===<br />
<br />
''options'' = a structure array with the following fields:<br />
<br />
* '''display''': [ 'off' | {'on'} ], governs level of display to command window,<br />
<br />
* '''plots''': [ 'none' | {'final'} ], governs level of plotting.<br />
<br />
* '''outputversion''': [ 2 | {3} ], governs output format (discussed below),<br />
<br />
* '''algorithm''': [ {'svd'} | 'maf' | 'robustpca' ], algorithm for decomposition. Note that algorithm 'maf' ([[maxautofactors | Maximum Autocorrelation Factors]] for hyperspectral images) requires Eigenvector's MIA_Toolbox,<br />
<br />
* '''preprocessing''': {[]}, cell array containing a preprocessing structure (see PREPROCESS) defining preprocessing to use on the data (discussed below),<br />
<br />
* '''blockdetails''': [ 'compact' | {'standard'} | 'all' ], level of detail (predictions, raw residuals, and calibration data) included in the model.<br />
:* 'standard' = the predictions and raw residuals for the X-block, as well as the X-block itself, are not stored in the model, to reduce its size in memory. Specifically, these fields in the model object are left empty: 'model.pred{1}', 'model.detail.res{1}', 'model.detail.data{1}'.<br />
:* 'compact' = for this function, 'compact' is identical to 'standard'.<br />
:* 'all' = keep the predictions and raw residuals for the X-block as well as the X-block dataset itself.<br />
<br />
* '''confidencelimit''': [ {0.95} ], confidence level for Q and T<sup>2</sup> limits. A value of zero (0) disables calculation of confidence limits.<br />
<br />
* '''roptions''': structure of options to pass to robpca (robust PCA engine from the LIBRA Toolbox). These options are only used when algorithm is 'robustpca'.<br />
:* '''alpha''': [ {0.75} ], (1-alpha) measures the fraction of outliers the algorithm should resist. Any value between 0.5 and 1 may be specified.<br />
:* '''cutoff''': [], similar to confidencelimit, this confidence level is used by the robust algorithm to indicate which samples are considered outside the limits and, therefore, likely outliers. It does NOT indicate which samples were actually left out (see alpha above), but only those samples which appear to be more unusual. The default value is the same as confidencelimit (if non-zero) or alpha (if confidencelimit is zero).<br />
<br />
The default options can be retrieved using: options = pca('options');<br />
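<br />
A typical pattern is to retrieve the defaults, change only the fields of interest, and pass the structure to <tt>pca</tt>. The sketch below uses random, illustrative data and field values chosen only as examples; it requires PLS_Toolbox on the MATLAB path.<br />
<br />
```matlab
options = pca('options');            % start from the default options
options.display         = 'off';     % no command-window output
options.plots           = 'none';    % no plots
options.blockdetails    = 'all';     % keep predictions, residuals, and data in the model
options.confidencelimit = 0.99;      % example value: 99% limits for Q and T^2

x     = randn(30,8);                 % hypothetical calibration data
model = pca(x,3,options);

% Per the blockdetails description, these fields are filled only with 'all'
disp(isempty(model.pred{1}));
disp(isempty(model.detail.res{1}));
```
<br />
Fields that are not changed keep their default values, shown in braces in the list above.<br />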
<br />
====OUTPUTVERSION====<br />
<br />
By default (options.outputversion = 3) the output of the function is a standard model structure model. If options.outputversion = 2, the output format is:<br />
<br />
:[scores,loads,ssq,res,reslim,tsqlim,tsq] = pca(xblock1,2,options);<br />
<br />
where the outputs are<br />
<br />
* '''scores''' = x-block scores,<br />
<br />
* '''loads''' = x-block loadings,<br />
<br />
* '''ssq''' = the sum of squares information, <br />
<br />
* '''res''' = the Q residuals,<br />
<br />
* '''reslim''' = the estimated 95% confidence limit line for Q residuals,<br />
<br />
* '''tsqlim''' = the estimated 95% confidence limit line for T<sup>2</sup>, and<br />
<br />
* '''tsq''' = the Hotelling's T<sup>2</sup> values.<br />
<br />
====PREPROCESSING====<br />
<br />
The preprocessing field can be empty [] (indicating that no preprocessing of the data should be used), or it can contain a preprocessing structure output from the PREPROCESS function. For example: options.preprocessing = {preprocess('default', 'autoscale')};. This information is echoed in the output model in the model.detail.preprocessing field and is used when applying the PCA model to new data.<br />
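<br />
Put together, an autoscaled PCA model could be built and applied as sketched below. The data are random and purely illustrative, and PLS_Toolbox must be on the MATLAB path.<br />
<br />
```matlab
x       = randn(30,8);               % hypothetical calibration data
newdata = randn(5,8);                % hypothetical new data, same variables

options       = pca('options');
options.plots = 'none';
options.preprocessing = {preprocess('default','autoscale')};  % autoscale the X-block

model = pca(x,2,options);            % preprocessing is applied before decomposition
pp    = model.detail.preprocessing;  % preprocessing echoed in the model

pred  = pca(newdata,model,options);  % stored preprocessing is applied to newdata
```
<br />
Since the preprocessing is stored in the model, new data should be passed in unscaled.<br />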
<br />
===See Also===<br />
<br />
[[analysis]], [[browse]], [[evolvfa]], [[ewfa]], [[explode]], [[parafac]], [[plotloads]], [[plotscores]], [[preprocess]], [[ssqtable]], [[EVRIModel_Objects]]</div>Lyle