User Defined Preprocessing

From Eigenvector Research Documentation Wiki

Each preprocessing method available in the preprocess.m function is defined using a preprocessing structure. The standard methods are defined in the preprocatalog file, and additional user-defined methods can be defined in the preprouser.m file or in a MATLAB binary preprocatalog.mat file. The methods defined in these three files are available to all tools that make use of the preprocess function. See the preprouser file for examples of defining methods and adding them to the catalog of available methods. See also Adding Custom Methods to Solo (below).

Please note that the standard preprocessing methods cannot change the dimensions of the data being processed, nor its include field if the data is a DataSet Object. This is a PLS_Toolbox convention which custom user-defined preprocessing methods must also follow.

The fields in a preprocessing structure include:

  • description - textual description of the particular method.
  • calibrate - cell containing the line(s) to execute during a calibration operation.
  • apply - cell containing the line(s) to execute during an apply operation.
  • undo - cell containing the line(s) to execute during an undo operation.
  • out - cell used to hold calibration-phase results for use in apply or undo.
  • settingsgui - function name to execute when the "Settings" button is pressed in the GUI.
  • settingsonadd - Boolean: 1 = automatically bring up settings GUI when method is "added".
  • usesdataset - Boolean: indicates if this method should be passed a DataSet Object (1) or a raw matrix (0).
  • caloutputs - integer: number of expected items in field out after calibration has been performed.
  • keyword - text string which users can pass to preprocess to obtain this structure.
  • category - text string which associates this method with other similar methods (for display purposes only).
  • userdata - user-defined variable often used to store method options.
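
As a rough sketch of how these fields fit together, the following defines a hypothetical method using only base MATLAB functions. The method name, keyword, and category are made up for illustration, and the empty settingsgui string is an assumption about how a method without a settings GUI would be declared.

```matlab
% Hypothetical user-defined method: centers each column on its median.
% All names and values here are illustrative, not part of PLS_Toolbox.
pp = struct();
pp.description   = 'My Median Center';
pp.calibrate     = { 'out{1} = median(data,1); data = data - repmat(out{1},size(data,1),1);' };
pp.apply         = { 'data = data - repmat(out{1},size(data,1),1);' };
pp.undo          = { 'data = data + repmat(out{1},size(data,1),1);' };
pp.out           = {};      % filled during calibration
pp.settingsgui   = '';      % assumption: empty string means no settings GUI
pp.settingsonadd = 0;
pp.usesdataset   = 0;       % commands receive a plain numeric matrix
pp.caloutputs    = 1;       % one item (the column medians) expected in out
pp.keyword       = 'mymediancenter';
pp.category      = 'My Methods';
pp.userdata      = [];
```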

Detailed Field Descriptions

  • description: Contains a short (1-2 word) description of the method. The value will be displayed in the GUI and can also be used as a string keyword (see also keyword) to refer to this method. For example:
pp.description = 'Mean Center';
  • calibrate, apply, undo: Contain a single cell of one or more command strings to be executed by preprocess when performing calibration, apply or undo operations. Calibrate actions are done on original calibration data, whereas apply is done on new data. The undo action is used to remove preprocessing from previously preprocessed data, although it may be undefined for certain methods. If this is the case, the undo field should be an empty cell.
The command strings should be valid MATLAB commands. Each command will be executed inside the preprocess environment in which the following variables are available:
  • data: The actual data on which to operate and in which to return modified results. If the field usesdataset is 1 (one) then data will be a DataSet Object. Otherwise, data will be a standard MATLAB variable which has been extracted from the DataSet Object. In the latter case, the extracted data will not contain any excluded columns from the original DataSet Object. In addition, in calibrate mode rows excluded in the original DataSet Object will not be passed. All rows, excluded or not, are passed in apply and undo modes.
If writing a DataSet Object-enabled preprocessing function, it is expected that the function will calibrate using only included rows but apply and undo the preprocessing to all rows.
  • out: Contents of the preprocessing structure field out. Any changes will be stored in the preprocessing structure for use in subsequent apply and undo commands.
  • userdata: Contents of the preprocessing structure field userdata. Any changes will be stored in the preprocessing structure for later retrieval.
Read-only variables: Do not change the value of these variables.
  • include: Contents of the original DataSet Object's include field.
  • otherdata: Cell array of additional inputs to preprocess which follow the data. For example, these inputs are used by PLS_Toolbox regression functions to pass the y-block for use in methods which require that information.
  • originaldata: Original DataSet Object unmodified by any preprocessing steps. For example, originaldata can be used to retrieve axis scale or class information even when usesdataset is 0 (zero) and, thus, data is the extracted data.
To ensure that all samples (rows) in the data have been appropriately preprocessed, an apply command is automatically performed following a calibrate call. Note that excluded variables are replaced with NaN.
Examples:
The following calibrate field performs mean-centering on data, returning both the mean-centered data as well as the mean values which are stored in out{1}:
pp.calibrate   = { '[data,out{1}] = mncn(data);' };
The following apply and undo fields use the scale and rescale functions, respectively, to subtract the previously determined mean values (stored by the calibrate operation in out{1}) from new data and to add them back:
pp.apply       = { 'data = scale(data,out{1});' };
pp.undo        = { 'data = rescale(data,out{1});' };
  • out: Contains non-data outputs from calibrations. This output is usually information derived from the calibration data which is required to apply the preprocessing to new data. This field will be updated following a calibrate call, and the entire preprocessing structure will be output. For example, when autoscaling, the mean and standard deviation of all columns are stored as two entries in out{1} and out{2}. See the related field caloutputs. Usually, this field is empty in the default, uncalibrated structure.
  • settingsgui: Contains the name of a graphical user interface (GUI) function which allows the user to set options for this method. The function is expected to take as its only input a standard preprocessing structure, from which it should take the current settings and output the same preprocessing structure modified to meet the user's specifications. Typically, these changes are made to the userdata field, then the commands in the calibrate, apply and undo fields will use those settings.
The design of GUIs for selection of options is beyond the scope of this document and the user is directed to autoset.m and savgolset.m, both of which use GUIs to modify the userdata field of a preprocessing structure. Example:
pp.settingsgui   = 'autoset';
  • settingsonadd: Contains a Boolean (1=true, 0=false) value indicating if, when the user selects and adds the method in the GUI, the method's settingsgui should be automatically invoked. If a method usually requires the user to make some selection of options, settingsonadd=1 will guarantee that the user has had an opportunity to modify the options or at least choose the default settings.
Example:
pp.settingsonadd   = 1;
  • usesdataset: Contains a Boolean (1=true, 0=false) value indicating if the method is capable of handling DataSet Objects or can only handle standard MATLAB data types (double, uint8, etc).
1 = function can handle DataSet Objects; preprocess will pass entire DataSet Object. It is the responsibility of the function(s) called by the method to appropriately handle the include field.
0 = function needs standard MATLAB data type; preprocess will extract data from the DataSet Object prior to calling this method and reinsert preprocessed data after the method. Although excluded columns are never extracted and excluded rows are not extracted when performing calibration operations, excluded rows are passed when performing apply and undo operations.
Example:
pp.usesdataset   = 0;
  • caloutputs: Contains the number of values required in field out once a calibrate operation has been performed. This must be set for methods which require a calibrate operation prior to an apply or undo. For example, in the case of mean centering, the mean values stored in the field out are required to apply or undo the operation. Initially, out is an empty cell ({}), but following calibration it becomes a single-item cell (length of one). By examining this length, preprocess can determine whether a preprocessing structure contains calibration information. A caloutputs value greater than zero tells preprocess to test the out field before attempting an apply or undo. A value of zero also indicates that the method is row-wise only, meaning that the result of preprocessing any row is completely independent of the other rows. Some functions use this to determine that the corresponding preprocessing can be done in advance of iterative processing (e.g., cross-validation).
Example:
In the case of mean-centering, the length of out should be 1 (one) after calibration:
pp.caloutputs    = 1;
  • keyword: Contains a string which can be used to retrieve the default preprocessing structure for this method. When retrieving a structure by keyword, preprocess ignores any spaces and is case-insensitive. The keyword (or even the description string) can be used in place of any preprocessing structure in calibrate and default calls to preprocess:
pp = preprocess('default','meancenter');
Example:
pp.keyword     = 'Mean Center';
  • category: Contains a string which associates this preprocessing method with other similar methods in the list of preprocessing methods. It is used only to make the method easier for the user to locate. The string can be one of the standard categories ("Filtering", "Normalization", "Scaling and Centering", "Transformations", "Other") or any other string, which then defines a user-defined category.
Examples:
pp.category = 'Filtering';
pp.category = 'My Methods';
  • userdata: A user-defined field. This field is often used to hold options for the particular method. This field can also be updated following a calibrate operation.
Example:
For the savgol method, several variables holding the method options are defined and then assembled into userdata:
pp.userdata    = [windowsize order derivative];
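
To make the role of the data, out, and userdata variables more concrete, the sketch below imitates how command strings might be evaluated in a workspace where those variables exist. It is only an illustration built on eval with base MATLAB functions, not the actual preprocess implementation, and the mean-centering commands are written without mncn, scale, or rescale so that it runs without PLS_Toolbox.

```matlab
% Hypothetical mean-centering method written with base MATLAB only.
pp.calibrate = { 'out{1} = mean(data,1); data = data - repmat(out{1},size(data,1),1);' };
pp.apply     = { 'data = data - repmat(out{1},size(data,1),1);' };
pp.undo      = { 'data = data + repmat(out{1},size(data,1),1);' };
pp.out       = {};
pp.userdata  = [];

% Variables that the command strings expect to find in scope.
out      = pp.out;
userdata = pp.userdata;

% Calibrate: run each command on the calibration data, then keep out.
data = [1 2; 3 4; 5 6];            % 3 samples by 2 variables
for k = 1:numel(pp.calibrate)
    eval(pp.calibrate{k});
end
pp.out = out;                      % column means [3 4] are now stored
xcal   = data;

% Apply: reuse the stored means on new data.
data = [2 2];
for k = 1:numel(pp.apply)
    eval(pp.apply{k});
end
xnew = data;                       % [-1 -2]

% Undo: recover the original new data.
for k = 1:numel(pp.undo)
    eval(pp.undo{k});
end
xback = data;                      % [2 2]
```

Because the commands read and write variables by name, a command string that misspells data or out would silently create a new variable instead of modifying the intended one.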

Examples

The preprocessing structure used for sample normalization is shown below. The calibrate and apply commands are identical and there is no information that is stored during the calibration phase; thus, caloutputs is 0. The order of the normalization is set in userdata and is used in both calibrate and apply steps.

pp.description = 'Normalize';
pp.calibrate   = {'data = normaliz(data,0,userdata(1));'};
pp.apply       = {'data = normaliz(data,0,userdata(1));'};
pp.undo        = {};
pp.out         = {};
pp.settingsgui   = 'normset';
pp.settingsonadd = 0;
pp.usesdataset   = 0;
pp.caloutputs    = 0;
pp.keyword       = 'Normalize';
pp.userdata      = 2;
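
The reason caloutputs can be 0 for a method like this is that each row is processed independently of the others. The following sketch checks that property for a hypothetical row-normalization command written in base MATLAB; it stands in for normaliz, whose exact behavior is not reproduced here, and the eval calls only imitate how preprocess might run the command.

```matlab
% Hypothetical row-wise method: scale each row to unit p-norm,
% with the norm order p taken from userdata (2 here).
pp.calibrate  = { 'data = data ./ repmat(sum(abs(data).^userdata,2).^(1/userdata),1,size(data,2));' };
pp.apply      = pp.calibrate;      % identical commands, nothing to store
pp.out        = {};
pp.caloutputs = 0;
pp.userdata   = 2;

userdata = pp.userdata;
out      = pp.out;

x = [3 4; 6 8; 1 0];

% Process the whole matrix at once.
data = x;
eval(pp.apply{1});
whole = data;

% Process each row on its own and stack the results.
rowwise = zeros(size(x));
for i = 1:size(x,1)
    data = x(i,:);
    eval(pp.apply{1});
    rowwise(i,:) = data;
end

% True when both routes agree to within rounding error.
same = max(abs(whole(:) - rowwise(:))) < 1e-12;
```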

The preprocessing structure used for Savitzky-Golay smoothing and derivatives is shown below. In many ways, this structure is similar to the sample normalization structure, except that savgol takes a DataSet Object as input and, thus, usesdataset is set to 1. Also note that because of the various settings required by savgol, this method makes use of the settingsonadd feature to bring up the settings GUI as soon as the method is added.

pp.description   = 'SG Smooth/Derivative';
pp.calibrate     = {'data=savgol(data,userdata(1),userdata(2),userdata(3));'};
pp.apply         = {'data=savgol(data,userdata(1),userdata(2),userdata(3));'};
pp.undo          = {};
pp.out           = {};
pp.settingsgui   = 'savgolset';
pp.settingsonadd = 1;
pp.usesdataset   = 1;
pp.caloutputs    = 0;
pp.keyword       = 'sg';
pp.userdata      = [ 15 2 0 ];

The following example invokes multiplicative scatter correction (MSC) using the mean of the calibration data as the target spectrum. The calibrate cell here contains two separate operations; the first calculates the mean spectrum and the second performs the MSC. The third input to the MSC function is a flag indicating whether an offset should also be removed. This flag is stored in the userdata field so that the settingsgui (mscorrset) can change the value easily. Note that there is no undo defined for this function.

pp.description         = 'MSC (mean)';
pp.calibrate         = { 'out{1}=mean(data); data=mscorr(data,out{1},userdata);' };
pp.apply               = { 'data = mscorr(data,out{1},userdata);' };
pp.undo                = {};
pp.out                 = {};
pp.settingsgui         = 'mscorrset';
pp.settingsonadd         = 0;
pp.usesdataset           = 0;
pp.caloutputs            = 1;
pp.keyword               = 'MSC (mean)';
pp.userdata             = 1;
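
Since caloutputs is 1 for this method, the structure should not be applied until calibration has stored the target spectrum in out. A check of that kind could be sketched as follows. This is an assumption about the logic, based only on the description of caloutputs above, and is not taken from the preprocess source; the stored spectrum values are invented.

```matlab
% Sketch of a calibration check; not the actual preprocess code.
pp.caloutputs = 1;
pp.out        = {};

needscal    = pp.caloutputs > 0;
calibrated  = numel(pp.out) >= pp.caloutputs;
readybefore = ~needscal || calibrated;     % false: nothing stored yet

% Pretend that calibration has stored a mean spectrum in out{1}.
pp.out      = { [0.2 0.4 0.6] };
calibrated  = numel(pp.out) >= pp.caloutputs;
readyafter  = ~needscal || calibrated;     % true: safe to apply
```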

Adding Custom Methods to Solo

Custom preprocessing methods (created in PLS_Toolbox and MATLAB) can be added to the list of available methods in Solo. To do so, first create the methods using PLS_Toolbox and MATLAB. Next, save the preprocessing structures as individual variables in a single file named preprocatalog.mat. Finally, put this file into the main Solo application folder (either the location of the EXE or the top-level installation folder) or into Solo's current working directory at the time you are using Solo. All the methods found in the MAT file will be made available in the Preprocessing interface.

Note that methods added this way may NOT use settings GUIs other than those already available to existing preprocessing methods, because no m-code can be added to Solo.
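
The saving step might look like the sketch below. The two methods, their names, and their keywords are hypothetical, and the file is written to the temporary folder here only so that the example does not overwrite an existing catalog; in practice the file would be copied into one of the Solo folders described above.

```matlab
% Template holding every field of a preprocessing structure.
% The {{}} syntax makes struct() store an empty cell in the field.
base = struct('description','', 'calibrate',{{}}, 'apply',{{}}, 'undo',{{}}, ...
    'out',{{}}, 'settingsgui','', 'settingsonadd',0, 'usesdataset',0, ...
    'caloutputs',0, 'keyword','', 'category','My Methods', 'userdata',[]);

% Hypothetical method 1: base-10 logarithm, with an undo.
mylog             = base;
mylog.description = 'Log10 (custom)';
mylog.calibrate   = { 'data = log10(data);' };
mylog.apply       = mylog.calibrate;
mylog.undo        = { 'data = 10.^data;' };
mylog.keyword     = 'mylog10';

% Hypothetical method 2: square root, with an undo.
mysqrt             = base;
mysqrt.description = 'Square Root (custom)';
mysqrt.calibrate   = { 'data = sqrt(data);' };
mysqrt.apply       = mysqrt.calibrate;
mysqrt.undo        = { 'data = data.^2;' };
mysqrt.keyword     = 'mysqrt';

% Save each structure as its own variable in one MAT file.
catalogfile = fullfile(tempdir,'preprocatalog.mat');
save(catalogfile,'mylog','mysqrt');

% List the variables that ended up in the file.
saved = whos('-file',catalogfile);
names = sort({saved.name});
```

Neither method names a settings GUI, which keeps them within the restriction noted above.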

An alternative method for loading custom methods is to load them directly into the "Applied Methods" list by clicking on the "Load" button at the top of the list in the Preprocess window.

Summary

The Preprocessing Structure provides a generic framework within which a user can organize and automate preprocessing with PLS_Toolbox. Both MATLAB structure and cell data types are used extensively in the Preprocessing Structure. Any custom preprocessing you define yourself can be combined with the numerous methods provided by PLS_Toolbox in order to meet your analysis needs.