User Defined Preprocessing
Each preprocessing method available in the preprocess function is defined using a preprocessing structure. The standard methods are defined in the preprocatalog file, and additional user-defined methods can be defined in the preprouser file. The methods defined in these two files are available to all functions making use of the preprocess function. See the preprouser file for examples of defining and adding methods to the available methods catalog. The fields in a preprocessing structure include:
- description - textual description of the particular method.
- calibrate - cell containing the line(s) to execute during a calibration operation.
- apply - cell containing the line(s) to execute during an apply operation.
- undo - cell containing the line(s) to execute during an undo operation.
- out - cell used to hold calibration-phase results for use in apply or undo.
- settingsgui - function name to execute when the "Settings" button is pressed in the GUI.
- settingsonadd - Boolean: 1 = automatically bring up settings GUI when method is "added".
- usesdataset - Boolean: indicates if this method should be passed a DataSet Object (1) or a raw matrix (0).
- caloutputs - integer: number of expected items in field out after calibration has been performed.
- keyword - text string which users can pass to preprocess to obtain this structure.
- userdata - user-defined variable often used to store method options.
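As a minimal sketch of how these fields fit together, the following builds a complete structure for a hypothetical log transform. The method name, keyword, and offset value are illustrative assumptions and are not part of the standard catalog; the command strings use only base MATLAB functions.

```matlab
% Hypothetical user-defined method: log10 transform with a small offset.
% All names and values here are illustrative assumptions.
pp = struct();
pp.description   = 'Log10';
pp.calibrate     = { 'data = log10(data + userdata);' };
pp.apply         = { 'data = log10(data + userdata);' };
pp.undo          = { 'data = 10.^data - userdata;' };
pp.out           = {};      % nothing is stored during calibration
pp.settingsgui   = '';      % assumed: left empty when no settings GUI exists
pp.settingsonadd = 0;
pp.usesdataset   = 0;       % command strings receive a raw matrix
pp.caloutputs    = 0;       % out stays empty, so no calibration check is needed
pp.keyword       = 'log10';
pp.userdata      = 1e-6;    % offset added before taking the logarithm
```

Because calibrate and apply are identical and nothing is stored in out, caloutputs is 0 in this sketch.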
Detailed Field Descriptions
- description: Contains a short (1-2 word) description of the method. The value will be displayed in the GUI and can also be used as a string keyword (see also keyword) to refer to this method. For example:
pp.description = 'Mean Center';
- calibrate, apply, undo: Each contains a single cell of one or more command strings to be executed by preprocess when performing calibrate, apply, or undo operations. Calibrate actions are performed on the original calibration data, whereas apply actions are performed on new data. The undo action removes preprocessing from previously preprocessed data, although it may be undefined for certain methods. In that case, the undo field should be an empty cell.
The command strings should be valid MATLAB commands. Each command will be executed inside the preprocess environment in which the following variables are available:
- data: The actual data on which to operate and in which to return modified results. If the field usesdataset is 1 (one) then data will be a DataSet Object. Otherwise, data will be a standard MATLAB variable which has been extracted from the DataSet Object. In the latter case, the extracted data will not contain any excluded columns from the original DataSet Object. In addition, in calibrate mode rows excluded in the original DataSet Object will not be passed. All rows, excluded or not, are passed in apply and undo modes.
- If writing a DataSet Object-enabled preprocessing function, it is expected that the function will calibrate using only included rows but apply and undo the preprocessing to all rows.
- out: Contents of the preprocessing structure field out. Any changes will be stored in the preprocessing structure for use in subsequent apply and undo commands.
- userdata: Contents of the preprocessing structure field userdata. Any changes will be stored in the preprocessing structure for later retrieval.
- Read-only variables: Do not change the value of these variables.
- include: Contents of the original DataSet Object's include field.
- otherdata: Cell array of additional inputs to preprocess which follow the data. For example, these inputs are used by PLS_Toolbox regression functions to pass the y-block for use in methods which require that information.
- originaldata: Original DataSet Object unmodified by any preprocessing steps. For example, originaldata can be used to retrieve axis scale or class information even when usesdataset is 0 (zero) and, thus, data is the extracted data.
To ensure that all samples (rows) in the data have been appropriately preprocessed, an apply command is automatically performed following a calibrate call. Note that excluded variables are replaced with NaN.
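The variable handling described above can be imitated in base MATLAB. The sketch below is not the actual preprocess implementation; it only assumes that each command string is evaluated in a workspace holding data, out, and userdata, and that the method works on a raw matrix (usesdataset = 0). The names X, incl, and Xp are hypothetical.

```matlab
% Illustrative stand-in for the preprocess environment (not the real code).
pp.calibrate = { 'out{1} = mean(data,1);' };
pp.apply     = { 'data = data - repmat(out{1},size(data,1),1);' };
pp.out       = {};
pp.userdata  = [];

X    = [1 2; 3 4; 5 6; 100 200];   % hypothetical data, 4 samples
incl = [1 2 3];                    % row 4 is excluded from calibration

% Calibrate: only included rows are visible to the command strings.
data     = X(incl,:);
out      = pp.out;
userdata = pp.userdata;
for k = 1:numel(pp.calibrate)
    eval(pp.calibrate{k});
end
pp.out      = out;                 % changes to out are stored back
pp.userdata = userdata;            % changes to userdata are stored back

% Apply: all rows, excluded or not, are processed.
data = X;
for k = 1:numel(pp.apply)
    eval(pp.apply{k});
end
Xp = data;
```

The excluded fourth row does not influence the stored means, yet it is still centered during the apply step.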
- Examples:
- The following calibrate field performs mean-centering on data, returning both the mean-centered data as well as the mean values which are stored in out{1}:
pp.calibrate = { '[data,out{1}] = mncn(data);' };
The following apply and undo fields use the scale and rescale functions to apply and undo the previously determined mean values (stored by the calibrate operation in out{1}) with new data:
pp.apply = { 'data = scale(data,out{1});' };
pp.undo = { 'data = rescale(data,out{1});' };
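For readers without the toolbox at hand, the round trip that these three fields describe can be sketched in base MATLAB. The subtraction and addition below are assumed stand-ins for mncn, scale, and rescale when only mean values are supplied; the variable names are hypothetical.

```matlab
% Base-MATLAB stand-ins (assumed equivalents, for illustration only).
Xcal = [1 10; 3 30; 5 50];                    % hypothetical calibration data
Xnew = [2 20; 4 40];                          % hypothetical new data

mu    = mean(Xcal,1);                         % what calibrate would store in out{1}
Xnewp = Xnew  - repmat(mu,size(Xnew,1),1);    % apply: subtract the stored means
Xback = Xnewp + repmat(mu,size(Xnewp,1),1);   % undo: add the stored means back
```

The new data are centered with the calibration means, not their own, and undo recovers the original values.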
- out: Contains non-data outputs from calibrations. This output is usually information derived from the calibration data which is required to apply the preprocessing to new data. This field will be updated following a calibrate call, and the entire preprocessing structure will be output. For example, when autoscaling, the mean and standard deviation of all columns are stored as two entries in out{1} and out{2}. See the related field caloutputs. Usually, this field is empty in the default, uncalibrated structure.
- settingsgui: Contains the name of a graphical user interface (GUI) function which allows the user to set options for this method. The function is expected to take as its only input a standard preprocessing structure, from which it should take the current settings and output the same preprocessing structure modified to meet the user's specifications. Typically, these changes are made to the userdata field, then the commands in the calibrate, apply and undo fields will use those settings.
- The design of GUIs for selection of options is beyond the scope of this document and the user is directed to autoset.m and savgolset.m, both of which use GUIs to modify the userdata field of a preprocessing structure. Example:
pp.settingsgui = 'autoset';
- settingsonadd: Contains a Boolean (1=true, 0=false) value indicating if, when the user selects and adds the method in the GUI, the method's settingsgui should be automatically invoked. If a method usually requires the user to make some selection of options, settingsonadd=1 will guarantee that the user has had an opportunity to modify the options or at least choose the default settings.
- Example:
pp.settingsonadd = 1;
- usesdataset: Contains a Boolean (1=true, 0=false) value indicating if the method is capable of handling DataSet Objects or can only handle standard MATLAB data types (double, uint8, etc).
- 1 = function can handle DataSet Objects; preprocess will pass entire DataSet Object. It is the responsibility of the function(s) called by the method to appropriately handle the include field.
- 0 = function needs standard MATLAB data type; preprocess will extract data from the DataSet Object prior to calling this method and reinsert preprocessed data after the method. Although excluded columns are never extracted and excluded rows are not extracted when performing calibration operations, excluded rows are passed when performing apply and undo operations.
- Example:
pp.usesdataset = 0;
- caloutputs: Contains the number of values expected in field out if a calibrate operation has been performed, for functions which require a calibrate operation prior to an apply or undo. For example, in the case of mean centering, the mean values stored in the field out are required to apply or undo the operation. Initially, out is an empty cell ({}) but following the calibration, it becomes a single-item cell (length of one). By examining this length, preprocess can determine if a preprocessing structure contains calibration information. The caloutputs field, when greater than zero, indicates to preprocess that it should test the out field prior to attempting an apply or undo.
- Example:
- In the case of mean-centering, the length of out should be 1 (one) after calibration:
pp.caloutputs = 1;
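One way a caller might use caloutputs to decide whether a structure is ready for an apply or undo is sketched below. This is an assumed check for illustration, not the test that preprocess itself performs; iscalibrated is a hypothetical helper.

```matlab
% Hypothetical helper: true when out holds at least caloutputs items.
iscalibrated = @(s) s.caloutputs == 0 || numel(s.out) >= s.caloutputs;

pp.caloutputs = 1;
pp.out = {};                 % default, uncalibrated structure
before = iscalibrated(pp);   % calibration is still required

pp.out = { [0.5 1.2 3.4] };  % e.g. mean values stored by a calibrate call
after = iscalibrated(pp);    % structure now carries calibration information
```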
- keyword: Contains a string which can be used to retrieve the default preprocessing structure for this method. When retrieving a structure by keyword, preprocess ignores any spaces and is case-insensitive. The keyword (or even the description string) can be used in place of any preprocessing structure in calibrate and default calls to preprocess:
pp = preprocess('default','meancenter');
- Example:
pp.keyword = 'Mean Center';
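The matching rule (spaces ignored, case-insensitive) can be sketched as follows. The helper names are hypothetical, and the comparison is an assumption based on the behavior described above, not the lookup code used by preprocess.

```matlab
% Hypothetical helpers: normalize a keyword, then compare.
normkey = @(s) lower(strrep(s,' ',''));
samekey = @(a,b) strcmp(normkey(a),normkey(b));

m1 = samekey('meancenter','Mean Center');    % spaces and case differ
m2 = samekey('MEAN center','Mean Center');   % case differs
m3 = samekey('autoscale','Mean Center');     % different method
```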
- userdata: A user-defined field. This field is often used to hold options for the particular method. This field can also be updated following a calibrate operation.
- Example:
- In savgol several variables are defined with various method options, then they are assembled into userdata:
pp.userdata = [windowsize order derivative];
Examples
The preprocessing structure used for sample normalization is shown below. The calibrate and apply commands are identical and no information is stored during the calibration phase; thus, caloutputs is 0. The order of the normalization is set in userdata and is used in both the calibrate and apply steps.
pp.description = 'Normalize';
pp.calibrate = {'data = normaliz(data,0,userdata(1));'};
pp.apply = {'data = normaliz(data,0,userdata(1));'};
pp.undo = {};
pp.out = {};
pp.settingsgui = 'normset';
pp.settingsonadd = 0;
pp.usesdataset = 0;
pp.caloutputs = 0;
pp.keyword = 'Normalize';
pp.userdata = 2;
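To make the effect of the order setting concrete, the sketch below scales each row to unit p-norm in base MATLAB, with p taken from userdata. It is an assumed stand-in for what normaliz computes, not its implementation, and the sample values are hypothetical.

```matlab
userdata = 2;                          % norm order, as in the structure above
data = [3 4; 6 8; 1 0];                % hypothetical samples in rows

p     = userdata(1);
norms = sum(abs(data).^p,2).^(1/p);    % p-norm of each row
data  = bsxfun(@rdivide,data,norms);   % each row now has unit p-norm
```

Nothing computed here depends on other samples, which is why the same command can serve for both calibrate and apply.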
The preprocessing structure used for Savitzky-Golay smoothing and derivatives is shown below. In many ways, this structure is similar to the sample normalization structure, except that savgol takes a DataSet Object as input and, thus, usesdataset is set to 1. Also note that because of the various settings required by savgol, this method makes use of the settingsonadd feature to bring up the settings GUI as soon as the method is added.
pp.description = 'SG Smooth/Derivative';
pp.calibrate = {'data=savgol(data,userdata(1),userdata(2),userdata(3));'};
pp.apply = {'data=savgol(data,userdata(1),userdata(2),userdata(3));'};
pp.undo = {};
pp.out = {};
pp.settingsgui = 'savgolset';
pp.settingsonadd = 1;
pp.usesdataset = 1;
pp.caloutputs = 0;
pp.keyword = 'sg';
pp.userdata = [ 15 2 0 ];
The following example invokes multiplicative scatter correction (MSC) using the mean of the calibration data as the target spectrum. The calibrate cell here contains two separate operations; the first calculates the mean spectrum and the second performs the MSC. The third input to the MSC function is a flag indicating whether an offset should also be removed. This flag is stored in the userdata field so that the settingsgui (mscorrset) can change the value easily. Note that there is no undo defined for this function.
pp.description = 'MSC (mean)';
pp.calibrate = { 'out{1}=mean(data); data=mscorr(data,out{1},userdata);' };
pp.apply = { 'data = mscorr(data,out{1});' };
pp.undo = {};
pp.out = {};
pp.settingsgui = 'mscorrset';
pp.settingsonadd = 0;
pp.usesdataset = 0;
pp.caloutputs = 1;
pp.keyword = 'MSC (mean)';
pp.userdata = 1;
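The two operations in the calibrate cell can be illustrated in base MATLAB. The sketch assumes the usual MSC formulation, in which each spectrum is regressed against the reference and then corrected by the fitted offset and slope; it stands in for mscorr and may differ from it in detail. The spectrum values are hypothetical.

```matlab
s    = [1 2 4 3 1];                         % hypothetical base spectrum
data = [2*s + 1; 0.5*s - 0.2; 1.5*s + 3];   % scaled and offset copies

ref = mean(data,1);                         % operation 1: target spectrum
for i = 1:size(data,1)                      % operation 2: correct each row
    c = polyfit(ref,data(i,:),1);           % c(1) = slope, c(2) = offset
    data(i,:) = (data(i,:) - c(2)) / c(1);  % remove offset, then rescale
end
```

Because every row in this toy data set is an exact scaled and shifted copy of the same spectrum, all corrected rows collapse onto the reference. Storing the reference in out{1} is what allows the apply step to correct new data against the same target.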
Summary
The Preprocessing Structure provides a generic framework within which a user can organize and automate preprocessing with PLS_Toolbox. Both MATLAB structure and cell data types are used extensively in the Preprocessing Structure. Any custom preprocessing you define yourself can be combined with the numerous methods provided by PLS_Toolbox in order to meet your analysis needs.