Umap: Difference between revisions

From Eigenvector Research Documentation Wiki

Revision as of 09:24, 9 September 2021

Purpose

Perform unsupervised Uniform Manifold Approximation and Projection (UMAP).

Synopsis

model = umap(x,options); %identifies model (calibration step)
pred = umap(x,model); %applies model to new data (validation step)
umap %Launches Analysis window with UMAP selected

Note that the recommended way to build and apply a UMAP model from the command line is to use the Model Object. See the wiki page on building and applying models using the Model Object.

Description

UMAP is one of many tools for visualizing high-dimensional data. Our software uses the Python implementation of the UMAP method from the umap-learn package, which is documented at https://umap-learn.readthedocs.io/en/latest/. UMAP models the input data as a "fuzzy" topological structure and then finds embeddings in a lower-dimensional space whose topological structure most closely resembles that of the original space. The dimensionality of the embedded space is set by the n_components option. For example, for an M by N matrix, if the dimension of the embedded space (n_components) is K, the embeddings will be of shape M by K.

Note: The PLS_Toolbox Python virtual environment must be configured in order to use this method. Find out more here: Python configuration.

This implementation performs ONLY unsupervised learning. Supervised UMAP learning will be released at a later time.

Inputs

  • x = X-block (2-way array class "double" or "dataset").

Optional Inputs

  • model = existing UMAP model, onto which new data x is to be applied.
  • options = discussed below.
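
The calibrate-then-apply workflow implied by these inputs can be sketched as follows. This is a minimal illustration only: the random matrices xcal and xnew are hypothetical stand-ins for real data, and the example assumes that new data must have the same number of variables (columns) as the calibration data. It requires PLS_Toolbox with the Python environment configured.

```matlab
% Hypothetical data: 100 calibration samples and 20 new samples, 10 variables each
xcal = rand(100, 10);
xnew = rand(20, 10);          % assumed: same number of columns as xcal

options = umap('options');    % default options structure
options.plots = 'none';       % suppress plotting for command-line use

model = umap(xcal, options);  % calibration step: identify the model
pred  = umap(xnew, model);    % validation step: apply the model to new data
```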

Outputs

The output of UMAP is a model structure with the following fields (see Standard Model Structure for additional information):

  • modeltype: 'UMAP',
  • datasource: structure array with information about input data,
  • date: date of creation,
  • time: time of creation,
  • info: additional model information,
  • description: cell array with text description of model, and
  • detail: sub-structure with additional model details and results.

Note: The embeddings of the UMAP model can be found under detail.umap.embeddings.
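
As a hedged sketch of how the embeddings might be pulled out of a calibrated model, the example below reads the detail.umap.embeddings field named in the note above and reports its size. The random matrix x is a hypothetical placeholder for real data.

```matlab
% Hypothetical M-by-N data with M = 50 samples and N = 8 variables
x = rand(50, 8);

options = umap('options');
options.plots = 'none';               % no figures, command-line use only
model = umap(x, options);

emb = model.detail.umap.embeddings;   % field named in the note above
[m, k] = size(emb);
fprintf('Embeddings: %d samples by %d components\n', m, k);
```

Following the Description, the embeddings should have one row per sample in x and one column per embedded dimension (n_components).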

Options

options = a structure array with the following fields:

  • display: [ 'off' | {'on'} ], governs level of display to command window,
  • plots: [ 'none' | {'final'} ], governs level of plotting.
  • warnings: [ {'off'} | 'on' ], Silence or display any potential Python warnings.
  • preprocessing: {[]}, cell array containing a preprocessing structure (see PREPROCESS) defining preprocessing to use on the data (discussed below),
  • n_neighbors: [ {'15'} ], Number of neighbors to consider. Controls the balance between local and global structure in the data.
  • min_dist: [ {'30'} ], Minimum distance from data points in the low dimensional representation. Low values result in more clustered/clumped embeddings while a larger value results in a more even dispersal of points. This parameter should be set relative to the spread parameter.
  • spread: [ {'1'} ], The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are. This parameter should be set relative to the min_dist parameter.
  • n_components: [ {'2'} ], The dimensionality of the reduced space.
  • metric: [ {'euclidean'} | 'manhattan' | 'cosine' | 'mahalanobis' ], The metric used to calculate distance between data samples.
  • random_state: [ {'1'} ], Random seed number. Set this to a number for reproducibility.
  • blockdetails: [ {'standard'} | 'all' ], Extent of predictions and raw residuals included in model. 'standard' = none, 'all' = x-block.
  • compression: [ {'none'} | 'pca' ], Type of data compression to perform on the x-block prior to calculating or applying the UMAP model. 'pca' uses a simple PCA model to compress the information.
  • compressncomp: [ {'2'} ], Number of latent variables (or principal components) to include in the compression model.
  • compressmd: [ {'yes'} | 'no' ], Use Mahalanobis distance corrected scores from the compression model.

The default options can be retrieved using: options = umap('options');.
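
A typical pattern is to start from the defaults and override only the fields of interest. The sketch below is illustrative rather than definitive: the data matrix x is hypothetical, and it assumes that numeric options such as n_neighbors accept plain numeric values, which this page does not state explicitly.

```matlab
x = rand(60, 20);              % hypothetical data, 60 samples by 20 variables

options = umap('options');     % start from the default options
options.display = 'off';       % no command-window output
options.plots   = 'none';      % no plots
options.metric  = 'cosine';    % one of the metrics listed above
options.n_neighbors = 10;      % assumption: numeric fields take plain numbers

model = umap(x, options);      % build the model with the modified options
```

Starting from umap('options') rather than building the structure by hand keeps every field the function expects, so only the changed values need to be reviewed.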

PREPROCESSING

The preprocessing field can be empty [] (indicating that no preprocessing of the data should be used), or it can contain a preprocessing structure output from the PREPROCESS function, for example options.preprocessing = {preprocess('default', 'autoscale')}. This information is echoed in the output model in the model.detail.preprocessing field.
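
The preprocessing setup described above can be sketched end to end as follows. The data matrix x is a hypothetical placeholder; the preprocess call and the model.detail.preprocessing field are taken from the text above.

```matlab
x = rand(40, 6);               % hypothetical data, 40 samples by 6 variables

options = umap('options');
options.plots = 'none';
% Autoscale the data before UMAP is calculated
options.preprocessing = {preprocess('default', 'autoscale')};

model = umap(x, options);

% The preprocessing used is echoed back in the model
pp = model.detail.preprocessing;
disp(class(pp));
```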

See Also

tsne, pca