DataSet Object Specifications

From Eigenvector Research Documentation Wiki
Revision as of 17:00, 3 December 2012 by imported>Scott (→‎Indexing into DataSet Objects)

The design of the DataSet Object included outside input from data analysts, users, instrument manufacturers, and software developers. Not all of the suggested properties and methods were included in the present version; however, many may be included in future versions. Users are encouraged to make suggestions for future versions. The following considerations were implemented in the present version:

1. DataSet objects contain single blocks i.e. single data arrays. The data arrays can be two-mode (two-way) or multi-mode (N-way). In this document the number of dimensions (i.e. the return value from the NDIMS function in MATLAB) will be referred to as the number of modes (Kiers, 2000). Thus, a two-way data matrix will consist of 2 modes. The first mode corresponds to rows and the second mode to columns (a third mode is often referred to as tubes or slabs).

DataSet objects can also contain cell arrays in the data field (in which case the .type field is 'batch'). The cell contents can be used, for example, for variable-length batch data. Note that this method of data storage does not support all functionality of the standard DataSet object.

2. There can be multiple sample and variable labels, class identifiers, and numerical scales for each mode (e.g. pinot noir, cabernet, … or spectra, process, etc.).

3. Labels, class identifiers, and numerical scales can be given names corresponding to each label set and mode.

4. Each mode can have multiple corresponding mode titles (e.g. wine or measurement). There are as many titles as modes and usually measurement units are included.

5. The DataSet object is open and editable; however, suggestions for changes and enhancements should be made to EVRI.

6. The DataSet object can be extracted so that all fields become standard MATLAB class objects (e.g. double, or cell) in the workspace.

Basic DataSet Object Field Descriptions

The following gives a general description of the fields in a DataSet object and is only meant to give an orientation to the different concepts involved in the DataSet object. See Section 4 to learn more about the MATLAB commands used to create a DataSet object.

Consider a table of data which records the amount of alcohol consumed in five different countries. In addition, the table includes a textual class indicating which continent each country is on and a "Sample Group" column which indicates some other, otherwise unspecified, sub-sampling of the five countries.

Country   Continent       Sample Group   Liquor   Wine   Beer
France    Europe          1              2.5      63.5   40.1
Italy     Europe          1              0.9      58     25.1
UK        Europe          2              1.5      12.2   100
U.S.A.    North America   2              2        8.9    87.8
Mexico    South America   3              0.8      0.2    50.4

Data

The Data field of a DataSet object contains the numerical component of a given data set. In this example, the three columns on the right are the actual numerical data (columns 4 through 6, rows 2 through 6) and would be stored in the data field. In nearly all cases, rows are observations or objects and columns are variables (one observes a variable for a number of objects). Note that although the column "Sample Group" contains numbers, it is contextual data (that is, data which gives context to the samples), not true numerical data. You can tell it is contextual because we could just as easily replace the numbers with descriptive strings ("Primary", "Secondary", etc.) without influencing the interpretation. In fact, using strings, as was done in the "Continent" column, would be more interpretable and probably preferable.
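As a minimal sketch, the numerical block of the table could be placed in a DataSet object as shown below. This assumes that the DataSet class (the dataset constructor distributed by EVRI) is on the MATLAB path and takes precedence over any other function named dataset; the variable names and the .name string are illustrative only.

```matlab
% Numerical part of the table: rows are countries, columns are beverages
data = [2.5 63.5  40.1;    % France
        0.9 58.0  25.1;    % Italy
        1.5 12.2 100.0;    % UK
        2.0  8.9  87.8;    % U.S.A.
        0.8  0.2  50.4];   % Mexico

x = dataset(data);                 % assumes the EVRI DataSet constructor
x.name = 'Alcohol consumption';    % illustrative name for the DataSet

sz = size(x.data);                 % 5 rows (samples) by 3 columns (variables)
```

The "Country", "Continent", and "Sample Group" columns are deliberately left out of the data field, since they are contextual information rather than measurements.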

Labels

A DataSet object also allows for Label sets to be associated with each mode (rows or columns, in this example). Labels are usually not used by algorithms and serve only to give contextual information on individual rows or columns. Here, the first column is a set of labels indicating the name of each country. This item would be stored as "Labels" for the rows of the DataSet object. The "Label Name" for this column would be "Countries".

Likewise, the first row (just above the numbers, comprising "Liquor", "Wine", and "Beer") contains labels for the columns of the DataSet object, which would be stored as Labels for columns. The Label Name for these labels would be "Alcohol Type" (or something similar). Note that you can have any number of Label "sets" for a given mode.
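The label assignments described above might look like the following sketch. It assumes the EVRI dataset constructor is on the path and that labels are indexed as {mode, set} and accept character arrays with one row per item; the placeholder data and the second label set ("Country Code") are hypothetical and only illustrate multiple sets per mode.

```matlab
x = dataset(rand(5,3));    % placeholder data; assumes the EVRI DataSet constructor

% Mode 1 (rows), label set 1: one label per country
x.label{1,1}     = char('France','Italy','UK','U.S.A.','Mexico');
x.labelname{1,1} = 'Countries';

% Mode 2 (columns), label set 1: one label per variable
x.label{2,1}     = char('Liquor','Wine','Beer');
x.labelname{2,1} = 'Alcohol Type';

% Mode 1, label set 2: a second, hypothetical set of row labels
x.label{1,2}     = char('FR','IT','GB','US','MX');
x.labelname{1,2} = 'Country Code';
```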

Classes

Classes, like labels, are contextual information for the data and give information on similarities between different rows or columns. Although the second column (titled here as "Continent") could be stored as a second set of row labels, it is more appropriate to use this particular column as a Class set. Classes are very similar to labels with the notable difference that, because classes usually describe how the data can be split into sub-groups, a given string class is often used for more than one sample (note that "Europe" appears for each of the first three countries). Compare this to labels which usually give unique information for a given item. Classes also provide an easy way to select and/or modify the group of rows or columns as a whole.

Another difference between classes and labels is that classes can be referred to using either strings or numerical values. If you assign classes using numerical values, strings will automatically be created to describe the different class groups. See the description of the .class and .classid fields in the next section for more information. For example, in this table, the "Sample Group" column would be most appropriate as a class set for rows and could be assigned as a numerical class.

As with labels, classes can also have a "class name" which is a general descriptor for the class. In this example, it would be appropriate to give a class name of "Continent" for the first class set and "Sample Group" for the second class set. Note that, although it is possible to do, there are no classes given for the columns in this example.
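A sketch of the two class sets for this example follows. It assumes the EVRI dataset constructor is on the path, that classes are indexed as {mode, set}, and that string classes can be assigned through .classid while numerical classes are assigned through .class; the data are placeholders.

```matlab
x = dataset(rand(5,3));    % placeholder data; assumes the EVRI DataSet constructor

% Class set 1 for rows, assigned by string (one entry per country)
x.classid{1,1}   = {'Europe','Europe','Europe','North America','South America'};
x.classname{1,1} = 'Continent';

% Class set 2 for rows, assigned numerically
x.class{1,2}     = [1 1 2 2 3];
x.classname{1,2} = 'Sample Group';

% Either representation can be read back for either set
continentCodes = x.class{1,1};     % numerical codes for the string classes
groupStrings   = x.classid{1,2};   % strings created for the numerical classes
```

Because three samples share the class "Europe", the class set can later be used to select or modify that group of rows as a whole.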

Axisscales

Another type of contextual information which can be stored in a DataSet object is an axis scale. Axisscales are used when samples or variables have a natural order and numerical relationship. Although this example does not contain such ordered data, an example would be when measuring something as a function of time. Each row (observation) would have a time-stamp associated with it. These values would be stored in the axisscale field (note: because the field name does not have a space in it, this document will refer to axis scales without the space, "axisscale").

In addition to axisscale, there is also a DataSet object field named axistype which allows the categorization of the relationship between adjacent items in a given mode. Often used in conjunction with axisscale, axistype specifies whether consecutive items (e.g. columns) should be considered "discrete", unrelated items; "stick" items which are unrelated but have reference to zero in the y-scale; "continuous" items which are generally accepted as individual points on a continuous surface or line; or "none" which indicates no relationship has been established. Some plotting commands will use the axistype information to determine how plots of the given data should be generated.
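As an illustration, a hypothetical time series with six time points could carry its time stamps and axis types as sketched below. The data, the five-minute sampling interval, and the axisscale name are invented for the example; the sketch assumes the EVRI dataset constructor is on the path and that these fields are indexed by mode.

```matlab
x = dataset(rand(6,4));    % 6 time points by 4 variables (placeholder data)

% Mode 1 (rows): time stamps, one per observation
x.axisscale{1}     = 0:5:25;          % 0, 5, 10, 15, 20, 25
x.axisscalename{1} = 'Time (min)';    % illustrative name
x.axistype{1}      = 'continuous';    % rows are points along a continuous axis

% Mode 2 (columns): the variables are unrelated to one another
x.axistype{2}      = 'discrete';
```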

Include

One of the key features of the DataSet object is the ability to "soft-delete" an item. This is accomplished using the Include field of the DataSet object. The include field simply lists all the items which should be considered when working with the given DataSet object. In this example, we might want to ignore the non-European countries, which we would do by setting the include field of the DataSet object to include only the first three rows (ignoring rows 4 and 5, U.S.A. and Mexico).
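The soft-delete of the two non-European rows can be sketched as follows, assuming the EVRI dataset constructor is on the path and that .include is indexed by mode. The data are placeholders.

```matlab
x = dataset(rand(5,3));        % placeholder data; assumes the EVRI DataSet constructor

x.include{1} = 1:3;            % keep rows 1-3; rows 4 and 5 are soft-deleted

fullSize = size(x.data);       % the excluded rows are still stored
included = x.data(x.include{1}, :);   % only the rows currently included

x.include{1} = 1:5;            % restore all rows; no data was lost
```

The point of the design is that exclusion is reversible: the excluded rows remain in the object and can be brought back by resetting the include field.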

Title

In addition to the fields which include one entry for each row or column of the table, there is also a generic "title" field which includes a single description for the entire mode. This is often used to describe what the given mode is being used for, such as "Samples" (mode 1, i.e. rows, for the table above).
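A short sketch of mode titles, assuming the EVRI dataset constructor is on the path and that .title is indexed by mode; the title strings are illustrative.

```matlab
x = dataset(rand(5,3));     % placeholder data; assumes the EVRI DataSet constructor

x.title{1} = 'Samples';     % describes mode 1 (rows) as a whole
x.title{2} = 'Variables';   % describes mode 2 (columns) as a whole
```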

Introduction to Advanced DataSet Object Features

Indexing into DataSet Objects

Users familiar with MATLAB know that any subset of an n-dimensional matrix can be retrieved from the whole matrix using standard parentheses and indexing. DataSet objects are handled identically. For example, to obtain a DataSet object consisting of only the 3rd row of a two-way DataSet object, x, the following command would be used:

sub = x(3,:);

The returned variable will be a DataSet object with all the contextual data for the given subset of the original DataSet object.

In addition to this standard indexing, a special indexing is available which makes use of the labels defined for a DataSet. The DataSet name, followed by a period and the label of any row, column, or other n-dimensional item, will return that single item in a DataSet object. For example, given a DataSet which has a column with the label "sensor", the following would extract that column as a new DataSet object:

sub = x.sensor;

The only restrictions to this style of indexing are:

1) If the label contains any spaces or other MATLAB reserved characters (e.g. mathematical symbols such as plus or minus), you must enclose the label in parentheses and single quotes:

sub = x.('sensor number 2');

2) The label may not be the same as any standard DataSet object field or method.
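Both indexing styles can be tried together on the alcohol example. The sketch below assumes the EVRI dataset constructor is on the path and that labels accept character arrays; the data are placeholders. The label "U.S.A." contains periods, so it is written with the parenthesis-and-quote form from restriction 1.

```matlab
x = dataset(rand(5,3));    % placeholder data; assumes the EVRI DataSet constructor
x.label{1,1} = char('France','Italy','UK','U.S.A.','Mexico');
x.label{2,1} = char('Liquor','Wine','Beer');

% Standard indexing: the first three rows, returned as a DataSet object
sub = x(1:3,:);
subLabels = sub.label{1,1};    % row labels travel with the subset

% Label indexing: one column and one row, each returned as a DataSet object
wine = x.Wine;
usa  = x.('U.S.A.');
```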

Multiway Data

DataSet objects can contain data structures which are more complex than simple tables. One example mentioned above is multiway arrays (data which are most appropriately described using 3 or more modes). These cases are direct extensions of the two-way example given above. Each mode has its own labels, classes, and axisscales.
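For instance, a three-mode array might be set up as in the sketch below, which assumes the EVRI dataset constructor is on the path; the array size, labels, and title are invented for illustration.

```matlab
x = dataset(rand(4,3,2));     % 4 x 3 x 2 array: three modes (placeholder data)

nModes = ndims(x.data);       % number of modes

% The third mode gets its own contextual fields, just like rows and columns
x.label{3,1} = char('Slab A','Slab B');
x.title{3}   = 'Slabs';
```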

Image Data

Another example of complex data which can be stored in a DataSet object is Image data. By setting the type field of a DataSet object to be "image", one can store image data where each pixel (or voxel, in the case of volumetric 3D images) is an observation. Such DataSet objects store the data as "unfolded" (where all the spatial positions are stored in a single mode) and, thus, require information on how to arrange the observations back into the spatial image. This is accomplished with the imagemode and imagesize fields.
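The idea of unfolding can be illustrated with plain MATLAB arrays, independent of the DataSet object itself. In the sketch below, a hypothetical 4-by-5 pixel image with 3 channels is unfolded so that each pixel is one row, then folded back using the stored spatial size. The variable names are illustrative; the spatial size [nr nc] plays the role the text ascribes to the imagesize field, and the unfolded mode (mode 1 here) the role of imagemode.

```matlab
% Hypothetical 4-by-5 pixel image with 3 channels per pixel
img = rand(4,5,3);
[nr, nc, nv] = size(img);

% Unfold: every pixel becomes one row (observation), channels become columns
unfolded = reshape(img, nr*nc, nv);       % 20 x 3

% Refold: the spatial size is needed to restore the image
refolded = reshape(unfolded, [nr nc nv]);

same = isequal(img, refolded);            % true: no information was lost
```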

Batch Data

There is intermediate support for batch data which is described as a series of two-way tables (or higher dimensional, if desired) of different lengths. In this document, the features available for this type of data are referred to as type=batch DataSet objects.

Detailed Discussion of DataSet Object Properties (Fields)

The following is a list of all properties (fields) of the DataSet object. Details of these fields can be found in the entry: DataSet Object Fields

.name
.type
.author
.date
.moddate
.data
.size
.sizestr
.imagemode
.imagesize
.imagesizestr
.foldedsize
.foldedsizestr
.imagedata
.imagedataf
.label
.labelname
.axisscale
.axisscalename
.axistype
.imageaxisscale
.imageaxisscalename
.imageaxisscaletype
.title
.titlename
.class
.classid
.classlookup
.classname
.include
.uniqueid
.description
.history
.userdata