Faq why get missing data warning: Difference between revisions

From Eigenvector Research Documentation Wiki
imported>Lyle

Revision as of 13:16, 5 December 2018

Issue:

Why do I get the warning/notice "Missing Data Found - Replacing with "best guess" from existing model. Results may be affected by this action."

Possible Solutions:

The warning appears because your data contains NaN (Not a Number) values somewhere. NaN marks "missing data": data points for which you have no values. Certain preprocessing steps can sometimes introduce NaNs, but the most likely cause is that the data already had missing points when you imported it.
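To see where the missing values are before modeling, it can help to count the NaNs per sample and per variable. The following is a minimal sketch in Python with NumPy, not PLS_Toolbox code; the matrix X and its values are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical data: 4 samples (rows) x 3 variables (columns); NaN marks a missing value.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, np.nan, 12.0],
])

missing = np.isnan(X)                        # True wherever a value is missing
missing_per_sample = missing.sum(axis=1)     # NaN count in each row
missing_per_variable = missing.sum(axis=0)   # NaN count in each column

print("total missing:", int(missing.sum()))
print("missing per sample:", missing_per_sample.tolist())
print("missing per variable:", missing_per_variable.tolist())
```

Samples or variables with many missing values are the first candidates to inspect or exclude.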

The warning signals that the algorithm needs a value for every variable in every sample in order to build a model. To handle this, PLS_Toolbox uses a data imputation algorithm: it first estimates a value for each missing data point, builds a PCA model of all the data, and then uses that model to replace the missing data points again. This is repeated until the replaced values converge, that is, until they stop changing. The procedure is not perfect and can still lead to samples with high leverage or residuals (i.e. samples that are outliers), but if you have lots of missing data, it may be the only reasonable approach.
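The iterative idea described above can be sketched in a few lines. This is a generic illustration in Python with NumPy, assuming column means as the initial guess and an SVD-based PCA; it is not the PLS_Toolbox implementation, and the function name pca_impute, its parameters, and the example data are hypothetical.

```python
import numpy as np

def pca_impute(X, n_components=1, max_iter=500, tol=1e-8):
    """Fill NaNs in X by repeating a PCA reconstruction until the fills stop changing."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess (an assumption of this sketch): column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        # PCA of the currently filled, mean-centered data via SVD.
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mean
        # Largest change among the imputed cells in this pass.
        change = np.max(np.abs(recon[missing] - filled[missing])) if missing.any() else 0.0
        filled[missing] = recon[missing]   # only missing cells are overwritten
        if change < tol:
            break
    return filled

# Hypothetical example: exactly rank-one data with a single value removed.
X_true = np.outer([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
X = X_true.copy()
X[2, 1] = np.nan
X_filled = pca_impute(X, n_components=1)
print(X_filled[2, 1])   # should land close to the removed value, X_true[2, 1]
```

Real data is noisy and not exactly low rank, so the imputed values there are only estimates, and the number of components chosen affects them.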

If data is missing in only a couple of samples, you could exclude those samples and build a model from the remaining data. (You can later use the PLS_Toolbox "replace" function to estimate the missing values for the excluded samples using that model, and then rebuild the model with all the data. This may give a better estimate than the PCA imputation method gives.)
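The exclude-then-estimate workflow can be illustrated as follows. This is a sketch in Python with NumPy under stated assumptions: a PCA model is fit on the complete samples only, and each excluded sample's missing values are estimated from scores fitted by least squares to its observed variables. It does not reproduce how the PLS_Toolbox "replace" function works internally; fit_pca, estimate_missing, and the data are hypothetical.

```python
import numpy as np

def fit_pca(X_complete, n_components=1):
    """Return the column means and loadings (variables x components) of a PCA model."""
    mean = X_complete.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_complete - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def estimate_missing(x, mean, P):
    """Estimate the NaN entries of one sample x from an existing PCA model."""
    x = np.asarray(x, dtype=float).copy()
    obs = ~np.isnan(x)
    # Scores fitted by least squares using the observed variables only.
    t, *_ = np.linalg.lstsq(P[obs], x[obs] - mean[obs], rcond=None)
    x[~obs] = mean[~obs] + P[~obs] @ t
    return x

# Hypothetical rank-one data with one missing value in sample 2.
X = np.outer([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
X[2, 1] = np.nan

complete = ~np.isnan(X).any(axis=1)      # samples without any missing value
mean, P = fit_pca(X[complete])           # model built from the complete samples only
X_est = X.copy()
for i in np.where(~complete)[0]:
    X_est[i] = estimate_missing(X[i], mean, P)
print(X_est[2])
```

After this step, X_est contains no NaNs and the model can be rebuilt on all samples.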

If data is missing from a lot of samples, you don't have any other real option than imputation. There are, however, some algorithms which use weighting to ignore missing values. See, for example, the tucker and tld functions.
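To show what "using weighting to ignore missing values" can mean, here is a minimal sketch of a rank-one model fitted by weighted alternating least squares, where missing cells get weight zero and so never influence the fit. It is written in Python with NumPy and is only an illustration of the general idea; it is not how the tucker or tld functions are implemented, and weighted_rank1 and the data are hypothetical.

```python
import numpy as np

def weighted_rank1(X, n_iter=100):
    """Fit X ~ outer(t, p) using only the observed (non-NaN) cells."""
    X = np.asarray(X, dtype=float)
    W = (~np.isnan(X)).astype(float)      # weight 1 for observed, 0 for missing
    X0 = np.where(np.isnan(X), 0.0, X)    # placeholder zeros; they carry zero weight
    p = np.ones(X.shape[1])
    for _ in range(n_iter):
        # Each update is a weighted least-squares fit over observed cells only.
        # Assumes every row and every column has at least one observed value.
        t = (W * X0) @ p / (W @ p**2)
        p = (W * X0).T @ t / (W.T @ t**2)
    return t, p

# Hypothetical rank-one data with one missing value.
X = np.outer([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
X[2, 1] = np.nan
t, p = weighted_rank1(X)
recon = np.outer(t, p)
print(recon[2, 1])   # the model's prediction for the cell that was never used in the fit
```

No value is ever imputed here: the missing cell simply drops out of every sum, and the fitted model can still predict it afterwards.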


Still having problems? Please contact our helpdesk at helpdesk@eigenvector.com