Usefulness of imputation for the analysis of incomplete otoneurologic data

https://doi.org/10.1016/S1386-5056(00)00090-3

Abstract

The usefulness of imputation for treating missing values in an otoneurologic database before discriminant analysis was evaluated on the basis of the agreement of the imputed values and the analysis results. The data consisted of six patient groups with vertigo (N=564). There were 38 variables and 11% of the data was missing. Missing values were filled in with the means, regression and Expectation-Maximisation (EM) imputation methods, and a random imputation method provided the baseline results. The means, regression and EM methods agreed on 41–42% of the imputed missing values. The level of agreement between these and the random method was 20–22%. Despite the moderate agreement between the means, regression and EM methods, the discriminant functions were similar and accurate (prediction accuracy 83–99%). The discriminant functions obtained from the randomly imputed data were also accurate, with a prediction accuracy of 88–97%. Imputation seems to be a useful method for treating the missing data in this database. However, a lot of data was missing in the otoneurologic tests, which are likely to be of less importance in the diagnosis of vertiginous patients. Consequently, the disagreement of the methods did not clearly affect the discriminant analysis, and future research therefore requires more complete data and advanced imputation methods.

Introduction

Missing data values are common in medical data. One obvious source of incomplete information is everyday diagnostic work, where some patients may be quickly diagnosed on the basis of symptoms, while in difficult cases all the available tests and measurements are needed to reach the diagnosis. Some data are not recorded because they are considered unimportant. Data may also be lost because of haste or by omission. Incomplete data rarely hinders the diagnostic work-up, but missing data values are often a major problem for statistical analysis, because the standard multivariate analyses usually assume complete data.

Complete-case analysis [1] is a feasible solution when there are few missing values, but otherwise excluding incomplete cases may lead to a considerable loss of information and biased estimates. Available-case analysis [1] uses all cases where a variable is present, but the analysis may be difficult, since the number of available cases changes from variable to variable. Imputation [1], [2], [3] is a method in which complete data is created by filling in the missing data values. The analysis of imputed data and the presentation of the results are easier than in available-case analysis. The most serious drawback of imputation is the possible distortion of the associations between variables.
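The contrast between complete-case analysis and imputation can be illustrated with a minimal sketch. The toy table, the function names `complete_cases` and `mean_impute`, and the use of `None` as the missing-value marker are all assumptions made for illustration; the sketch handles numeric variables only and is not the implementation used in the study.

```python
from statistics import mean

# Hypothetical toy data: rows are patients, columns are numeric variables.
# None marks a missing value (an assumption of this sketch).
rows = [
    [1.0, 2.0, None],
    [2.0, None, 6.0],
    [3.0, 4.0, 9.0],
    [4.0, 6.0, 3.0],
]

def complete_cases(data):
    """Keep only the rows that have no missing values."""
    return [r for r in data if all(v is not None for v in r)]

def mean_impute(data):
    """Fill each missing value with the mean of the observed values in its column."""
    n_cols = len(data[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in data if r[j] is not None]
        col_means.append(mean(observed))
    # Build new rows so the original table is left unchanged.
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in data]

kept = complete_cases(rows)
filled = mean_impute(rows)
print(len(kept), "of", len(rows), "cases survive complete-case analysis")
print(filled)
```

In this toy table, two missing values are enough to discard half of the cases under complete-case analysis, whereas mean imputation keeps every case at the cost of replacing unknown values with a column-level summary.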

This study reports the initial work to treat missing values in otoneurologic data with imputation methods to allow multivariate statistical analysis. The usefulness of imputation was evaluated on the basis of the agreement level of the imputed values and the characteristics of the discriminant functions that were generated from the imputed data. The data was retrieved from the patient database of the otoneurologic expert system ONE [4] which is designed to help physicians in the diagnostics of diseases involving vertigo.
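One plausible way to compute an agreement level between two imputation methods is sketched below. It assumes that agreement means the share of originally missing cells for which both methods filled in exactly the same value; this preview does not give the exact definition used in the study, so the function `agreement` and the toy tables are hypothetical.

```python
def agreement(imputed_a, imputed_b, original):
    """Share of originally missing cells where two imputed tables hold equal values.

    Assumption of this sketch: missing cells are marked with None in `original`,
    and agreement is exact equality of the two imputed values.
    """
    total = 0
    same = 0
    for row_o, row_a, row_b in zip(original, imputed_a, imputed_b):
        for o, a, b in zip(row_o, row_a, row_b):
            if o is None:  # only cells that were actually imputed are counted
                total += 1
                if a == b:
                    same += 1
    return same / total if total else float("nan")

# Hypothetical toy example with four missing cells.
original = [[1, None], [None, 0], [None, None]]
method_a = [[1, 0], [2, 0], [1, 1]]
method_b = [[1, 0], [3, 0], [1, 0]]

level = agreement(method_a, method_b, original)
print(f"agreement: {level:.0%}")
```

Counting only the cells that were missing in the original data keeps the observed values, which are identical in every imputed table, from inflating the agreement level.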

Section snippets

Materials and methods

The database [4], [5], [6] was collected from the vestibular unit of the Helsinki University Central Hospital. The patients, referred to the vestibular laboratory, filled out a questionnaire concerning their symptoms, earlier diseases, accidents, the use of tobacco and alcohol [6]. The information was stored in the patient database of the expert system ONE. In this study we focused on the six largest patient groups with vertigo (N=564), these being vestibular schwannoma (N=128), benign

Results

The means, regression and EM methods agreed on 41–42% of the imputed values, and these methods produced the same values as the random imputation method for 20–22% of the missing values. Examination of the imputed values by variable showed that the level of agreement decreased as the value range of a variable increased. Not surprisingly, the imputed values of the 12 nominal variables, 11 of which were dichotomous, had the highest level of agreement. The ordinal variables had a moderate level of

Discussion

The agreement among the means, regression and EM imputation was moderate (41–42%). This is most likely due to the differences in the imputation methods. Means and regression imputation assume that the data are MCAR, while EM imputation generates unbiased data under the less restrictive MAR assumption. When the value domain of an attribute was small, the methods had high agreement. As the number of possible values increased, it was more likely that the imputed values differed. Expectedly, the

Acknowledgements

The corresponding author J. Laurikkala wishes to thank Tampere Graduate School in Information Science and Engineering (TISE) for funding this research.
