Usefulness of imputation for the analysis of incomplete otoneurologic data
Introduction
Missing values are common in medical data. One obvious source of incomplete information is everyday diagnostic work: some patients can be diagnosed quickly on the basis of symptoms, whereas difficult cases require all the available tests and measurements to reach a diagnosis. Some data are not recorded because they are considered unimportant, and data may also be lost through haste or omission. Incomplete data rarely hinder the diagnostic work-up, but missing values are often a major problem for statistical analysis, because standard multivariate analyses usually assume complete data.
Complete-case analysis [1] is a feasible solution when there are few missing values, but otherwise excluding incomplete cases may lead to a considerable loss of information and to biased estimates. Available-case analysis [1] uses all cases in which a given variable is present, but the analysis may be difficult because the number of available cases varies from variable to variable. Imputation [1], [2], [3] is a method in which complete data are created by filling in the missing values. The analysis of imputed data and the presentation of the results are easier than in available-case analysis. The most serious drawback of imputation is that it may distort the associations between variables.
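The three strategies described above can be contrasted on a toy data set. The following is a minimal sketch, not the procedure used in the study: the variable names, the values and the helper functions are hypothetical, and mean imputation stands in for imputation in general.

```python
from statistics import mean

# Hypothetical toy data: one dict per case, None marks a missing value.
VARIABLES = ["age", "vertigo_freq", "attack_duration"]
rows = [
    {"age": 50, "vertigo_freq": 2, "attack_duration": 4},
    {"age": 62, "vertigo_freq": None, "attack_duration": 1},
    {"age": None, "vertigo_freq": 3, "attack_duration": 2},
    {"age": 44, "vertigo_freq": 1, "attack_duration": None},
]


def complete_cases(data):
    """Complete-case analysis: keep only cases with no missing value."""
    return [r for r in data if all(r[v] is not None for v in VARIABLES)]


def available_counts(data):
    """Available-case analysis: the usable sample size differs per variable."""
    return {v: sum(r[v] is not None for r in data) for v in VARIABLES}


def mean_impute(data):
    """Mean imputation: fill each missing value with the variable's observed mean."""
    means = {v: mean(r[v] for r in data if r[v] is not None) for v in VARIABLES}
    return [
        {v: (r[v] if r[v] is not None else means[v]) for v in VARIABLES}
        for r in data
    ]


kept = complete_cases(rows)      # only the first case survives
counts = available_counts(rows)  # three observed values per variable
filled = mean_impute(rows)       # four complete cases
```

In this toy example, complete-case analysis discards three of the four cases, while imputation retains all of them at the cost of inserting values that were never observed.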
This study reports initial work on treating missing values in otoneurologic data with imputation methods so that multivariate statistical analysis becomes possible. The usefulness of imputation was evaluated on the basis of the level of agreement among the imputed values and the characteristics of the discriminant functions generated from the imputed data. The data were retrieved from the patient database of the otoneurologic expert system ONE [4], which is designed to help physicians diagnose diseases involving vertigo.
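The agreement level can be illustrated with a short sketch. It assumes that agreement means the proportion of originally missing cells for which two imputation methods produce the same value; the text here does not define the measure, so this definition, the function name and the toy data are all assumptions.

```python
def agreement(imputed_a, imputed_b, missing_cells):
    """Proportion of originally missing cells where two imputed data sets agree.

    imputed_a, imputed_b: lists of dicts (one per case) with no missing values.
    missing_cells: (case_index, variable) pairs that were missing before imputation.
    """
    if not missing_cells:
        raise ValueError("no missing cells to compare")
    same = sum(imputed_a[i][v] == imputed_b[i][v] for i, v in missing_cells)
    return same / len(missing_cells)


# Hypothetical output of two imputation methods on the same four cases.
missing = [(0, "x"), (1, "y"), (2, "x"), (3, "y")]
method_a = [{"x": 1, "y": 0}, {"x": 2, "y": 1}, {"x": 3, "y": 0}, {"x": 1, "y": 2}]
method_b = [{"x": 1, "y": 0}, {"x": 2, "y": 0}, {"x": 2, "y": 0}, {"x": 1, "y": 2}]

level = agreement(method_a, method_b, missing)
```

Only the cells that were originally missing are compared, because the observed cells are identical in both data sets by construction and would inflate the figure.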
Materials and methods
The database [4], [5], [6] was collected at the vestibular unit of the Helsinki University Central Hospital. Patients referred to the vestibular laboratory filled out a questionnaire concerning their symptoms, earlier diseases, accidents, and use of tobacco and alcohol [6]. The information was stored in the patient database of the expert system ONE. In this study we focused on the six largest patient groups with vertigo (N=564), these being vestibular schwannoma (N=128), benign
Results
The means, regression and EM methods agreed on 41–42% of the imputed values, and these methods produced the same values as the random imputation method for 20–22% of the missing values. Examination of the imputed values by variable showed that the level of agreement decreased as the value range of a variable increased. Not surprisingly, the imputed values of the 12 nominal variables, 11 of which were dichotomous, had the highest level of agreement. The ordinal variables had a moderate level of
Discussion
The agreement among the means, regression and EM imputation was moderate (41–42%). This is most likely due to the differences in the imputation methods. Means and regression imputation assume that the data are MCAR, while EM imputation generates unbiased data under the less restrictive MAR assumption. When the value domain of an attribute was small, the methods had high agreement. As the number of possible values increased, it was more likely that the imputed values differed. Expectedly, the
Acknowledgements
The corresponding author J. Laurikkala wishes to thank Tampere Graduate School in Information Science and Engineering (TISE) for funding this research.
References (12)
- et al. Database for vertigo. Otolaryngol. Head. Neck. Surg. (1995)
- et al. A genetic-based machine learning system to discover the diagnostic rules for female urinary incontinence. Comput. Programs Biomed. (1998)
- et al. Statistical Analysis With Missing Data (1987)
- SPSS Missing Value Analysis 7.5 (1997)
- Analysis of Incomplete Multivariate Data (1997)
- et al. Otoneurological expert system. Ann. Otol. Rhinol. Laryngol. (1996)