doi:10.1016/S0167-9473(02)00291-8
Copyright © 2003 Elsevier B.V. All rights reserved.
Using data images for outlier detection
David J. Marchette and Jeffrey L. Solka
Naval Surface Warfare Center, Code B10, 17320 Dahlgren Road, Dahlgren, VA 22448-5100, USA
Available online 14 November 2002.
Abstract
The data image has been proposed as a method for visualizing high-dimensional data. The idea is to map the data into an image, by using gray-scale (or color) values to indicate the magnitude of each variate. Thus, the image for a data set of size n and dimension d is a d×n image, where the columns correspond to observations and the rows to variates. We consider the application of this idea to the detection of outliers, providing a simple visualization technique that highlights outliers and clusters within the data.
Author Keywords: Data image; Color histogram; Interpoint distance matrix; Hierarchical clustering; Outlier detection
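The mapping described in the abstract can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `data_image`, the per-variate min-max scaling to gray levels, and the random test data are all assumptions made for the example.

```python
import numpy as np

def data_image(X):
    """Return a d x n array of gray levels in [0, 1] for an n x d data matrix X.

    Rows of the result correspond to variates and columns to observations.
    Scaling each variate to [0, 1] by its own min and max is an assumption
    of this sketch; other scalings are possible.
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # a constant variate maps to gray level 0
    return ((X - lo) / span).T

# Hypothetical data: n = 50 observations of dimension d = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
img = data_image(X)
# img can be displayed with any image viewer,
# e.g. matplotlib.pyplot.imshow(img, cmap="gray").
```

Each column of `img` is one observation, so an observation with unusual values shows up as a column whose gray levels differ from those of its neighbours.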
Fig. 1. Data image for the Setosa and Versicolor species of irises. The variables are: sepal width and length and petal width and length, as indicated by the abbreviations.
Fig. 2. Scatter plot of data lying on a line, with one outlier off the line.
Fig. 3. Data images of the interpoint distance matrix using Euclidean distance (left) and Mahalanobis distance (right) for the data depicted in Fig. 2. The dark “v” shapes in the lower left and upper right corners of the plots indicate potential outliers.
Fig. 4. Body and brain weight from [Rousseeuw and Leroy 1987], originally from [Jerison 1973] and [Weisberg 1980]. The five outliers are numbered in both plots, and correspond to: brachiosaurus, diplodocus, triceratops, Asian elephant and African elephant.
Fig. 5. Stackloss data from [Rousseeuw and Leroy 1987], originally [Brownlee 1965]. The three outliers detected in the data image are depicted with triangles in the pairs plot.
Fig. 6. Data image for the Mahalanobis distance matrix for the stackloss data of Fig. 5.
Fig. 7. Data image for the Mahalanobis distance matrix for the stackloss data of Fig. 5, where the covariance in the Mahalanobis calculation is constructed using observations 4–21.
Fig. 8. An artificial data set ([Rousseeuw and Leroy 1987]), originally from [Hawkins 1984].
Fig. 9. An artificial elliptical data set.
Fig. 10. Data images for the Euclidean (left) and Mahalanobis (right) distance matrices for the data in Fig. 9.
Fig. 11. Pairs plot and data image of the interpoint distance matrix for a five-dimensional data set. 100 observations were drawn uniformly on the five-dimensional sphere, and one observation (indicated by a “+” in the pairs plot) was placed at the origin.
Fig. 12. Artificial nose data, TCE present. The outliers are TCE in chloroform.
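Several of the captions above refer to data images of an interpoint distance matrix under Euclidean and Mahalanobis distance. The following Python sketch shows one way such matrices might be computed. It is an illustration under stated assumptions, not the authors' code: the function name `interpoint_distances`, the synthetic line-plus-outlier data (loosely modelled on the description of Fig. 2), and the use of the mean row distance as a summary are all hypothetical choices.

```python
import numpy as np

def interpoint_distances(X, cov=None):
    """Return the n x n matrix of pairwise distances between the rows of X.

    With cov=None the distance is Euclidean; otherwise it is the
    Mahalanobis distance defined by the given d x d covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]  # n x n x d array of differences
    if cov is None:
        sq = np.einsum("ijk,ijk->ij", diff, diff)
    else:
        inv = np.linalg.inv(cov)
        sq = np.einsum("ijk,kl,ijl->ij", diff, inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative rounding errors

# Hypothetical data: 30 points near a line plus one point off the line.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)
line = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=t.size)])
X = np.vstack([line, [0.5, 2.5]])  # the last row (index 30) is the outlier

D_euclid = interpoint_distances(X)
D_mahal = interpoint_distances(X, cov=np.cov(X, rowvar=False))

# One simple summary: the observation with the largest mean distance to the rest.
suspect = int(np.argmax(D_mahal.mean(axis=1)))
```

Either matrix can be rendered as a gray-scale image, in which an outlier appears as a row and column of uniformly large distances. The `cov` argument can also be estimated from a subset of the observations, as the caption of Fig. 7 describes for observations 4–21 of the stackloss data.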