doi:10.1016/S0167-9473(02)00280-3
Copyright © 2002 Elsevier B.V. All rights reserved.
Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator
a Department of Mathematics, Pomona College, 610 N. College Ave., Claremont, CA 91711, USA
b Center for Image Processing and Integrated Computing, University of California at Davis, USA
Received 1 January 2002;
revised 1 August 2002.
Available online 24 October 2002.
Abstract
Mahalanobis-type distances in which the shape matrix is derived from a consistent high-breakdown robust multivariate location and scale estimator can be used to find outlying points. Hardin and Rocke (http://www.cipic.ucdavis.edu/~dmrocke/preprints.html) developed a new method for identifying outliers in a one-cluster setting using an F distribution. We extend the method to the multiple-cluster case, which yields a robust clustering method in conjunction with an outlier identification method. We provide results of the F distribution method for multiple clusters that have different sizes and shapes.
Author Keywords: Minimum covariance determinant; Robust clustering; Outlier detection
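To make the idea in the abstract concrete, the following is a minimal sketch of outlier flagging with robust Mahalanobis-type distances for a single cluster of two-dimensional points. It is not the authors' procedure: the MCD is found by brute-force search over all subsets of size h (feasible only for tiny samples), the raw MCD scatter is not rescaled for consistency, and a chi-square cutoff with 2 degrees of freedom stands in for the F-distribution cutoff of Hardin and Rocke. The function names, the toy data, and the choices h = 6 and alpha = 0.01 are all illustrative assumptions.

```python
from itertools import combinations
from math import log


def mean_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def exact_mcd(points, h):
    """Brute-force MCD: the h-subset whose covariance determinant is smallest."""
    best = None
    for subset in combinations(points, h):
        center, (sxx, sxy, syy) = mean_cov(subset)
        det = sxx * syy - sxy ** 2
        if best is None or det < best[0]:
            best = (det, center, (sxx, sxy, syy))
    return best[1], best[2]


def squared_mahalanobis(p, center, cov):
    """Squared distance of p from center, using the inverse of the 2x2 cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Toy data (assumed for illustration): eight clean points, then two outliers.
data = [
    (0.0, 0.1), (0.5, -0.3), (-0.4, 0.6), (1.0, 0.8),
    (-0.9, -0.5), (0.3, 1.1), (-0.2, -1.0), (0.8, -0.7),
    (8.0, 8.0), (9.0, 7.5),
]

h = 6          # subset size, here floor((n + p + 1) / 2) with n = 10, p = 2
alpha = 0.01   # nominal cutoff level

center, cov = exact_mcd(data, h)

# For 2 degrees of freedom the chi-square upper-alpha quantile is -2 * ln(alpha).
cutoff = -2.0 * log(alpha)

distances = [squared_mahalanobis(p, center, cov) for p in data]
flags = [d > cutoff for d in distances]

for p, d, f in zip(data, distances, flags):
    print(p, round(d, 2), "outlier" if f else "ok")
```

Because the scatter estimate here is uncorrected, the distances of clean points are inflated somewhat, so the chi-square cutoff is only a rough guide in this sketch; that mismatch between the nominal and actual distribution of robust distances is the kind of issue the F-distribution cutoff is meant to address.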
Table 1. Clean data (2 and 3 clusters)

Each entry represents the percent of simulated data that were misclassified according to a specified cutoff and percentage. The first column reports data that come from two populations, and the second column reports data that come from three populations. Balanced refers to data that consist of equal-sized clusters; results were similar for unbalanced clusters. These data were generated as clusters of multivariate normal data with no contamination. The analysis was done using the MCD estimates. Results were equivalent for cutoffs at the 5% and 0.1% levels.
Table 2. Contaminated data (2 clusters)

Each entry represents the percent of simulated data that were misclassified according to a specified cutoff and percentage. The first column of tables reports the type I error for the procedure, and the second column reports the type II error. Cluster refers to data that were contaminated by generating clusters of multivariate normal data with a cluster of contamination of size 20% of the smallest clean cluster. Radial refers to data that were contaminated by generating radial outliers of size 20% of the smallest clean cluster. Diffuse refers to data that were contaminated by generating diffuse outliers of size 20% of the smallest clean cluster. The analysis was done using the MCD estimates. Results were equivalent for cutoffs at the 5% and 0.1% levels.
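As a small illustration of how the two error percentages in such a table could be computed from one simulated data set, the sketch below assumes that type I error means the percent of clean points flagged as outliers and type II error means the percent of planted outliers left unflagged. The function name and the toy flags are hypothetical and are not taken from the article's simulations.

```python
def error_rates(flagged, is_outlier):
    """Return (type I %, type II %) under the assumed definitions.

    flagged[i]    -- True if the procedure declared point i an outlier
    is_outlier[i] -- True if point i was generated as contamination
    """
    clean_total = sum(1 for o in is_outlier if not o)
    outlier_total = sum(1 for o in is_outlier if o)
    # Clean points wrongly declared outliers.
    false_alarms = sum(1 for f, o in zip(flagged, is_outlier) if f and not o)
    # Planted outliers the procedure failed to flag.
    misses = sum(1 for f, o in zip(flagged, is_outlier) if o and not f)
    return 100.0 * false_alarms / clean_total, 100.0 * misses / outlier_total


# Hypothetical flags for ten points: six clean, four planted outliers.
flagged = [False, True, False, False, True, False, True, True, False, False]
is_outlier = [False, False, False, False, True, False, True, True, True, False]

type1, type2 = error_rates(flagged, is_outlier)
print(f"type I: {type1:.2f}%  type II: {type2:.2f}%")
```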