
A Control Chart Based on Cluster-Regression Adjustment for Retrospective Monitoring of Individual Characteristics

Abstract

The tendency for experimental and industrial variables to include a certain proportion of outliers has become the rule rather than the exception. These clusters of outliers, if left undetected, can distort the mean vector and covariance matrix of the Hotelling's T2 multivariate control charts constructed to monitor individual quality characteristics. The effect of this distortion is that the resulting control chart becomes unreliable, as it exhibits masking and swamping: phenomena in which an out-of-control process is erroneously declared in-control, or an in-control process is erroneously declared out-of-control. To handle these problems, this article proposes a control chart based on cluster-regression adjustment for retrospective monitoring of individual quality characteristics in a multivariate setting. The performance of the proposed method is investigated through Monte Carlo simulation experiments and historical datasets. The results obtained indicate that the proposed method improves on state-of-the-art methods in terms of outlier detection as well as keeping the masking and swamping rates under control.

Introduction

Following Staudte and Sheather [1], an outlier can be defined as an observation that is far from the bulk of the data. Similarly, Hampel et al. [2] defined outliers as observations which deviate from the pattern set by the majority of the data. Since these outlying observations depart from the general trend dictated by the bulk of the data, their identification plays an important role in many practical applications. A dataset may contain outliers with abnormal values that are arbitrarily larger or smaller than the normal observations. This can lead to misleading analyses in areas such as regression analysis, analysis of variance, statistical process control (SPC) and profile monitoring. Therefore, outlier identification is an important task prior to data analysis, especially in monitoring the product quality of a production process. Historical data often suffer from transcription errors and problems in data quality, and these errors make historical data prone to outliers.

According to Montgomery et al. [3], outliers are identified as observations that produce residuals considerably larger in absolute value than the others, say three or four sigma from the mean. This idea can be applied in SPC. Consider a situation where the objective of a multivariate control chart is to detect the presence of assignable causes of variation in multivariate quality characteristics. In particular, for a retrospective phase I analysis of a historical dataset, the objective is twofold: (i) to identify shifts in the mean vector that might distort the estimation of the in-control mean vector and covariance matrix; and (ii) to identify and eliminate multivariate outliers. The purpose of seeking an in-control subset of the historical dataset is to estimate in-control parameters for use in a phase II analysis. To achieve these objectives, individual retrospective multivariate control charts are constructed to determine whether, in a multivariate sense, the already obtained, sequentially ordered data points $X = \{x_{ij}\}_{i=1,\ldots,n;\, j=1,\ldots,p} \subset \mathbb{R}^p$ are stable, that is, free of outliers, upsets or shifts. Fan et al. [4] refer to this kind of analysis as a retrospective phase I analysis of a historical data set (HDS). The sample mean vector and covariance matrix are estimated from X. From these estimates, the Hotelling's T2 chart is constructed and used to flag outliers and mean shifts in the process.

According to Jobe and Pokojovy [5], given the multivariate quality characteristic X, the Hotelling's T2 statistic computed for $\{x_{ij}\}_{i=1,\ldots,n;\, j=1,\ldots,p} \subset \mathbb{R}^p$ at time i is

(1) $T_i^2 = (x_i - m_x)'\,\hat{\Sigma}^{-1}(x_i - m_x)$

where $m_x$ is the sample mean given by

(2) $m_x = \frac{1}{n}\sum_{i=1}^{n} x_i$

and $\hat{\Sigma}$ is the sample covariance matrix given by

(3) $\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - m_x)(x_i - m_x)'$

If $T_i^2$ exceeds the upper control limit (UCL) defined as

(4) $\mathrm{UCL} = \frac{(n-1)^2}{n}\, B_{1-\alpha;\, p/2,\, (n-p-1)/2}$

where $B_{1-\alpha;\, p/2,\, (n-p-1)/2}$ is the $(1-\alpha)$ percentile of the beta distribution with parameters p/2 and (n − p − 1)/2, then $x_i$ is declared an outlier, or a special cause (out-of-control condition) is assumed to have occurred at sample number i or before it. Eq (4) assumes the $x_i$'s are independent and come from a multivariate normal distribution. However, the $T_i^2$ are correlated because they each depend on the same $m_x$ and $\hat{\Sigma}$ (Mason and Young [6]).
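For concreteness, the following is a minimal R sketch of this classical phase I chart; the helper name hotelling_t2 is ours, and the beta-percentile form of the UCL is the standard phase I limit for individual observations quoted in Eq (4).

    hotelling_t2 <- function(X, alpha = 0.05) {
      n <- nrow(X); p <- ncol(X)
      m_x   <- colMeans(X)                  # Eq (2): sample mean vector
      Sigma <- cov(X)                       # Eq (3): sample covariance matrix
      T2    <- mahalanobis(X, m_x, Sigma)   # Eq (1): squared distances T_i^2
      # Eq (4): beta-based UCL for a retrospective chart of individuals
      UCL <- ((n - 1)^2 / n) * qbeta(1 - alpha, p / 2, (n - p - 1) / 2)
      list(T2 = T2, UCL = UCL, flagged = which(T2 > UCL))
    }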

The classical methods for detecting mean shifts and outliers, such as the one described in Eq (1), are powerful when the dataset contains only one outlier. Nevertheless, the power of these techniques reduces significantly when more than one outlying observation is present in the dataset. This loss of power is typically due to what are referred to as the masking and swamping problems. Furthermore, these techniques do not always succeed in identifying mean shifts and spurious outliers because they are based on several assumptions which are rarely met. Thus, they are affected by the very observations that they are supposed to detect. Therefore, a technique devoid of these problems is desirable.

Let $D_i(m_x, \hat{\Sigma}) = f(x_i; m_x, \hat{\Sigma}),\ i = 1, \ldots, n$, be a suitable metric for measuring the distance between the ith observation $x_i$ and a location (mean) estimator $m_x$ in Eq (2), relative to a measure of dispersion (covariance matrix) $\hat{\Sigma}$ in Eq (3). Several forms of $m_x$ and $\hat{\Sigma}$ have been discussed in the literature (see, for instance, Rousseeuw [7], Rousseeuw and van Driessen [8], Billor et al. [9] and Hubert et al. [10]). The most frequently used form of $f(x_i; m_x, \hat{\Sigma})$ is defined in Eq (1). The classical alternatives for $m_x$ and $\hat{\Sigma}$ are defined in Eqs (2) and (3) respectively. Eq (1) can be referred to as the squared Mahalanobis distance; it will be denoted $T_i^2$ for simplicity.

A large value of $T_i^2$ is a likely indication that the corresponding observation is an outlier or, equivalently, a sign of an out-of-control process. However, two problems occur in practice. First, outliers or an out-of-control process may not necessarily produce large values of $T_i^2$. For instance, a small cluster of outliers will attract $m_x$ and will inflate $\hat{\Sigma}$ in its direction, leading to small values of $T_i^2$. This problem is referred to as the masking problem, since the presence of one outlier masks the appearance of another.

Secondly, not all observations with large $T_i^2$ values are necessarily outliers or evidence of an out-of-control process. For instance, a small cluster of outliers will attract $m_x$ and inflate $\hat{\Sigma}$ in its direction and away from some other observations which belong to the trend suggested by the bulk of the data, thereby yielding large $T_i^2$ values for these observations. This problem is referred to as the swamping problem.

Masking and swamping occur because $m_x$ and $\hat{\Sigma}$ are not robust. One way to remedy these problems is to use more robust estimators in place of the classical location and covariance matrix estimators in Eqs (2) and (3). The resulting $T_i^2$ could then be used to effectively detect outliers or sustained shifts in the mean vector of the quality characteristics.

Consider the minimum volume ellipsoid (MVE) of Rousseeuw [7], which covers at least half of the observations, to construct robust estimators for the location and dispersion in Eqs (2) and (3). The mean vector and covariance matrix of the observations included in the MVE are robust location and dispersion matrix estimators. The advantage of the MVE estimators for location and dispersion is that they have a high breakdown point (BDP) of approximately 50% (Lopuhaä and Rousseeuw [11]). They can also resist the presence of a substantial number of outliers in a dataset and can effectively detect sustained shifts in the mean vector of the quality characteristics. However, the MVE is computationally intensive, and it may not even be computationally feasible to implement the MVE algorithm estimators. As an illustration of the MVE's computational infeasibility, for an n × p data matrix X, if h is the integer part of (n + 1)/2, then the algorithm needs to compute the volumes of $\binom{n}{h} = \frac{n!}{h!(n-h)!}$ ellipsoids and to choose the ellipsoid with the minimum volume. This means that for n = 20 there are 184,756 such ellipsoids, for n = 25 there are 5,200,300 such ellipsoids, and for n = 30 there are over 155 million ellipsoids to compute before selecting the one with the minimum volume. This enormous computational demand of the MVE, which grows geometrically as the sample size n increases, is common to nearly all methods that implement elemental-set or resampling algorithms.
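To make the combinatorial growth concrete, the short R sketch below counts the candidate ellipsoids for the sample sizes quoted above; mve_subset_count is a hypothetical helper name.

    mve_subset_count <- function(n) {
      h <- floor((n + 1) / 2)   # h: integer part of (n + 1)/2
      choose(n, h)              # n! / (h! (n - h)!) candidate ellipsoids
    }
    sapply(c(20, 25, 30), mve_subset_count)
    # 184756  5200300  155117520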

Another robust estimator of the mean vector and covariance matrix that can accommodate up to 50% outliers in a dataset and also work well for Eqs (2) and (3) in computing $T_i^2$ is the minimum covariance determinant (MCD) estimator. Rousseeuw and van Driessen [8] proposed a fast algorithm for computing the MCD estimator, which they called FastMCD. The FastMCD corresponds to finding the h points for which the classical tolerance ellipsoid has minimum volume and then taking its center. Consider all $\binom{n}{h}$ subsets and compute the determinant of the covariance matrix for each subset. The subset with the smallest determinant is used to calculate the usual 1 × p mean vector m and the corresponding p × p covariance matrix $\hat{\Sigma}$. This estimation procedure describes the MCD algorithm. The MCD estimators for location and dispersion are actually maximum likelihood estimators when h observations are from a multivariate normal distribution while the other (n − h) observations are from a different, mean-shifted, multivariate normal distribution. Like the MVE, the MCD is equivariant and has a high breakdown value of 50%. Cator and Lopuhaä [12] examined the asymptotic expansion of the MCD estimator and showed that it is efficient and root-n consistent. Furthermore, Cator and Lopuhaä [13] derived the influence function for the MCD estimator. The MCD estimator is robust and equivariant and can resist the presence of a substantial number of outliers in a dataset as well as effectively detect sustained shifts in the mean vector of the quality characteristics. However, it often produces a biased estimate of the center (mean vector) and as a result underestimates the covariance matrix. According to Pison et al. [14], the MCD estimator is inconsistent and underestimates the volume of the covariance matrix, so that robust distances constructed from it are too large, resulting in too many observations being identified as outliers when they are originally inliers. This phenomenon is called the swamping effect. Swamping makes the control chart identify too many observations as outlying, thereby declaring an originally in-control process as out-of-control.
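As an illustration, the R snippet below builds robust distances from MCD estimates of location and scatter via cov.rob in the MASS package (the same function used for the RMVE/RMCD comparisons later in this article); the contamination pattern is an arbitrary example.

    library(MASS)
    set.seed(42)
    X <- rbind(matrix(rnorm(45 * 3), ncol = 3),           # in-control points
               matrix(rnorm(5 * 3, mean = 4), ncol = 3))  # small outlying cluster
    fit <- cov.rob(X, method = "mcd")           # MCD location and scatter
    T2  <- mahalanobis(X, fit$center, fit$cov)  # robust distances for charting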

Billor et al. [9] proposed a forward search algorithm called BACON with two versions of the initial subsample. The algorithm has two "initial starts": version one uses the non-robust but equivariant classical mean vector, while version two uses the robust but non-equivariant coordinatewise median. The BACON algorithm is easy to compute and implement, can resist the presence of a substantial number of outliers in a dataset, and can effectively detect sustained shifts in the mean vector of the quality characteristics. However, version one of BACON is equivariant but not robust; the subsequent iterations may therefore not be robust, since they depend on the robustness of the initial subset. Version two of BACON is robust but not equivariant, and hence the subsequent iterations may not be equivariant.

Hubert et al. [10] proposed the deterministic minimum covariance determinant (DetMCD) algorithm for estimating the location and dispersion matrix in Eqs (2) and (3) when datasets contain outliers and are presumed to exhibit drift or a mean shift in a process. The DetMCD uses six different initial estimators to select six initial subsets, each of size $h_0 = \lceil n/2 \rceil$. From the six initial subsets, a Mahalanobis-type distance measure $d_{i,l}$ (i = 1, ⋯, n; l = 1, ⋯, 6) is computed. Thereafter, the method selects, for each of the six initial estimates, the h observations $x_i$ with the smallest $d_{i,l}$ and applies the concentration steps (C-steps) until convergence.

According to Hubert et al. [10], the DetMCD estimator is highly robust and nearly equivariant. Thus, it can resist the presence of a substantial number of outliers in a dataset and can effectively detect sustained shifts in the mean vector of the quality characteristics. However, the computational requirement of DetMCD is considerable: each of the six initial estimates is computed independently and can be seen as a "stand-alone" estimator, and the accompanying C-steps require further computation to converge. Moreover, since the DetMCD uses the FastMCD objective, it inherits the drawbacks of FastMCD noted in Pison et al. [14]. Thus, it often produces a biased estimate of the center and as a result underestimates the dispersion matrix. It is inconsistent and underestimates the volume of the covariance matrix, so that robust distances constructed from it are too large, identifying too many observations as outliers when they are originally inliers and leading to a swamping effect.

Vargas [15] and Jensen et al. [16] studied and evaluated several retrospective multivariate control chart methods in which the classical mean vector and covariance matrix of Eqs (2) and (3) are replaced by robust estimation algorithms. Each of the charts was based on a T2 statistic calculated using a selected combination of robust mean vector and covariance matrix estimators. Jobe and Pokojovy [5] noted that most of the robust estimators considered (particularly the MVE and MCD) do not take time (i) into consideration. This is a drawback in relation to detecting certain outlier configurations such as sustained shifts in the mean vector (see Jobe and Pokojovy [5] for details). The inability of some robust methods to account for time i in constructing $T_i^2$ may be linked to their lack of equivariance. Methods that work well in this scenario, accounting for time, include those of Chenouri and Variyath [17], Chenouri et al. [18], Variyath and Vattathoor [19], Variyath and Vattathoor [20], Jobe and Pokojovy [5] and Fan et al. [4].

This article proposes a control chart method based on regression adjustment and a clustering algorithm for retrospective monitoring of individual characteristics. Since the proposed method blends addition-point regression with an agglomerative hierarchical clustering algorithm, we refer to it as the cluster-regression control chart (crcc for short). Two algorithms are presented for the proposed crcc methodology: one to compute the mean vector and covariance matrix in Eqs (2) and (3), and a second to compute the cluster-regression control chart statistic, denoted $T^2_{crcc,i}$. The notion behind the first algorithm is to replace Eqs (2) and (3) with robust estimates of the mean vector and covariance matrix that do not exhibit masking and swamping and can detect certain outlier configurations such as sustained shifts in the mean vector. To account for time i in the second algorithm, a projector referred to as the "anchor point matrix", following Johnson and Wichern [21] and Lawrence et al. [22], is used to trace the data trend through an addition-point least squares regression upon which the cluster distance is constructed. Thus $T^2_{crcc,i}$ is simultaneously able to detect outliers as well as shifts in the mean vector while keeping masking and swamping under control at a given time i.

The remaining part of this article is organized as follows: Section 2 describes the crcc methodology and provides algorithms for computing the $T^2_{crcc,i}$ control chart. Section 3 deals with the construction of the control limits for the proposed $T^2_{crcc,i}$, along with a performance evaluation through Monte Carlo simulation experiments. A numerical illustration with an artificial dataset generated through a Monte Carlo simulation experiment is closely followed by a real-life pulp fiber dataset analysis to implement the crcc algorithm in Section 4, while Section 5 concludes the article.

The crcc Methodology

The cluster-regression control chart is conceived on the notion that the natural trend of the multivariate X quality characteristics can be traced through a projector called the "anchor point matrix", which can be used to construct a cluster of h observations, where $h \geq \lfloor (n+p+1)/2 \rfloor$.

Let $m_x$ and $\hat{\Sigma}$ be the center and the deviation from the center of an ellipsoid formed from X, as defined in Eqs (2) and (3) respectively. Furthermore, let $\Omega \in \mathbb{R}^{(2p+1)\times p}$ be the {(2p + 1) × p} anchor-point matrix of the ellipsoid formed from $m_x$ and $\hat{\Sigma}$, whose rows are the ellipsoid center and the endpoints of its p principal axes:

(5) $\Omega = \begin{bmatrix} m_x \\ m_x + \sqrt{\chi^2_{\alpha,p}\,\lambda_1}\, e_1' \\ m_x - \sqrt{\chi^2_{\alpha,p}\,\lambda_1}\, e_1' \\ \vdots \\ m_x + \sqrt{\chi^2_{\alpha,p}\,\lambda_p}\, e_p' \\ m_x - \sqrt{\chi^2_{\alpha,p}\,\lambda_p}\, e_p' \end{bmatrix}$

A special case of Ω in which two quality characteristics x1 and x2 are measured is given below. Since two quality characteristics are involved, p = 2 and {(2p + 1) × p} becomes the {5 × 2} matrix

(6) $\Omega = \begin{bmatrix} m_x \\ m_x + \sqrt{\chi^2_{\alpha,2}\,\lambda_1}\, e_1' \\ m_x - \sqrt{\chi^2_{\alpha,2}\,\lambda_1}\, e_1' \\ m_x + \sqrt{\chi^2_{\alpha,2}\,\lambda_2}\, e_2' \\ m_x - \sqrt{\chi^2_{\alpha,2}\,\lambda_2}\, e_2' \end{bmatrix}$

where λi and ei are the eigenvalues and eigenvectors of $\hat{\Sigma}$ and $\chi^2_{\alpha,p}$ is the cutoff point of the ellipsoid of constant distance (see Johnson and Wichern [21] for details on Ω). If $m_x$ and $\hat{\Sigma}$ are robust, an OLS fit to Ω after it has been augmented with the ith row of X, one row at a time for i = 1, ⋯, n, can mimic the trend of the quality characteristics X. This way, observations that belong to the in-control process tend to cluster together, while those that belong to the out-of-control process, as well as those that exhibit certain outlier configurations, tend to cluster together and away from those of the in-control process. The OLS fit to the augmented Ω is referred to as the "addition-point OLS". Since the addition-point OLS is performed sequentially, one observation at a time, by adding one observation from X to Ω and fitting OLS to it, the time component (i) of $T_{crcc,i}$ is preserved. Furthermore, the proper working of Ω requires a robust $m_x$ and $\hat{\Sigma}$. Hence, we propose to use the forward search algorithm of Billor et al. [9] to first screen the data for likely outlier configurations and then compute from the screened data the mean vector $m_x$ and the covariance matrix $\hat{\Sigma}$ prior to the estimation of Ω. Armed with this synopsis, the crcc algorithms are presented below.
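A minimal R sketch of the anchor-point construction is given below, assuming robust estimates m_x and Sigma_hat are already available; using qchisq(0.975, p) as the chi-square cutoff matches the worked examples in Section 4.

    anchor_points <- function(m_x, Sigma_hat, level = 0.975) {
      p   <- length(m_x)
      ev  <- eigen(Sigma_hat)        # lambda_i and e_i of Sigma_hat
      cut <- qchisq(level, df = p)   # cutoff of the constant-distance ellipsoid
      axes <- do.call(rbind, lapply(seq_len(p), function(i) {
        a <- sqrt(cut * ev$values[i]) * ev$vectors[, i]
        rbind(m_x + a, m_x - a)      # the two endpoints of the i-th axis
      }))
      unname(rbind(m_x, axes))       # the {(2p + 1) x p} matrix Omega
    }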

The BACON Forward Search Algorithm

Billor et al. [9] proposed an algorithm, BACON (blocked adaptive computationally efficient outlier nominator), for outlier identification as well as estimation of the location and dispersion matrix. The BACON algorithm has been used extensively in the literature (see [10, 23–25]). The algorithm selects the initial subset of size m = p + 1 as the observations with indices corresponding to the smallest Euclidean distances from the coordinatewise median $\tilde{m}_x$.

Having selected the initial subset m, the algorithm then proceeds to increase the size of m by one observation at a time until the subset contains h observations. An observation is nominated as a candidate for m if its Mahalanobis distance is the smallest among the observations not included in m. The algorithm iterates this procedure until all n observations are screened. When h observations are in m, a decision criterion for nominating outliers is defined, the nominated outliers are removed from X, and the mean vector and covariance matrix are computed for the remaining observations in X using Eqs (2) and (3). The detailed algorithm is given below.

Algorithm 1.

Step 1: Let $\tilde{m}_x$ be the 1 × p vector of coordinatewise medians of X and compute the distances (7) $d_i = \lVert x_i - \tilde{m}_x \rVert,\ i = 1, \ldots, n$

Step 2: Select the initial subset, denoted m, of size (p + 1) to be the observations with indices corresponding to the smallest distances in Eq (7). Call this initial subset m the "basic subset".

Step 3: Compute the discrepancies (8) $d_i(m_b, C_b) = \sqrt{(x_i - m_b)'\, C_b^{-1}\, (x_i - m_b)}$ where $m_b$ and $C_b$ are the mean and covariance matrix of the observations in the basic subset. If $C_b$ is not of full rank, increase the initial subset m by adding observations whose indices correspond to the smallest distances in Eq (7) until $C_b$ has full rank.

Step 4: Define the new basic subset as all observations whose discrepancy in Eq (8) is less than $c_{npr}\,\chi_{p,\alpha/n}$, where $\chi^2_{p,\alpha/n}$ is the $(1-\alpha/n)$ percentile of the chi-square distribution with p degrees of freedom, $c_{npr} = c_{np} + c_{hr}$ is a correction factor with $c_{hr} = \max\{0, (h-r)/(h+r)\}$, $h = \lfloor (n+p+1)/2 \rfloor$, r the size of the current basic subset, and (9) $c_{np} = 1 + \frac{p+1}{n-p} + \frac{2}{n-1-3p}$ See Billor et al. [9] for details on the correction factor.

Step 5: Iterate steps 3 and 4 until the size of the basic subset no longer changes.

Step 6: Nominate the observations excluded from the final basic subset as outliers.

Step 7: The location and dispersion estimators are computed as the classical mean and covariance matrix of the observations in the final basic subset using Eqs (2) and (3).
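A compact R sketch of Algorithm 1 is given below, assuming X is an n × p data matrix; comparing the distances of Eq (8) against $c_{npr}$ times the square root of the chi-square percentile follows the convention of Billor et al. [9] and is our reading of Step 4.

    bacon_estimates <- function(X, alpha = 0.05) {
      n <- nrow(X); p <- ncol(X)
      h    <- floor((n + p + 1) / 2)
      c_np <- 1 + (p + 1) / (n - p) + 2 / (n - 1 - 3 * p)      # Eq (9)
      # Steps 1-2: basic subset nearest the coordinatewise median, Eq (7)
      d0 <- sqrt(rowSums(sweep(X, 2, apply(X, 2, median))^2))
      r0 <- p + 1
      repeat {                                                 # Step 3 rank check
        basic <- order(d0)[seq_len(r0)]
        if (qr(cov(X[basic, , drop = FALSE]))$rank == p) break
        r0 <- r0 + 1
      }
      repeat {                                                 # Steps 3-5
        mb <- colMeans(X[basic, , drop = FALSE])
        Cb <- cov(X[basic, , drop = FALSE])
        di <- sqrt(mahalanobis(X, mb, Cb))                     # Eq (8)
        c_npr <- c_np + max(0, (h - length(basic)) / (h + length(basic)))
        keep  <- which(di < c_npr * sqrt(qchisq(1 - alpha / n, p)))
        if (identical(sort(keep), sort(basic))) break
        basic <- keep
      }
      list(center   = colMeans(X[basic, , drop = FALSE]),      # Step 7
           cov      = cov(X[basic, , drop = FALSE]),
           outliers = setdiff(seq_len(n), basic))              # Step 6
    }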

The crcc Main Algorithm

Let $x_{ij},\ i = 1, \ldots, n,\ j = 1, \ldots, p$, be the jth quality characteristic in the ith sample. The algorithm below computes the proposed $T^2_{crcc,i}$ for a given multivariate quality characteristic at time i.

Algorithm 2.

Step 1: Compute the mean vector $m_x$ and the covariance matrix $\hat{\Sigma}$ using Algorithm 1 above.

Step 2: Determine a dependent variable, y, from among the quality characteristics $x_{ij},\ i = 1, \ldots, n,\ j = 1, \ldots, p$. Two choices are available to achieve this. One way is to first compute a covariance matrix from $x_{ij}$ and obtain from the covariance matrix the eigenvalues associated with each quality characteristic in $x_{ij}$. The variable or quality characteristic with the least eigenvalue is nominated as the dependent variable, while the remaining (p − 1) variables are treated as regressor variables. The second choice for determining the dependent variable is described below:

  1. Regress $x_j$ on all the other predictors for j = 1, ⋯, p. This yields p different regression models; for instance, suppose there are 3 quality characteristics denoted $x_{i1}$, $x_{i2}$ and $x_{i3}$, then the p = 3 regression models are (10) $x_{i1} = \beta_0 + \beta_1 x_{i2} + \beta_2 x_{i3} + e_i$ (11) $x_{i2} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i3} + e_i$ (12) $x_{i3} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i$
  2. For all the models in Eqs (10), (11) and (12), obtain the corresponding R²-values and denote the dependent variable corresponding to the model with the highest R² as the overall dependent variable, while the remaining (p − 1) variables are treated as regressors for the subsequent addition-point OLS (an R sketch of both choices follows this list).
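The R sketch below implements both choices under stated assumptions: choice one pairs the jth eigenvalue of the covariance matrix with the jth variable, as the worked examples in Section 4 do, and choice two uses the R² of each of the p regressions.

    pick_dependent <- function(X) {
      ev <- eigen(cov(X))$values                   # eigenvalues of the covariance
      by_eigen <- which.min(ev)                    # choice one: least eigenvalue
      r2 <- sapply(seq_len(ncol(X)), function(j)
        summary(lm(X[, j] ~ X[, -j]))$r.squared)   # Eqs (10)-(12) for p = 3
      by_r2 <- which.max(r2)                       # choice two: highest R^2
      c(by_eigenvalue = by_eigen, by_r2 = by_r2)
    }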

Step 3: Construct the {(2p + 1) × p} projector (anchor-point matrix) as described in Eq (6) above.

Step 4: Determine the (n × p) data matrix B, with the ith row of B denoted by the (1 × p) vector $b_i$, to be the estimates that result from an OLS regression on Ω augmented by the ith row of $x_{ij}$ (a sketch is given below).
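A minimal sketch of this step follows; placing the nominated dependent variable in the last column of both Ω and X is an illustrative convention, not something the article prescribes.

    addition_point_B <- function(X, Omega, y_col = ncol(X)) {
      t(apply(X, 1, function(xi) {
        aug <- rbind(Omega, xi)                 # augment Omega with the i-th row
        coef(lm(aug[, y_col] ~ aug[, -y_col]))  # b_i: addition-point OLS estimate
      }))
    }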

Step 5: Compute an (n × n) similarity matrix S whose elements are defined by (13) $s_{ij} = (b_i - b_j)'\,[\hat{\Sigma}(B)]^{-1}\,(b_i - b_j)$ The elements of S serve as a distance metric upon which an agglomerative hierarchical cluster (AHC) analysis with complete linkage is performed on the data. The AHC then partitions the dataset $x_{ij}$ into the main cluster Cm, containing at least h observations, while the remaining observations fall into one of τ minor clusters labeled Cτ1, Cτ2, Cτ3, ⋯. See [26–28] for details on cluster analysis.
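The following R sketch computes Eq (13) and the complete-linkage dendrogram; estimating Σ̂(B) by the classical covariance of B and fixing the number of clusters to cut are assumptions for illustration.

    crcc_clusters <- function(B, k = 4) {
      Sinv <- solve(cov(B))              # [Sigma_hat(B)]^{-1}, assumed classical
      n <- nrow(B)
      S <- matrix(0, n, n)
      for (i in seq_len(n)) for (j in seq_len(n)) {
        d <- B[i, ] - B[j, ]
        S[i, j] <- drop(t(d) %*% Sinv %*% d)           # Eq (13)
      }
      tree <- hclust(as.dist(S), method = "complete")  # AHC, complete linkage
      plot(tree)                                       # dendrogram as in Fig 1
      cutree(tree, k = k)                              # cluster labels
    }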

Step 6: Fit a regression model to the observations in the main cluster Cm and obtain from it the fitted values $\hat{y}_i,\ i = 1, \ldots, n$, as well as the prediction variance (14) $\hat{\sigma}^2_{\hat{y}} = \big(1.4826\,\mathrm{MAD}(r_i)\big)^2$ where $r_i$ are the residuals from the regression model, MAD is the median absolute deviation, and 1.4826 is a tuning constant that makes the MAD consistent at the normal distribution (see Maronna et al. [29] for details). At this stage, the data points in the minor clusters have not been used and are said to be inactive. The activation of the minor clusters is done sequentially, one cluster at a time, in the following way:

  1. Augment the main cluster Cm with the first minor cluster Cτ1 and obtain the OLS fitted values denoted $\hat{y}_{i+C_{\tau 1}},\ i = 1, \ldots, n$.
  2. Obtain the difference-in-fits statistic, DFFITS, as (15) $\mathrm{DFFITS}_{C_{\tau 1}} = \frac{\sum_{i}\big(\hat{y}_i - \hat{y}_{i+C_{\tau 1}}\big)^2}{\hat{\sigma}^2_{\hat{y}}}$
  3. If $\mathrm{DFFITS}_{C_{\tau 1}} \leq \chi^2_{\alpha,p}$, then the minor cluster Cτ1 is included in the main cluster; otherwise, the minor cluster is excluded from $x_{ij}$ and remains inactive throughout the crcc estimation process. This procedure is repeated for all minor clusters. Thus, observations that do not harm the fit produce small values of $\mathrm{DFFITS}_{C_{\tau}}$ and hence are activated, whereas outliers and mean-shift data points tend to produce large values and, as a result, are not activated (see the sketch after this list).
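A hedged R sketch of the activation test follows; the aggregated form of Eq (15) (sum of squared fit differences over the main-cluster points, scaled by the prediction variance of Eq (14)) and the data-frame layout with response column y are assumptions for illustration.

    activate_minor <- function(main_df, minor_df, sigma2_hat, p, level = 0.975) {
      f_main <- lm(y ~ ., data = main_df)                  # fit on main cluster
      f_aug  <- lm(y ~ ., data = rbind(main_df, minor_df)) # fit after augmentation
      dfits  <- sum((predict(f_main, main_df) -
                     predict(f_aug,  main_df))^2) / sigma2_hat  # Eq (15), assumed
      dfits <= qchisq(level, df = p)   # TRUE: activate the minor cluster
    }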

Step 7: After the minor clusters that satisfy the augmentation condition have been activated by augmenting the main cluster with them, the mean vector and covariance matrix for the crcc are estimated from the data points arising from this activation process as (16) $m_x^a = \frac{1}{n_a}\sum_{i=1}^{n_a} x_i^a$ and (17) $\hat{\Sigma}^a = \frac{1}{n_a - 1}\sum_{i=1}^{n_a}\big(x_i^a - m_x^a\big)\big(x_i^a - m_x^a\big)'$ where $n_a$ is the sample size of the augmented main cluster and $x_i^a$ is the p-dimensional multivariate quality characteristic in the current augmented main cluster. The corresponding $T^2_{crcc,i}$ control chart plots (18) $T^2_{crcc,i} = \big(x_i - m_x^a\big)'\big(\hat{\Sigma}^a\big)^{-1}\big(x_i - m_x^a\big)$
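A short R sketch of this final step, assuming active_idx indexes the augmented main cluster:

    crcc_statistic <- function(X, active_idx) {
      Xa  <- X[active_idx, , drop = FALSE]  # augmented main cluster, n_a rows
      m_a <- colMeans(Xa)                   # Eq (16)
      S_a <- cov(Xa)                        # Eq (17)
      mahalanobis(X, m_a, S_a)              # Eq (18): T^2_crcc,i for all i
    }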

Construction of Control Limits and Performance Evaluation of $T^2_{crcc,i}$

This section discusses the construction of the control limits of the proposed $T^2_{crcc,i}$ as well as an examination of its performance based on masking and swamping rates. Two other robust control chart methods extensively discussed in the literature will be compared with the proposed chart: the re-weighted MVE and MCD control charts, denoted $T^2_{RMVE,i}$ and $T^2_{RMCD,i}$ respectively. Examples of such discussions include Chenouri and Variyath [17], Chenouri et al. [18], Variyath and Vattathoor [19], Variyath and Vattathoor [20] and Jensen et al. [16].

Construction of Control Limits Using Simulation Algorithm

In order to implement the proposed $T^2_{crcc,i}$ control chart and to compare the detection performance of the various control charts, control limits have to be established first, especially when the methods considered are based on robust estimates of mean and variance whose distributions are rarely known. Several approaches have been used in the literature for scenarios where the distributions of the control chart parameters are unknown. For instance, Chenouri et al. [18] proposed a robust control chart method in which the classical mean and covariance estimators were replaced with those derived from the re-weighted version of the MCD algorithm. They constructed the control limits by using the results of a Monte Carlo simulation experiment to model the empirical distribution of the re-weighted MCD statistic through a family of regression curves. The control limit of their chart is then estimated from the empirical model at known values of the dimension p and the sample size n. A similar approach of constructing an empirical distribution in the form of a regression model can also be found in the work of Chenouri and Variyath [17] and Variyath and Vattathoor [19]. This approach works well in detecting small shifts in the mean, especially in scenarios where data are contaminated with outliers. A method that is not based on modelling the empirical distribution of the location and dispersion estimators can be found in Jensen et al. [16] and Variyath and Vattathoor [20]. This method constructs the control limits of the robust control charts by repeatedly computing the $T^2_{(n)}$ statistic over many iterations and then taking the (1 − α)100th percentile of the $T^2_{(n)}$ values as the established control limit. Our method follows this prescription, the objective being that $T^2_{crcc,i}$ is able to detect small shifts in the mean in the presence of outliers. Consequently, the estimation of the control limit for $T^2_{crcc,i}$, along with those of $T^2_{RMVE,i}$ and $T^2_{RMCD,i}$, is described below.

Given an overall false alarm rate of α, a control limit $c_\alpha$ can be obtained from Eq (19): (19) $P\big(T^2_{(n)} \leq c_\alpha\big) = 1 - \alpha$, where $T^2_{(n)} = \max_{1 \leq i \leq n} T_i^2$. Because of the invariance of the $T^2_{RMVE,i}$, $T^2_{RMCD,i}$ and $T^2_{crcc,i}$ statistics, μ and Σ of the in-control multivariate normal distribution are set to the zero vector 0 and the identity matrix I respectively. Applying Eq (19), the simulation runs for constructing the control limits can be performed using the following algorithm (an R sketch follows the list):

  1. Generate a dataset containing n independent observations from $N_p(0, I)$.
  2. Compute $T_i^2,\ i = 1, \ldots, n$, and obtain the largest value $T^2_{(n)}$.
  3. Repeat steps 1 and 2 100,000 times.
  4. Take the (1 − α)100th percentile of the $T^2_{(n)}$ values as the established control limit.
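The R sketch below implements these four steps for a generic charting statistic; chart_stat is a placeholder for a function (such as the crcc statistic of Eq (18)) returning the n statistics for one dataset, and the run count can be reduced for quick checks.

    simulate_limit <- function(chart_stat, n, p, alpha = 0.05, runs = 100000) {
      Tn <- replicate(runs, {
        X <- matrix(rnorm(n * p), n, p)  # step 1: N_p(0, I) sample
        max(chart_stat(X))               # step 2: largest statistic T^2_(n)
      })
      quantile(Tn, 1 - alpha)            # step 4: (1 - alpha)100th percentile
    }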

The resulting control limits at various levels of p and n are presented in Tables 1, 2 and 3 for $T^2_{crcc,i}$, $T^2_{RMVE,i}$ and $T^2_{RMCD,i}$ respectively.

Table 1. The simulated control limits for the $T^2_{crcc,i}$ statistic under various combinations of n and p at an overall fixed false alarm rate of 0.05.

https://doi.org/10.1371/journal.pone.0125835.t001

Table 2. The simulated control limits for the $T^2_{RMVE,i}$ statistic under various combinations of n and p at an overall fixed false alarm rate of 0.05.

https://doi.org/10.1371/journal.pone.0125835.t002

Table 3. The simulated control limits for the $T^2_{RMCD,i}$ statistic under various combinations of n and p at an overall fixed false alarm rate of 0.05.

https://doi.org/10.1371/journal.pone.0125835.t003

As the overall false alarm rate was fixed at 0.05, it can be seen from Tables 1, 2 and 3 that the control limits for the proposed $T^2_{crcc,i}$ are stable across the various levels of n and p when compared to those of $T^2_{RMVE,i}$ and $T^2_{RMCD,i}$. The $T^2_{RMVE,i}$ limits are moderate when compared to their $T^2_{RMCD,i}$ counterparts. It is important to mention that these limits work well for higher dimensions, say p = 5–10. However, with small sample sizes, the control limits may not be reliable.

Performance Evaluation Through Monte Carlo Simulation Experiment

In order to examine the performance of the proposed crcc in effectively detecting outliers and mean shifts while simultaneously minimizing the swamping rate, a Monte Carlo simulation experiment is performed. In the course of the evaluation, the proposed method is compared with other robust multivariate control chart methods, namely the RMVE and RMCD. The detection rates are used as the yardstick for the performance evaluation. The detection rate is computed as the ratio of the number of outliers or mean shifts correctly identified by a method to the total number of outliers or mean shifts in the dataset. Precisely, let the number of outliers in the dataset be π = ⌊γm⌋ and the number of outliers detected by a control chart method at a given non-centrality parameter λ2 be π(λ2); then the detection rate of a method $T^2_{\ell,i}$, ℓ = crcc, RMVE, RMCD, i = 1, ⋯, m, is computed as (20) $\text{detection rate} = \frac{\pi(\lambda^2)}{\pi}$ where m is the sample size.

Following the methodology of Variyath and Vattathoor [19], the datasets are generated from a standard multivariate normal distribution MVN(0, Ip) with r = 100,000 sample runs of size m. Five regimes are considered, each specifying the sample size m and the dimension p: Regime 1: {m = 30, p = 2}; Regime 2: {m = 50, p = 3}; Regime 3: {m = 100, p = 5}; Regime 4: {m = 150, p = 7}; Regime 5: {m = 200, p = 10}. Within each regime, we considered four levels of data contamination, namely γ = 0.00, γ = 0.10, γ = 0.20 and γ = 0.25. Without loss of generality, we further assume that a process with the same covariance matrix has been shifted from μ0 = 0 to μ1 = (λ, 0, ⋯, 0)′, where the non-centrality parameter λ2 = (μ1 − μ0)′ Σ−1(μ1 − μ0) represents the size of the process shift and takes the values λ2 = 0, 5, 10, 15, 20. The clean observations were simulated for the indices i = 1, ⋯, m − h, where h = ⌊γm⌋ and γ is the fraction of data contamination, while the contaminated observations were simulated for the indices i = m − h + 1, ⋯, m. The simulation experiment is replicated for r = 100,000 runs, and for each dataset $x_{ijr}$ the detection rate is computed using the three methods. The detection rate is expected to be as close as possible to 1 for a method to be classified as performing well. However, the detection rate should be close to zero for the null model where λ2 = 0. Using Eq (20), the results of the simulation experiment are presented in Tables 4, 5 and 6 for the crcc, RMVE and RMCD control charts respectively.
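For reference, one simulation run under a given regime can be generated along the following lines; placing the ⌊γm⌋ shifted observations last and shifting only the first coordinate by √λ² (valid since Σ = I) follows the setup described above.

    gen_run <- function(m, p, gamma, lambda2) {
      X <- matrix(rnorm(m * p), m, p)   # clean MVN(0, I_p) observations
      k <- floor(gamma * m)             # number of contaminated points
      if (k > 0)                        # mean shift (lambda, 0, ..., 0)'
        X[(m - k + 1):m, 1] <- X[(m - k + 1):m, 1] + sqrt(lambda2)
      list(X = X, shifted = if (k > 0) (m - k + 1):m else integer(0))
    }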

Table 4. Simulation result of Performance Evaluation for crcc Control Chart.

https://doi.org/10.1371/journal.pone.0125835.t004

Table 5. Simulation result of Performance Evaluation for RMVE Control Chart.

https://doi.org/10.1371/journal.pone.0125835.t005

Table 6. Simulation result of Performance Evaluation for RMCD Control Chart.

https://doi.org/10.1371/journal.pone.0125835.t006

From Tables 4, 5 and 6, it can be seen that:

  1. At the null model, i.e. λ2 = 0, all the methods considered performed quite well; none of them identified any point as an outlier and/or out-of-control.
  2. The crcc has a higher detection rate than the RMVE and RMCD. This is most noticeable when the non-centrality parameter is small (say 5 ≤ λ2 ≤ 10). Hence, the crcc is sensitive in detecting small shifts in the in-control mean when outliers are present in the datasets.
  3. The RMVE and RMCD performed quite well, especially when the non-centrality parameter is large (say λ2 > 10).
  4. In all scenarios considered, the sample size m and the dimension p have little or no effect on the detection rate.
  5. The detection rate of all the methods considered increases as the non-centrality parameter (mean shift, λ2) and the level of contamination (outlier fraction, γ) increase. Hence, λ2 and γ are the two parameters that influence the detection rate.

Numerical Implementation and Real Life Data Application of crcc Algorithm

Artificial Data

In order to facilitate understanding of the crcc algorithm and its working mechanism, a follow-up numerical illustration is given below. Following the methodology of Lawrence et al. [22], a two-variable quality characteristic of sample size 12 is simulated in the following way: observations 1–9 come from $x_{i2} = 200 - 4x_{i1} + e_i$ with $x_{i1} \sim U[10, 20]$ and error term $e_i \sim N(\mu = 0, \sigma = 5)$. Observations 10–12 are deliberately planted outliers such that, for 10 ≤ i ≤ 12, $x_{i1} = (29, 30, 31)$ and $x_{i2} = (153.800, 155.800, 80.932)$. The data arising from this process are presented in Table 7 below.
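A hedged R sketch of how such a dataset could be regenerated is given below; the seed is hypothetical, and the planted values follow the text and Table 7.

    set.seed(2015)                            # hypothetical seed
    x1 <- runif(9, 10, 20)
    x2 <- 200 - 4 * x1 + rnorm(9, mean = 0, sd = 5)
    x1 <- c(x1, 29, 30, 31)                   # planted outliers, rows 10-12
    x2 <- c(x2, 153.800, 155.800, 80.932)
    X  <- cbind(x1, x2)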

The stepwise implementation of the crcc algorithm on the two-variable artificial data is presented below.

Step 1: Given that p = 2, n = 12, and h = ⌊(n + p + 1)/2⌋ = 7, the (1 × 2) mean vector mx and the (2 × 2) covariance matrix computed from algorithm 1 in section 2.1.1 are: (21) and (22)

Step 2: The eigenvalues of $x_{i1}$ and $x_{i2}$, denoted λ1 and λ2, computed from the covariance matrix $\hat{\Sigma}$ in Eq (3) are (23) Since $x_{i2}$ has the least eigenvalue, λ2 = 1.0042, it is nominated as the dependent variable $y_i$.

Step 3: The eigenvectors of $x_{i1}$ and $x_{i2}$, denoted $e_1$ and $e_2$, computed from the covariance matrix $\hat{\Sigma}$ in Eq (3) are (24) with $\chi^2_{0.975,2} = 7.3777$, and hence the (5 × 2) projector matrix described in Eq (6) is given as (25)

Step 4: The (12 × 2) data matrix B resulting from an OLS regression of Ω augmented by the ith row of xij is computed this way: Add the first row of the data in Table 7 to Ω to obtain the data in Table 8.

Notice that Ω has become a 6 × 2 matrix because row 1 of Table 7 has been merged with Ω in Eq (25). An OLS fit to the data in Table 8 results in the estimate $b_1 = (200.7503, -4.4460)$. This operation is performed n = 12 times. Note that the augmentation is done without replacement. The resulting estimates are presented in Table 9 below.

Step 5: The (n × n) similarity matrix computed for the data in Table 9, using Eq (13) is presented in Table 10 below

Table 10. Similarity matrix $s_{ij} = (b_i - b_j)'\,[\hat{\Sigma}(B)]^{-1}\,(b_i - b_j)$.

https://doi.org/10.1371/journal.pone.0125835.t010

The agglomerative hierarchical cluster analysis based on the similarity matrix produces the dendrogram plot in Fig 1 below.

Notice that the 3 outliers cluster separately while the inliers cluster together. The plot shows that the main cluster Cm contains observations 1–9, while the 3 minor clusters are Cτ1 = {10}, Cτ2 = {11}, Cτ3 = {12}.

Step 6: An OLS fit to the data points in the main cluster yields the estimates $b_j = (204.4238, -4.3132)$, and the corresponding prediction variance computed using Eq (14) is $\hat{\sigma}^2_{\hat{y}} = 8.4338$. The 3 minor clusters are investigated for likely activation through DFFITSτi. Their corresponding DFFITSτi values, as well as the cutoff value, are (26) Notice that the DFFITS statistic for the third cluster, Cτ3, is less than the cutoff value. Hence cluster 3, with observation number 12, is activated while clusters 1 and 2, with observation numbers 10 and 11 respectively, remain inactive.

Step 7: Having activated cluster 3, na = 10, and the crcc control chart parameters are computed as (27) and (28) The resulting control chart statistic $T^2_{crcc,i}$, alongside the RMVE control chart statistic $T^2_{RMVE,i}$, the RMCD control chart statistic $T^2_{RMCD,i}$ and the classical Hotelling $T_i^2$ control chart statistic, is presented in Table 11 below.

Table 11. The crcc, RMVE, RMCD, and classical Hotelling’s T2 statistics.

https://doi.org/10.1371/journal.pone.0125835.t011

The corresponding control charts for the four methods are presented in Fig 2 below.

Notice from Table 11 and Fig 2 above that the crcc control chart is able to detect the 3 spurious outliers planted at indices 10, 11 and 12 as out-of-control points. The crcc control chart is able to track the in-control trend, while the dendrogram plot in Fig 1 also depicts the outlier structure in the multivariate data. The RMVE and RMCD were able to detect observations 10–12 as out-of-control points. However, due to the swamping effect, they erroneously nominated observations 1 and 5, which were originally in-control points, as out-of-control points. Surprisingly, the classical Hotelling T2 computed with the function qcc in the R language could not identify any of the 3 mean-shift outlier points. This is because observations 10–12 attracted the mean towards themselves and away from the other in-control observations, so that the covariance matrix became arbitrarily large, the squared Mahalanobis distances computed from this mean and covariance became arbitrarily small and hence the 3 observations were masked.

The Pulp Fiber Real-life Dataset

Following the experimental design of Lee [30], Rousseeuw et al. [31] describe an experiment that was conducted to determine the effect of pulp fiber properties on the paper made from them. The pulp fiber properties that were measured for paper production are: mean fiber length, xi1, long fiber fraction, xi2, fine fiber fraction, xi3, and zero span tensile, xi4. The paper properties that were measured after production are: breaking length, yi1, elastic modulus, yi2, stress at failure, yi3, and burst strength, yi4. The dataset arising from this process contains n = 62 measurements as presented in Table 12.

The crcc step-wise algorithm for the pulp-fiber data is presented below.

Step 1: Given that p = 8, and h = ⌊(n + p + 1)/2⌋ = 35, the (1 × 8) mean vector mx and the (8 × 8) covariance matrix computed from algorithm 1 in section 2.1.1 are: (29) and (30)

Step 2: The eigenvalues of $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}, x_{i6}, x_{i7}$ and $x_{i8}$, denoted λ1, λ2, λ3, λ4, λ5, λ6, λ7 and λ8, computed from the covariance matrix $\hat{\Sigma}$ in Eq (30) are (31) Since $x_{i8}$ has the least eigenvalue, $\lambda_8 = 3.8 \times 10^{-5}$, it is nominated as the dependent variable $y_i$, and the remaining variables are treated as regressors at the regression stage of crcc.

Step 3: The eigenvectors of $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}, x_{i6}, x_{i7}$ and $x_{i8}$, denoted $e_1, e_2, e_3, e_4, e_5, e_6, e_7$ and $e_8$, computed from the covariance matrix $\hat{\Sigma}$ in Eq (30) are (32) with $\chi^2_{0.975,8} = 17.5346$, and hence the (17 × 8) projector matrix described in Eq (6) is given as (33)

Step 4: The (62 × 8) data matrix B resulting from an OLS regression of Ω augmented by the ith row of xij is presented in Table 13

Table 13. Addition-Point OLS matrix B for Pulp-fibre Dataset.

https://doi.org/10.1371/journal.pone.0125835.t013

Step 5: The (n × n) similarity matrix computed for the data in Table 13, using Eq (13) is used to perform an agglomerative hierarchical cluster analysis which yields the dendrogram plot in Fig 3 below.

Notice from Fig 3 that the main cluster Cm contains observations indexed 1–21, 23–27, 29–41, and 43–45. The remaining observations belong to the minor clusters Cτi such that Cτ1 = {22}, Cτ2 = {28}, Cτ3 = {42}, Cτ4 = {46, 47}, Cτ5 = {48}, Cτ6 = {49, 50}, Cτ7 = {51, 52, 56}, Cτ8 = {53, 54, 55, 57}, Cτ9 = {58}, Cτ10 = {59, 62}, Cτ11 = {60, 61}.

Step 6: An OLS fit to the data points in the main cluster yields the estimates $b_j$ = (−2.0016, 0.5229, −0.0129, −0.0081, −1.2572, 0.1118, 0.3068, 0.0882), and the corresponding prediction variance computed using Eq (14) is $\hat{\sigma}^2_{\hat{y}} = 0.0768$. The 11 minor clusters are investigated for likely activation through DFFITSτi. Their corresponding DFFITSτi values, as well as the cutoff value and activation status, are (34)

Notice that the DFFITS statistics for the second, third and eighth clusters are less than the cutoff value. Hence they are activated in the estimation of the crcc charting parameters while the other minor clusters remain inactive. The inactive observations are then removed from the pulp fiber dataset X. The resulting observations $x_{ij}^a$ are then used to compute the mean vector and covariance matrix.

Step 7: Having activated the 3 minor clusters, the crcc control chart parameters are computed thus: (35) and (36) The resulting control chart statistic, Tcrcc,i2 is presented in Table 14 below while the crcc control chart is plotted in Fig 4(a).

Fig 4.

(a) Control Chart for Pulp fiber Dataset. (b) Revised control Chart for Pulp fiber Dataset.

https://doi.org/10.1371/journal.pone.0125835.g004

Notice from Table 14 and Fig 4(a) that observations 22, 46–48, 51–52, 56, and 58–62 are out-of-control points. According to Rousseeuw et al. [31] and Jung [32], these out-of-control points correspond to observations produced from different woods and different pulping processes from those of the other observations. The details of these out-of-control points are given below: from the source of the data, it was found that all but the last four pulp samples (observations 59–62) were produced from fir wood. Furthermore, most of the out-of-control samples were obtained using different pulping processes. For instance, observation 62 is the only sample from a chemithermomechanical pulping process, observations 60 and 61 are the only samples from a solvent pulping process, and observations 51, 52, and 56 were obtained from a kraft pulping process. Finally, the smaller outliers (22, 46–48, and 58) were all Douglas fir samples. Consequently, these out-of-control points are removed from the dataset and the revised crcc control chart is constructed for the remaining observations. Thus the in-control parameters (mean vector and covariance matrix) computed using Algorithm 1 are: (37) and (38) while the revised control chart statistic $T^2_{crcc-revised,i}$ is presented in Table 15.

The crcc control chart is presented in Fig 4(a), while the revised control chart based on the in-control parameters is presented in Fig 4(b). Notice that the process is in a state of statistical control and hence these in-control parameters can be adopted as a standard for the pulping process and the paper produced from it.

The RMVE and RMCD control charts were also used to analyze the pulp fiber dataset via the function cov.rob in the MASS package of the R software. The results obtained were the same as those of the proposed method and hence are not reported.

Conclusion

The real-life application of quality management requires simultaneous monitoring of multiple quality characteristics. Real-life data from production and service processes often contain spurious observations whose causes can be traced to assignable variation. These spurious observations often go unnoticed if proper statistical techniques are not employed prior to control chart construction. A multivariate control chart method known as the cluster-regression control chart (crcc) is proposed to simultaneously screen the dataset for likely outlier structure and mimic the data trend prior to the construction of the control chart.

Most often, the assumptions needed for large sample theory are better approximated by the distribution of the trimmed data than by the entire data set, and it is often suggested that statistical analysis should be conducted on the "cleaned" data set from which the outliers have been deleted. The proposed method follows this prescription, our objective being that the parameters of the control chart can be better estimated when the outlying observations that depart from the trend exhibited by the bulk of the dataset have been removed. The proposed method is able to identify mean shifts and outliers in datasets while keeping masking and swamping under control.

No single robust estimation algorithm is uniformly outstanding; for any given estimator, it is easy to find outlier configurations and mean shifts in datasets for which the estimator fails. Hence, we note that the performance of the proposed crcc method can be quite poor when the level of data contamination goes beyond 40% of the sample size.

Acknowledgments

This work was supported in part by U.S.M. Fundamental Research Grant Scheme (FRGS) No. 203/PMATHS/6711319.

We are grateful to anonymous reviewers whose constructive comments have improved the quality of this article.

Author Contributions

Conceived and designed the experiments: HCO EA. Performed the experiments: HCO EA. Analyzed the data: HCO EA. Contributed reagents/materials/analysis tools: HCO. Wrote the paper: EA.

References

  1. Staudte RG, Sheather SJ. Robust Estimation and Testing. Wiley Series in Probability and Statistics. Wiley; 1990.
  2. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Statistics. Wiley; 1986.
  3. Montgomery DC, Peck EA, Vining GG. Introduction to Linear Regression Analysis. Wiley Series in Probability and Statistics. Wiley; 2001.
  4. Fan SKS, Huang HK, Chang YJ. Robust Multivariate Control Chart for Outlier Detection Using Hierarchical Cluster Tree in SW2. Quality and Reliability Engineering International. 2013;29(7):971–985.
  5. Jobe JM, Pokojovy M. A Multistep, Cluster-Based Multivariate Chart for Retrospective Monitoring of Individuals. Journal of Quality Technology. 2009;41(4):323–339.
  6. Mason RL, Young JC. Multivariate Statistical Process Control with Industrial Applications. ASA-SIAM Series on Statistics and Applied Probability. Society for Industrial and Applied Mathematics; 2002.
  7. Rousseeuw PJ. Multivariate estimation with high breakdown point. In: Grossmann W, Pflug G, Vincze I, Wertz W, editors. Mathematical Statistics and Applications, Vol. B. Dordrecht: Reidel; 1985. pp. 283–297.
  8. Rousseeuw PJ, van Driessen K. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics. 1999;41(3):212–223.
  9. Billor N, Hadi AS, Velleman PF. BACON: blocked adaptive computationally efficient outlier nominators. Computational Statistics & Data Analysis. 2000;34(3):279–298.
  10. Hubert M, Rousseeuw PJ, Verdonck T. A Deterministic Algorithm for Robust Location and Scatter. Journal of Computational and Graphical Statistics. 2012;21(3):618–637.
  11. Lopuhaä HP, Rousseeuw PJ. Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices. The Annals of Statistics. 1991;19(1):229–248.
  12. Cator EA, Lopuhaä HP. Asymptotic expansion of the minimum covariance determinant estimators. Journal of Multivariate Analysis. 2010;101(10):2372–2388.
  13. Cator EA, Lopuhaä HP. Central limit theorem and influence function for the MCD estimators at general multivariate distributions. Bernoulli. 2012;18(2):520–551.
  14. Pison G, Van Aelst S, Willems G. Small sample corrections for LTS and MCD. Metrika. 2002;55(1–2):111–123.
  15. Vargas JA. Robust estimation in multivariate control charts for individual observations. Journal of Quality Technology. 2003;35(4):367–376.
  16. Jensen WA, Birch JB, Woodall WH. High breakdown estimation methods for phase I multivariate control charts. Quality and Reliability Engineering International. 2007;23(5):615–629.
  17. Chenouri S, Variyath AM. A comparative study of phase II robust multivariate control charts for individual observations. Quality and Reliability Engineering International. 2011;27(7):857–865.
  18. Chenouri S, Steiner SH, Variyath AM. A multivariate robust control chart for individual observations. Journal of Quality Technology. 2009;41(3).
  19. Variyath AM, Vattathoor J. Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations. Journal of Quality and Reliability Engineering. 2013;2013.
  20. Variyath AM, Vattathoor J. Robust control charts for monitoring process variability in phase I multivariate individual observations. Quality and Reliability Engineering International. 2014;30(6):795–812.
  21. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Prentice Hall; 1992.
  22. Lawrence DE, Birch JB, Chen Y. Cluster-Based Bounded Influence Regression. Quality and Reliability Engineering International. 2014;30(1):97–109.
  23. Alih E, Ong HC. Robust Cluster-Based Multivariate Outlier Diagnostics and Parameter Estimation in Regression Analysis. Communications in Statistics - Simulation and Computation. 2014; accepted. Available from: http://dx.doi.org/10.1080/03610918.2014.960093.
  24. Alih E, Ong HC. An Outlier-resistant Test for Heteroscedasticity in Linear Models. Journal of Applied Statistics. 2015; accepted.
  25. Alih E, Ong HC. Cluster-based multivariate outlier identification and re-weighted regression in linear models. Journal of Applied Statistics. 2015;42(5):938–955.
  26. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. Wiley; 2005.
  27. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. Wiley; 2009.
  28. Everitt BS, Landau S, Leese M, Stahl D. Cluster Analysis. Wiley Series in Probability and Statistics. Wiley; 2011.
  29. Maronna RA, Martin RD, Yohai VJ. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. Wiley; 2006.
  30. Lee J. Relationships Between Properties of Pulp-Fibre and Paper. Unpublished doctoral thesis, Faculty of Forestry, University of Toronto; 1992.
  31. Rousseeuw PJ, Van Aelst S, Van Driessen K, Agulló J. Robust Multivariate Regression. Technometrics. 2004;46(3):293–305.
  32. Jung KM. Multivariate least-trimmed squares regression estimator. Computational Statistics and Data Analysis. 2005;48(2):307–316.