A fundamental phenomenon underlying the hierarchical nature of individual and group influences in multilevel research is that measurements on individuals (e.g., employees, students, patients) within the same group (e.g., organization, classroom, clinic) are presumably more similar than measurements on individuals in different groups. Accordingly, various forms of the intraclass correlation coefficient (ICC) have been proposed to represent the reliability or degree of resemblance among cluster members. Essentially, they can be interpreted as the proportion of the total variance of the response that is accounted for by the clustering or group cohesion. However, different conceptual frameworks and modeling formulations of a multilevel study ultimately lead to distinct definitions of the ICC. Comprehensive reviews and general guidelines for selecting the appropriate model and ICC as an interrater reliability measure in one-way random effects and two-way random effects or mixed effects models were provided in Bartko (1976), McGraw and Wong (1996), and Shrout and Fleiss (1979). Moreover, definitional issues and methodological appraisals concerning the ICC, interrater reliability, and interrater agreement can be found in Bliese (2000), James (1982), LeBreton et al. (2003), LeBreton and Senter (2008), and the references therein.

To assess the magnitude of similarity or interrelation of hierarchical data, the ICC(1) and ICC(2) indices based on the one-way random effects model are the two most frequently adopted reliability measures for the single score and average score ICCs, respectively, within the context of multilevel modeling. Specifically, the well-established single score and average score ICCs ρ and ρ* are defined as

$$ \rho =\frac{\sigma_{\gamma}^2}{\sigma_{\gamma}^2+{\sigma}_{\varepsilon}^2}, $$

and

$$ {\rho}^{*}=\frac{\sigma_{\gamma}^2}{\sigma_{\gamma}^2+{\sigma}_{\varepsilon}^2/K}, $$

respectively, where \( \sigma_{\gamma}^2 \) represents the between-group variance, \( \sigma_{\varepsilon}^2 \) is the within-group variance, and K is the group size. The two definitions reveal that the average score ρ* is always greater in magnitude than its single score counterpart ρ and that the magnitude of ρ* is strongly influenced by the group size.
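As a quick numerical check of these two definitions, the following minimal Python sketch (the variance components are hypothetical) evaluates ρ and ρ* for several group sizes:

```python
# Single score and average score ICCs from hypothetical variance components.
sigma2_gamma = 0.25  # between-group variance (illustrative value)
sigma2_eps = 1.00    # within-group variance (illustrative value)

rho = sigma2_gamma / (sigma2_gamma + sigma2_eps)  # single score ICC
for K in (5, 10, 25):
    rho_star = sigma2_gamma / (sigma2_gamma + sigma2_eps / K)
    print(f"K = {K:2d}: rho = {rho:.3f}, rho* = {rho_star:.3f}")
# rho* always exceeds rho and approaches 1 as the group size K grows.
```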

The most commonly used estimators of ρ and ρ* are given by

$$ \mathrm{I}\mathrm{C}\mathrm{C}(1)=\frac{MSB-MSW}{MSB+\left(K-1\right)MSW}=\frac{F*-1}{F*+K-1}, $$

and

$$ \mathrm{I}\mathrm{C}\mathrm{C}(2)=\frac{MSB-MSW}{MSB}=1-\frac{1}{F*}, $$

where MSB is the between-group mean square, MSW is the within-group mean square, and F* = MSB/MSW is the F statistic calculated from the one-way random effects model. The notation ICC(2) follows Bartko (1976), Bliese (2000), and James (1982); the same index has also been referred to as ICC(k) in McGraw and Wong (1996) and as ICC(1, k) in Shrout and Fleiss (1979). In general, ICC(1) is an estimate of effect size indicating the extent to which individual ratings are attributable to group membership, whereas ICC(2) estimates the reliability of the mean ratings furnished by a group of judges.
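The computations can be reproduced for any balanced layout; the following sketch (synthetic data generated purely for illustration) forms the two mean squares and both indices:

```python
import numpy as np

def icc_from_data(y):
    """ICC(1) and ICC(2) from an N-by-K array of balanced one-way data."""
    N, K = y.shape
    group_means = y.mean(axis=1)
    grand_mean = y.mean()
    msb = K * ((group_means - grand_mean) ** 2).sum() / (N - 1)    # between-group mean square
    msw = ((y - group_means[:, None]) ** 2).sum() / (N * (K - 1))  # within-group mean square
    f_star = msb / msw
    icc1 = (f_star - 1) / (f_star + K - 1)
    icc2 = 1 - 1 / f_star
    return icc1, icc2

rng = np.random.default_rng(0)
y = rng.normal(size=(30, 8)) + rng.normal(scale=0.6, size=(30, 1))  # synthetic groups
icc1, icc2 = icc_from_data(y)
print(f"ICC(1) = {icc1:.3f}, ICC(2) = {icc2:.3f}")
```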

The respective magnitudes of ICC(1) and ICC(2) allow researchers to appraise the level of observed variance of the single score and average score that is affected by clustering. For example, in a two-level analysis of the influence of classroom climate perceptions on individual students' levels of academic achievement, an ICC(1) value of 0.2 indicates that 20% of the observed variance in students' achievement scores is due to systematic between-classroom differences, relative to the total variance in achievement scores. In contrast, a value of ICC(2) = 0.8 indicates that 80% of the observed total variance in classroom average scores occurs at the classroom level. Consequently, the use and interpretation of ICC(1) and ICC(2) are appropriate when a researcher is interested in drawing inferences concerning the reliability of single scores and average scores, respectively.

To further illustrate the fundamental differences between the two indices, it is instructive to note that an ICC(1) value of 0.10 yields ICC(2) values of 0.7, 0.8, and 0.9 for group sizes of 21, 36, and 81, respectively. The review of climate studies in James (1982) showed that ICC(1) values range from 0.0 to 0.5 with a median of approximately 0.12. Moreover, Hedges and Hedberg (2007) reported that the resulting ICC(1) values for a variety of school performance studies are generally in the range of 0.10 to 0.25. In contrast, an ICC(2) value of 0.7 has been widely used as the minimum acceptable level of reliability for psychological measures. However, Lance, Butts, and Michels (2006) noted that many researchers did not provide adequate justification for the appropriateness of this commonly used cutoff, so it should not be treated as a universal standard.
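The arithmetic behind these figures can be checked directly from the definitions of ρ and ρ* (the relation is formalized as the Spearman-Brown formula in Eq. 2 below); a minimal sketch:

```python
# An ICC(1) value of 0.10 implies the following average score ICCs.
rho = 0.10
for K in (21, 36, 81):
    rho_star = K * rho / (1 + (K - 1) * rho)
    print(f"K = {K:2d}: rho* = {rho_star:.2f}")  # prints 0.70, 0.80, 0.90
```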

Within the context of one-way random effects modeling, it follows from standard results that \( E[MSB] = K\sigma_{\gamma}^2 + \sigma_{\varepsilon}^2 \) and \( E[MSW] = \sigma_{\varepsilon}^2 \) (McGraw & Wong, 1996, Table 3). Hence, (MSB − MSW)/K and MSW are unbiased estimators of \( \sigma_{\gamma}^2 \) and \( \sigma_{\varepsilon}^2 \), respectively. From the standpoint of estimation principles, ICC(1), introduced by Fisher (1938), is obtained by substituting the variance components in the population single score ICC with the corresponding unbiased estimators. Although this natural modification is intuitive and heuristic, ICC(1) is not an unbiased estimator of the corresponding individual rating ICC. The interested reader is referred to Searle, Casella, and McCulloch (1992) for further technical details of the various methods for estimating the within-group and between-group variances of random effects models. Note that Olkin and Pratt (1958) derived the minimum variance unbiased estimator of the single score ICC, but its use has been impeded by the lack of a closed-form expression; the corresponding computation requires a special purpose computer program (see, e.g., Donoghue & Collins, 1990). Moreover, Donner (1986) and Harris and Burch (2000) presented extensive discussions of compelling alternatives and associated properties for estimating the individual rating ICC.

In addition to the theoretical developments in the statistical literature, Bliese and Halverson (1998) suggested the corrected eta-squared formula as a modification of the sample eta-squared estimator to provide more accurate estimates of the single score ICC. With an emphasis on the analysis of group-level properties in organizational research, their empirical investigation focused on the behavior of the sample eta-squared estimator. The numerical results showed that sample eta-squared is a positively biased estimator of the individual rating ICC and that its performance varies with group size and the magnitude of the population intraclass correlation. However, they did not examine the inherent properties of the corrected eta-squared formula. Shieh (2012) recently showed that the corrected eta-squared estimator described in Bliese and Halverson (1998) is identical to the maximum likelihood estimator and presented an extensive comparison between their truncated versions for negative values. The corrected eta-squared estimator performs better when the underlying population single score ICC is small, whereas the adjusted ICC(1) has a clear advantage for medium and large magnitudes of the population individual rating ICC. Thus, the existing findings indicate that although ICC(1) is the best known estimator, it may not always be the best choice.

Unlike the prevalent attention given to ICC(1) and related indices in the analysis of multilevel questions, the theoretical properties and intrinsic appropriateness of ICC(2) for the estimation of the average score ICC have received insufficient consideration in the literature. Basically, the average score ICC is a function of the individual rating ICC through the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910). It can also be readily established that the formulation of ICC(2) is equivalent to the Spearman-Brown equation with the population single score ICC replaced by ICC(1). Note that a desirable estimation property of an individual rating ICC index for the population single score ICC does not naturally extend to the corresponding Spearman-Brown counterpart for the estimation of the associated population average score ICC. Despite the direct connection between the individual rating and average score ICCs, ICC(2) not only has a unique interpretation as a reliability index of the group mean rating, but also possesses completely different properties from ICC(1). The existing findings associated with ICC(1) are arguably not suitable for demonstrating the explicit performance of ICC(2). More importantly, the estimation problem of the mean rating ICC should be duly recognized, and it requires a unified and rigorous treatment to clarify the stochastic behavior of feasible solutions.

The continual use of ICC(2) as the standard average score ICC index without identification of its essential limitations does not facilitate sound interpretation and application of research findings. For the ultimate aim of selecting the most appropriate methodology, it is vital to ensure that the contrasting properties of ICC(2) and viable alternatives are thoroughly explicated. The present article aims to contribute to the literature on choosing the best index of the average score ICC within the framework of the one-way random effects model. First, a simplified expression is presented to synthesize the essential attributes of the single score ICC estimators in Gleason (1997) and Harris and Burch (2000), and the Spearman-Brown formula is then applied to obtain a useful class of estimators of the average score ICC. Second, in order to judge the merits of the various measures from the point estimation perspective, explicit analytic forms of the bias and mean square error (MSE) are derived for the considered mean rating ICC indices. Accordingly, the optimal estimators under the bias and MSE considerations are identified. Third, numerical appraisals are performed to illustrate the relative performance of the renowned ICC(2) and several desirable measures within the suggested family of average score ICC estimators. A discussion of the potential implications of the findings for both theoretical development and practical use in reliability studies is also presented.

Estimation of the average score intraclass correlation coefficient

Within the context of multilevel analysis, a widely used design is the one-way random effects model

$$ Y_{ij}=\mu +\gamma_i+\varepsilon_{ij},\kern1em i = 1, \dots, N;\kern0.5em j = 1, \dots, K, $$
(1)

where \( Y_{ij} \) is the jth individual measurement within group i, μ is the grand mean, and \( \gamma_i \) and \( \varepsilon_{ij} \) are independent random variables with \( \gamma_i \sim N(0, \sigma_{\gamma}^2) \) and \( \varepsilon_{ij} \sim N(0, \sigma_{\varepsilon}^2) \). The variance of \( Y_{ij} \) is then given by \( \sigma_{\gamma}^2 + \sigma_{\varepsilon}^2 \). Accordingly, the single score ICC \( \rho = \sigma_{\gamma}^2/(\sigma_{\gamma}^2 + \sigma_{\varepsilon}^2) \) is the simple correlation coefficient \( Corr(Y_{i1}, Y_{i2}) \) between any two observations, \( Y_{i1} \) and \( Y_{i2} \), in the same group. The single score ICC ρ may also be interpreted as the proportion of the total variance of an individual response that is accounted for by the clustering or group cohesion. In contrast, the simple correlation coefficient \( Corr({\overline{Y}}_{i(1)}, {\overline{Y}}_{i(2)}) \) between any two sets of mean measurements from the same group, \( {\overline{Y}}_{i(1)}=\sum_{j=1}^K{Y}_{ij}/K \) and \( {\overline{Y}}_{i(2)}=\sum_{j=K+1}^{2K}{Y}_{ij}/K \), is defined as the average score ICC \( \rho^{*} = \sigma_{\gamma}^2/(\sigma_{\gamma}^2 + \sigma_{\varepsilon}^2/K) \). It also represents the proportion of the total variance of the mean rating of a group of K judges that is accounted for by the grouping or cluster membership. The prominent coefficient ρ* can also be written in the form of the Spearman-Brown prediction formula ρ* = Ψ(ρ), where

$$ \Psi \left(\rho \right)=\frac{K\rho }{1+\left(K-1\right)\rho }. $$
(2)

It is straightforward to show that the ICC(2) index is simply the Spearman-Brown formula applied to ICC(1), that is, ICC(2) = Ψ{ICC(1)} for any value K > 1. This particular result was also noted in James (1982) and is more precise than the asymptotic equivalence between ICC(2) and Ψ{ICC(1)} demonstrated in Bliese (1998).
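The identity is easy to confirm numerically. The following sketch simulates model (1) under hypothetical variance components and verifies that ICC(2) and Ψ{ICC(1)} coincide:

```python
import numpy as np

def spearman_brown(rho, K):
    """Eq. 2: step-up of a single score ICC to an average of K scores."""
    return K * rho / (1 + (K - 1) * rho)

rng = np.random.default_rng(7)
N, K, sd_gamma, sd_eps = 50, 10, 0.6, 1.0  # hypothetical configuration
y = rng.normal(scale=sd_gamma, size=(N, 1)) + rng.normal(scale=sd_eps, size=(N, K))

group_means = y.mean(axis=1)
msb = K * ((group_means - y.mean()) ** 2).sum() / (N - 1)
msw = ((y - group_means[:, None]) ** 2).sum() / (N * (K - 1))
f_star = msb / msw
icc1 = (f_star - 1) / (f_star + K - 1)
icc2 = 1 - 1 / f_star
assert np.isclose(icc2, spearman_brown(icc1, K))  # holds exactly for any K > 1
```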

The simple notion of substituting the individual rating index ICC(1) into the Spearman-Brown prophecy formula to attain the average score measure ICC(2) suggests an integrated approach: existing procedures for estimating the individual rating ICC can be carried over to the estimation of the average score ICC. Because a complex functional form may see limited acceptance in practice, the most appealing feature of a practically useful index is its computational simplicity. To this end, only estimators with a convenient analytical form are considered here. The following unified expression is presented to accommodate and simplify the diverse individual rating estimators in Gleason (1997) and Harris and Burch (2000):

$$ \widehat{\rho}(c)=\frac{F*-c}{F*+cK-c}, $$
(3)

where c is a constant. Clearly, because \( \widehat{\rho}(1) = (F^{*}-1)/(F^{*}+K-1) = \mathrm{ICC}(1) \), the class \( \widehat{\rho}(c) \) includes ICC(1) as a special case when c = 1. More importantly, applying the Spearman-Brown equation to \( \widehat{\rho}(c) \) yields a simple closed-form expression for the suggested class of average score estimators:

$$ \widehat{\rho}*(c)=\Psi \left\{\widehat{\rho}(c)\right\}=1-\frac{c}{F*}. $$
(4)

As expected, ICC(2) = \( \widehat{\rho}*(1) \) is a significant member of \( \widehat{\rho}*(c) \). Recall that ICC(1) is obtained by replacing the variance parameters in the population ICC ρ with the corresponding unbiased estimators and is therefore often called the ANOVA estimator; for ease of illustration, the specific notation ICC(2) = \( {\widehat{\rho}}_{AV}^{*}={\widehat{\rho}}^{*}\left({c}_{AV}\right) \) with \( c_{AV} = 1 \) is used to denote this particular instance. The other notable choices \( \left\{{\widehat{\rho}}_{MO}^{*}, {\widehat{\rho}}_{ME}^{*}, {\widehat{\rho}}_{EF}^{*}, {\widehat{\rho}}_{ML}^{*}\right\} \) of \( \widehat{\rho}*(c) \) with \( c = \{c_{MO}, c_{ME}, c_{EF}, c_{ML}\} \) considered in Gleason (1997) and Harris and Burch (2000) are given as follows (a computational sketch appears after the list):

  • \( c_{MO} = \{N(N-3)(K-1)\}/\{(N-1)[N(K-1)+2]\} \) is the mode of the F distribution F{N − 1, N(K − 1)} with N − 1 and N(K − 1) degrees of freedom;

  • \( c_{ME} = F_{N-1,\ N(K-1),\ 0.5} \), the median of the F distribution F{N − 1, N(K − 1)};

  • \( c_{EF} = \{N(K-1)\}/\{N(K-1)-2\} \) is the expected value of the F distribution F{N − 1, N(K − 1)};

  • \( c_{ML} = N/(N-1) \) corresponds to the application of the maximum likelihood estimators of \( \sigma_{\gamma}^2 \) and \( \sigma_{\varepsilon}^2 \).
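As noted above, the four constants are simple to evaluate. The following sketch uses scipy; the function name c_constants is ours, and the printed values agree with the (N, K) = (10, 10) entries reported in the numerical section:

```python
from scipy.stats import f

def c_constants(N, K):
    """Constants c_MO, c_ME, c_EF, c_ML for the class in Eqs. 3 and 4."""
    d1, d2 = N - 1, N * (K - 1)
    c_mo = (N * (N - 3) * (K - 1)) / ((N - 1) * (N * (K - 1) + 2))  # mode of F(d1, d2)
    c_me = f.ppf(0.5, d1, d2)                                       # median of F(d1, d2)
    c_ef = d2 / (d2 - 2)                                            # mean of F(d1, d2)
    c_ml = N / (N - 1)                                              # maximum likelihood
    return c_mo, c_me, c_ef, c_ml

print([round(c, 4) for c in c_constants(10, 10)])  # [0.7609, 0.9339, 1.0227, 1.1111]
```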

It should be emphasized that the proposed class of average score estimators not only possesses a major advantage in ease of application, but also facilitates the exact theoretical justification of the associated properties presented in the Appendix. It follows from Eqs. A3 and A5 that \( {\widehat{\rho}}_{UB}^{*}={\widehat{\rho}}^{*}\left({c}_{UB}\right) \) and \( {\widehat{\rho}}_{MS}^{*}={\widehat{\rho}}^{*}\left({c}_{MS}\right) \) are the best unbiased and the best MSE estimators within the considered class of indices, respectively, where

$$ {c}_{UB}=\frac{N-3}{N-1}\kern0.5em \mathrm{and}\kern0.5em {c}_{MS}=\frac{N\left(N-5\right)\left(K-1\right)}{\left(N-1\right)\left\{N\left(K-1\right)+2\right\}}. $$

Unlike the computationally demanding minimum variance unbiased estimator of the individual rating ICC, these two optimal indices of the average score ICC are remarkably convenient for practical use.
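Both optimal constants reduce to one-line computations; in the sketch below (the helper name c_optimal is ours), the printed values for (N, K) = (10, 10) reproduce 0.7778 and 0.5435, as tabulated in the numerical section:

```python
def c_optimal(N, K):
    """Best unbiased (c_UB) and best MSE (c_MS) constants for rho*_hat(c)."""
    c_ub = (N - 3) / (N - 1)
    c_ms = (N * (N - 5) * (K - 1)) / ((N - 1) * (N * (K - 1) + 2))
    return c_ub, c_ms

print([round(c, 4) for c in c_optimal(10, 10)])  # [0.7778, 0.5435]
```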

An immediate observation from the estimation properties and optimal solutions is that the conventional ICC(2) is sub-optimal under both the bias and MSE principles, except for the special situation of ρ* = 1. Specifically, the corresponding bias and MSE of ICC(2) are

$$ Bias\left\{\mathrm{I}\mathrm{C}\mathrm{C}(2)\right\}=\frac{-2\left(1-{\rho}^{*}\right)}{N-3}\kern0.5em \mathrm{and}\kern0.5em MSE\left\{\mathrm{I}\mathrm{C}\mathrm{C}(2)\right\}={\left(1-{\rho}^{*}\right)}^2{M}_1, $$
(5)

respectively, where

$$ {M}_1=1-\frac{2\left(N-1\right)}{N-3}+\frac{{\left(N-1\right)}^2\left\{N\left(K-1\right)+2\right\}}{N\left(N-5\right)\left(N-3\right)\left(K-1\right)}. $$

This implies that ICC(2) is generally a negatively biased estimator of ρ*, and that the absolute bias and MSE decrease as ρ* increases for fixed values of N and K. Conversely, the dominant estimators \( {\widehat{\rho}}_{UB}^{*} \) and \( {\widehat{\rho}}_{MS}^{*} \) provide improvements over ICC(2) under the bias and MSE criteria, respectively. It is of both practical value and theoretical interest to further appraise the similarities and differences between the prescribed measures, but due to the complex nature of the resulting bias and MSE, a complete analytical treatment is not feasible. In order to present a comprehensive explication of the relative merits of the different indices, a detailed numerical study is conducted next to explore their estimation behavior.

Numerical illustrations

For the purpose of delineating the essential features of the average score ICC indices, an empirical investigation was performed under a wide range of model configurations. The bias and MSE calculations for the considered estimators require complete specifications of the number of groups, N, the number of judges in each group, K, and the underlying population individual rating ICC, ρ. The numerical computations are systematically conducted by fixing all but one of these three decisive attributes and varying the remaining one. More importantly, the actual bias and MSE were obtained by one-dimensional numerical integration with respect to an F probability distribution; the integration is theoretically exact provided that the integrand can be evaluated exactly.
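A sketch of that integration, assuming the standard distributional result that F* equals an F{N − 1, N(K − 1)} variate divided by 1 − ρ*; for c = 1 the computed bias matches the closed form in Eq. 5:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import f

def bias_mse_numeric(c, N, K, rho_star):
    """Bias and MSE of rho*_hat(c) = 1 - c/F* by integrating over the F density,
    using F* = F / (1 - rho*) with F ~ F(N - 1, N(K - 1))."""
    d1, d2 = N - 1, N * (K - 1)
    err = lambda x: (1 - c * (1 - rho_star) / x) - rho_star  # estimator minus target
    bias = quad(lambda x: err(x) * f.pdf(x, d1, d2), 0, np.inf)[0]
    mse = quad(lambda x: err(x) ** 2 * f.pdf(x, d1, d2), 0, np.inf)[0]
    return bias, mse

N, K, rho_star = 10, 10, 0.5
bias, _ = bias_mse_numeric(1.0, N, K, rho_star)  # c = 1 gives ICC(2)
print(bias, -2 * (1 - rho_star) / (N - 3))       # both approximately -0.1429
```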

Specifically, two different values (10 and 50) are considered for both the number of groups and the number of judges, leading to the four combined scenarios (N, K) = (10, 10), (10, 50), (50, 10), and (50, 50). It is clear from the Spearman-Brown equation that ρ* = Ψ(ρ) = Kρ/{1 + (K – 1)ρ} is a one-to-one function of ρ and, equivalently, ρ = ρ*/{K – (K – 1)ρ*} for a fixed value K > 1. For ease of exposition, the values of ρ are chosen so that the resulting ρ* ranges from 0 to 0.90 in increments of 0.1, together with 0.99. These model configurations are selected to cover a wide range of characteristics that are likely to occur in multilevel applications. Overall, the actual bias and MSE performance of the seven estimators \( \left\{{\widehat{\rho}}_{MS}^{*}, {\widehat{\rho}}_{MO}^{*}, {\widehat{\rho}}_{UB}^{*}, {\widehat{\rho}}_{ME}^{*}, \mathrm{ICC}(2), {\widehat{\rho}}_{EF}^{*}, {\widehat{\rho}}_{ML}^{*}\right\} \) is computed for each selected magnitude of ρ* under each of the four joined configurations of the two numbers of groups and two group sizes. The bias and MSE results for the four combinations of N and K are summarized in Tables 1, 2, 3 and 4 and Tables 5, 6, 7 and 8, respectively.

Table 1 The bias of average score intraclass correlation coefficient indices for N = 10 and K = 10
Table 2 The bias of average score intraclass correlation coefficient indices for N = 10 and K = 50
Table 3 The bias of average score intraclass correlation coefficient indices for N = 50 and K = 10
Table 4 The bias of average score intraclass correlation coefficient indices for N = 50 and K = 50
Table 5 The mean square error of average score intraclass correlation coefficient indices for N = 10 and K = 10
Table 6 The mean square error of average score intraclass correlation coefficient indices for N = 10 and K = 50
Table 7 The mean square error of average score intraclass correlation coefficient indices for N = 50 and K = 10
Table 8 The mean square error of average score intraclass correlation coefficient indices for N = 50 and K = 50

To provide a concrete illustration, the relative merits of the different estimators are represented by the relative absolute bias and the relative MSE, using ICC(2) as a convenient benchmark. It can be shown from the biases given in Eqs. A2 (in the Appendix) and 5 that the relative absolute bias RAB{\( \widehat{\rho}*(c) \)} of estimator \( \widehat{\rho}*(c) \) with respect to the benchmark ICC(2) is the ratio

$$ RAB\left\{\widehat{\rho}*(c)\right\}=\frac{\left| Bias\left\{\widehat{\rho}*(c)\right\}\right|}{\left| Bias\left\{\mathrm{I}\mathrm{C}\mathrm{C}(2)\right\}\right|}=\frac{\left|c\left(N-1\right)-N+3\right|}{2}. $$
(6)

The MSE formulation in Eq. A4 (Appendix) shows that the relative MSE RMSE{\( \widehat{\rho}*(c) \)} of estimator \( \widehat{\rho}*(c) \) with respect to the benchmark ICC(2) is the ratio

$$ RMSE\left\{\widehat{\rho}*(c)\right\}=\frac{MSE\left\{\widehat{\rho}*(c)\right\}}{MSE\left\{\mathrm{I}\mathrm{C}\mathrm{C}(2)\right\}}=\frac{M_c}{M_1}, $$
(7)

where \( M_c \) and \( M_1 \) are defined in Eqs. A4 and 5, respectively. It is important to note that the two relative indices RAB{\( \widehat{\rho}*(c) \)} and RMSE{\( \widehat{\rho}*(c) \)} do not depend on the underlying population ρ*. For a designated estimator \( \widehat{\rho}*(c) \), the associated values of RAB{\( \widehat{\rho}*(c) \)} and RMSE{\( \widehat{\rho}*(c) \)} vary only with the choice of N and K, and thus serve the comparison purpose in the practical situation where the underlying population ρ* is unknown. The selected constant c, the computed relative absolute bias, and the relative MSE are also reported in the tables.
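Both ratios can be reproduced directly; the sketch below evaluates Eq. 6 in closed form and Eq. 7 by numerical integration for all seven estimators at (N, K) = (10, 10). Since both ratios are free of ρ*, the illustrative choice ρ* = 0.5 is immaterial:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import f

N, K = 10, 10
d1, d2 = N - 1, N * (K - 1)
constants = {
    "MS": (N * (N - 5) * (K - 1)) / ((N - 1) * (N * (K - 1) + 2)),
    "MO": (N * (N - 3) * (K - 1)) / ((N - 1) * (N * (K - 1) + 2)),
    "UB": (N - 3) / (N - 1),
    "ME": f.ppf(0.5, d1, d2),
    "ICC(2)": 1.0,
    "EF": d2 / (d2 - 2),
    "ML": N / (N - 1),
}

def mse(c, rho_star=0.5):
    """MSE of rho*_hat(c) by integration; F* = F / (1 - rho*), F ~ F(d1, d2)."""
    err = lambda x: (1 - c * (1 - rho_star) / x) - rho_star
    return quad(lambda x: err(x) ** 2 * f.pdf(x, d1, d2), 0, np.inf)[0]

mse_icc2 = mse(1.0)  # benchmark
for name, c in constants.items():
    rab = abs(c * (N - 1) - N + 3) / 2  # Eq. 6, exact
    print(f"{name:>6}: c = {c:.4f}, RAB = {rab:.3f}, RMSE = {mse(c)/mse_icc2:.3f}")
```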

An inspection of the results in the presented tables shows that the chosen constants of the estimators \( \left\{{\widehat{\rho}}_{MS}^{*}, {\widehat{\rho}}_{MO}^{*}, {\widehat{\rho}}_{UB}^{*}, {\widehat{\rho}}_{ME}^{*}, \mathrm{ICC}(2), {\widehat{\rho}}_{EF}^{*}, {\widehat{\rho}}_{ML}^{*}\right\} \) follow the consistent order \( c_{MS} < c_{MO} < c_{UB} < c_{ME} < c_{AV} < c_{EF} < c_{ML} \) for all four settings of N and K. Specifically, the actual values of \( \{c_{MS}, c_{MO}, c_{UB}, c_{ME}, c_{AV}, c_{EF}, c_{ML}\} \) are

  • {0.5435, 0.7609, 0.7778, 0.9339, 1.0000, 1.0227, 1.1111} for (N, K) = (10, 10);

  • {0.5533, 0.7746, 0.7778, 0.9283, 1.0000, 1.0041, 1.1111} for (N, K) = (10, 50);

  • {0.9143, 0.9549, 0.9592, 0.9879, 1.0000, 1.0045, 1.0204} for (N, K) = (50, 10);

  • {0.9176, 0.9584, 0.9592, 0.9867, 1.0000, 1.0008, 1.0204} for (N, K) = (50, 50).

It follows that the resulting values of \( \{c_{MS}, c_{MO}, c_{ME}, c_{EF}\} \) vary with (N, K), the two components \( \{c_{UB}, c_{ML}\} \) depend only on N, and \( c_{AV} = 1 \) is a fixed value.

For the relative absolute biases in Tables 1, 2, 3 and 4, the ordering follows directly from Eq. 6: because it can be rewritten as RAB{\( \widehat{\rho}*(c) \)} = (N − 1)|c − \( c_{UB} \)|/2, the estimators are ranked by the distance of their constants from \( c_{UB} \). Accordingly, the tables reveal the same order for all four combinations of N and K:

RAB{\( {\widehat{\rho}}_{UB}^{*} \)} < RAB{\( {\widehat{\rho}}_{MO}^{*} \)} < RAB{\( {\widehat{\rho}}_{ME}^{*} \)} < RAB{ICC(2)} < RAB{\( {\widehat{\rho}}_{MS}^{*} \)} < RAB{\( {\widehat{\rho}}_{EF}^{*} \)} < RAB{\( {\widehat{\rho}}_{ML}^{*} \)}.

Obviously, RAB{\( {\widehat{\rho}}_{UB}^{*} \)} = 0 because \( {\widehat{\rho}}_{UB}^{*} \) is an unbiased index of ρ*. The resulting biases imply that \( {\widehat{\rho}}_{MS}^{*} \) and \( {\widehat{\rho}}_{MO}^{*} \) are positively biased, while \( {\widehat{\rho}}_{ME}^{*} \), ICC(2), \( {\widehat{\rho}}_{EF}^{*} \), and \( {\widehat{\rho}}_{ML}^{*} \) tend to underestimate ρ*. In addition, RAB{\( {\widehat{\rho}}_{MO}^{*} \)} and RAB{\( {\widehat{\rho}}_{ME}^{*} \)} are consistently less than RAB{ICC(2)} = 1. It is also interesting to see that RAB{\( {\widehat{\rho}}_{MS}^{*} \)} and RAB{\( {\widehat{\rho}}_{EF}^{*} \)} are marginally larger than RAB{ICC(2)}, and RAB{\( {\widehat{\rho}}_{ML}^{*} \)} = 1.5 for all N and K. Hence, \( {\widehat{\rho}}_{UB}^{*} \), \( {\widehat{\rho}}_{MO}^{*} \) and \( {\widehat{\rho}}_{ME}^{*} \) perform better than ICC(2) under the bias principle.

On the other hand, the relative MSEs follow the same order as the c constants for all four combinations of N and K:

RMSE{\( {\widehat{\rho}}_{MS}^{*} \)} < RMSE{\( {\widehat{\rho}}_{MO}^{*} \)} < RMSE{\( {\widehat{\rho}}_{UB}^{*} \)} < RMSE{\( {\widehat{\rho}}_{ME}^{*} \)} < RMSE{ICC(2)} < RMSE{\( {\widehat{\rho}}_{EF}^{*} \)} < RMSE{\( {\widehat{\rho}}_{ML}^{*} \)}.

The consistency indicates that \( {\widehat{\rho}}_{MS}^{*} \) incurs the smallest MSE, whereas \( {\widehat{\rho}}_{ML}^{*} \) has the largest value. However, the differences between RMSE{\( {\widehat{\rho}}_{MO}^{*} \)} and RMSE{\( {\widehat{\rho}}_{UB}^{*} \)}, and between RMSE{ICC(2)} and RMSE{\( {\widehat{\rho}}_{EF}^{*} \)}, are minor. The evidence still suggests that ICC(2) is not a prudent selection: only RMSE{\( {\widehat{\rho}}_{ML}^{*} \)} is substantially greater than RMSE{ICC(2)}, and the four measures \( {\widehat{\rho}}_{MS}^{*} \), \( {\widehat{\rho}}_{MO}^{*} \), \( {\widehat{\rho}}_{UB}^{*} \), and \( {\widehat{\rho}}_{ME}^{*} \) dominate ICC(2) in terms of MSE.

Comparatively, \( {\widehat{\rho}}_{EF}^{*} \) and \( {\widehat{\rho}}_{ML}^{*} \) are the two worst estimators under both the bias and MSE criteria. Although the individual rating estimator \( \widehat{\rho}\left({c}_{ML}\right) \) has been a competitively accurate index of ρ when the individual rating ICC is small (Shieh, 2012), its counterpart \( {\widehat{\rho}}_{ML}^{*} \) appears to be unsatisfactory in estimating ρ* regardless of the true magnitude of the average score ICC. More importantly, \( {\widehat{\rho}}_{MO}^{*} \), \( {\widehat{\rho}}_{UB}^{*} \), and \( {\widehat{\rho}}_{ME}^{*} \) outperform ICC(2) under both the bias and MSE considerations, and \( {\widehat{\rho}}_{MS}^{*} \) provides a strong alternative to ICC(2) with respect to the MSE criterion. Thus, the conventional use of ICC(2) for the estimation of the mean rating ICC is supported neither analytically nor empirically. For the primary reasons of statistical efficiency and computational ease, it is sensible to employ the simple and robust alternatives.

In general, the performance of the prescribed estimators improves with an increasing ρ, a larger number of groups N, or a greater group size K when all other features remain constant. The only exception is \( {\widehat{\rho}}_{UB}^{*} \), which is unbiased for all N > 3, K > 1, and 0 ≤ ρ < 1. Although the specific magnitudes of N and K have a concurrent impact on estimation behavior, the influence of the number of groups differs from that of the group size. Note that the settings (N, K) = (10, 50) and (50, 10) of the one-way random effects model have the identical total sample size of 500, but the accuracy and efficiency of the seven estimators in Tables 3 and 7 with (N, K) = (50, 10) are consistently better than the corresponding results in Tables 2 and 6 with (N, K) = (10, 50). Consequently, the discrepancy between the numerical assessments implies that an increase in the number of groups, rather than in the number of judges per group, yields a more pronounced improvement in estimation for a given total sample size. This phenomenon is also confirmed by the additional estimation results for the two configurations (N, K) = (20, 25) and (25, 20) with the same total sample size of 500. Accordingly, this finding may help researchers justify their allocation scheme when planning reliability studies in advance.
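The allocation effect can be seen directly from the MSE factor \( M_1 \) in Eq. 5; the sketch below (the helper name m1_factor is ours) evaluates it for four designs sharing the total sample size NK = 500:

```python
def m1_factor(N, K):
    """The factor M1 in Eq. 5: MSE{ICC(2)} = (1 - rho*)^2 * M1."""
    return (1 - 2 * (N - 1) / (N - 3)
            + (N - 1) ** 2 * (N * (K - 1) + 2)
            / (N * (N - 5) * (N - 3) * (K - 1)))

for N, K in [(10, 50), (20, 25), (25, 20), (50, 10)]:  # all satisfy NK = 500
    print(f"N = {N:2d}, K = {K:2d}: M1 = {m1_factor(N, K):.4f}")
# M1 drops sharply as judges are traded for groups: more groups help more.
```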

Discussion and conclusions

This article concerns the use of ICC(2) as an average score ICC measure within the context of the one-way random effects model. Despite its routine application in research across many scientific fields, the fundamental properties of the ICC(2) formula are seldom addressed. Although ICC(1) and ICC(2) are conceptually distinct indices, ICC(2) may have been treated as a trivial application of the Spearman-Brown equation to ICC(1). This research contributes to the reliability literature by considering the various issues in choosing the best average score ICC index, with analytic clarifications and numerical expositions.

First, the estimation of the average score ICC is appropriately recognized as a unique and distinct task in reliability research. The essential attributes of individual rating ICC estimators and the Spearman-Brown formula are synthesized to present a family of average score ICC indices that subsumes ICC(2) as a special case. Accordingly, the choice of the suggested formulation is motivated by its advantages of methodological transparency, analytic tractability, and computational simplicity. Second, exact estimation properties of the suggested class of measures are derived to facilitate the comparison of the strengths and weaknesses of the different indices. Within the proposed class of estimators, the best unbiased and the best MSE mean rating ICC indices are identified. Consequently, the theoretical implications and computational ease of the superior alternatives strongly suggest that ICC(2) is sub-optimal and that its continued practical use is difficult to justify.

A potential deficiency of ICC(2) and the other average score ICC estimators is that they can assume negative values even though the ICC is defined as a non-negative parameter. In practice, the estimate is often set to zero when this occurs. Although this simple and intuitive adjustment has practical appeal, it inherently alters the behavior of the estimator, and a single simple formula for the bias and MSE can no longer be obtained. However, the conducted Monte Carlo simulation study showed that the bias and MSE performance of the average score ICC indices is essentially unchanged unless the population individual rating ICC is extremely small. A simple explanation is that even trivial values of ICC(1) translate into sizeable values of ICC(2), so truncation occurs less often for the average score ICC indices than for the corresponding individual rating ICC estimators. Hence, the numerical details are not reported here.
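A minimal Monte Carlo sketch of this truncation effect (the configuration is illustrative, with a deliberately small individual rating ICC):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, rho = 10, 10, 0.02                          # small individual rating ICC
rho_star = K * rho / (1 + (K - 1) * rho)
sd_gamma, sd_eps = rho ** 0.5, (1 - rho) ** 0.5   # unit total variance

raw, truncated = [], []
for _ in range(20000):
    y = rng.normal(scale=sd_gamma, size=(N, 1)) + rng.normal(scale=sd_eps, size=(N, K))
    gm = y.mean(axis=1)
    msb = K * ((gm - y.mean()) ** 2).sum() / (N - 1)
    msw = ((y - gm[:, None]) ** 2).sum() / (N * (K - 1))
    icc2 = 1 - msw / msb
    raw.append(icc2)
    truncated.append(max(icc2, 0.0))              # zero-truncated version

print(np.mean(raw) - rho_star, np.mean(truncated) - rho_star)  # bias with and without truncation
```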

According to the editorial guidelines and methodological recommendations of several prominent educational and psychological journals, it is necessary to include measures of effect size and confidence intervals for all primary outcomes (Alhija & Levy, 2009; Fritz, Morris & Richler, 2012; Kelley & Preacher, 2012; Odgaard & Fowler, 2010; Peng et al., 2013; Sun, Pan & Wang, 2010). Among the several inadequate effect size reporting and interpretation practices, Alhija and Levy (2009) and Peng et al. (2013) especially emphasized that the majority of popular effect size measures, such as the standardized mean difference (Cohen's d), the strength of association measure \( {\widehat{\eta}}^2 \), and the sample squared multiple correlation coefficient R², are positively biased estimators. These indices are obtained by replacing population parameters with the corresponding sample statistics; however, a combination of unbiased component estimators does not necessarily yield an unbiased estimator of the whole. To expedite the advocated reform of statistical reporting practices, researchers should prudently apply unbiased estimators or other improved formulas in the selection and computation of appropriate effect size measures. Note that unbiasedness is not the only criterion of theoretical importance; mean square error is another useful performance criterion that incorporates both the bias (accuracy) and the variability (precision) of an estimator. A thorough explication and comparison of effect sizes under various frameworks certainly facilitate the assessment of scientific findings and the accumulation of advanced knowledge. This research provides an update and explication of different average score ICC indices that helps clarify how to evaluate the strength of group properties and how to choose an appropriate effect size estimate in multilevel analysis. On the other hand, a thorough coverage of inferential procedures for hypothesis testing and interval estimation of various average score ICCs is presented in McGraw and Wong (1996).