Cognition is an umbrella term that refers to various higher-order behavioral abilities, such as memory, attention, and executive functions (Lezak, Howieson, & Loring, 2004). These higher-order behavioral abilities are latent variables that cannot be directly observed. Instead, they have to be inferred from proxy measures (Mitrushina, Boone, Razani, & D’Elia, 2005). For example, a person’s verbal memory cannot be directly observed; what can be observed is the person’s ability to recall verbal material that is presented in a specific standardized test setting.

Cognitive assessment is widely used in medical settings and in the behavioral sciences—for example, in clinical psychology, educational practice, and rehabilitation settings (Lezak et al., 2004; Pasquier, 1999). In diagnostic settings, the “raw” score of a person on a cognitive test (e.g., the number of items that were recalled in a memory test) is usually not of direct interest. The reason for this is that the raw scores on cognitive tests are strongly affected by demographic variables (such as age and educational level; Mitrushina et al., 2005; Strauss, Sherman, & Spreen, 2006; Van der Elst, 2006). For example, the same raw test score may be indicative of a severe memory problem in a 50-year-old person, while it is within the normal limits of test performance for an 80-year-old person (Van der Elst, Van Boxtel, Van Breukelen, & Jolles, 2005). Clinicians therefore use relative measures (rather than raw test scores) to evaluate a patient’s test performance (e.g., what is the percentage of demographically matched “cognitively healthy” peers who obtain a test score that is equal to or worse than the test score of this patient?). So-called normative data are used to convert raw test scores into demographically corrected relative measures (Mitrushina et al., 2005; Van der Elst, 2006).

In many diagnostic situations, the same cognitive test (or a parallel test version) is repeatedly administered to the same person. For example, a clinician may need to determine whether a patient with mild cognitive impairment has experienced cognitive decline since his or her last evaluation, or a clinician may need to evaluate whether a stroke patient has benefited from taking part in a rehabilitation program. Ideally, the observed changes in the test scores at subsequent measurement occasions would be directly interpretable in terms of true changes in the latent cognitive trait of interest. This is, however, generally not the case (Calamia, Markon, & Tranel, 2012). The main reason for this is that practice effects occur in serial testing situations. Practice effects refer to a variety of factors—such as procedural learning, memory for specific items, and increased comfort with formal testing situations (McCaffrey, Duff, & Westervelt, 2000)—that result in systematic improvements in test scores at retesting occasions, even though there was no true change in the latent trait that is measured by the cognitive test (Bartels, Wegrzyn, Wiedl, Ackermann, & Ehrenreich, 2010; Calamia et al., 2012; Dikmen, Heaton, Grant, & Temkin, 1999; Temkin, Heaton, Grant, & Dikmen, 1999; Van der Elst, Van Breukelen, Van Boxtel, & Jolles, 2008). Practice effects are especially pronounced when the test–retest intervals are short (e.g., Theisen, Rapport, Axelrod, & Brines, 1998), but they also occur in studies with test–retest intervals of several years (Rönnlund & Nilsson, 2006; Salthouse, Schroeder, & Ferrer, 2004). In the latter case, the changes in the test scores over time reflect the combined influences of practice effects and true changes in the latent cognitive abilities (Van der Elst et al., 2008). Furthermore, the extent to which practice effects occur is affected by person characteristics such as the age and the educational level of a tested person (Mitrushina & Satz, 1991; Rapport, Brines, Axelrod, & Theisen, 1997; Stuss, Stethem, & Poirier, 1987; Van der Elst et al., 2008).

Failure to take practice effects into account may invalidate the conclusions that are drawn from a serial cognitive assessment (Calamia et al., 2012; Van der Elst et al., 2008). For example, practice effects may mask the cognitive decline in a patient with early dementia, or practice effects may lead to the incorrect conclusion that a stroke patient has benefited from a rehabilitation program. Normative data for serial cognitive assessment should thus take the testing history and the demographic characteristics of a patient into account, but it is not clear which statistical method is optimal for achieving this aim (Heaton et al., 2001; Temkin et al., 1999; Van der Elst et al., 2008).

Existing normative methods

In nonserial (i.e., single-measurement) cognitive testing situations, normative data are established on the basis of classical univariate statistical methods. For example, an often-used procedure is the regression-based normative approach (Testa, Winicki, Pearlson, Gordon, & Schretlen, 2009; Van Breukelen & Vlaeyen, 2005; Van der Elst et al., in press; Van der Elst et al., 2006a, 2006b, 2006c, 2006d). In this method, a classical multiple linear regression model is fitted to the data of a large sample of cognitively healthy people who were administered the cognitive test of interest (the normative sample). The multiple linear regression model assumes that \( Y = X\beta + \varepsilon \), where Y is the vector of the responses, X is the design matrix (which typically includes age, gender, and educational level in normative studies), β is the vector of regression parameters, and ε is the vector of the residual components (for details on this model, see, e.g., Kutner, Nachtsheim, Neter, & Li, 2005).

On the basis of the established regression model, the test performance of a future patient j can be evaluated. This requires three steps. First, the expected test score of patient j is computed (i.e., \( \widehat{Y}_j = X_j\widehat{\beta} \)). This score reflects the expected test score for a cognitively healthy person who has the same demographic background as the tested patient. Second, the difference between the patient’s observed and expected test scores is computed (i.e., \( e_j = Y_j - \widehat{Y}_j \)) and standardized (i.e., \( z_j = e_j / SD(e) \)), where SD(e) is the SD of the residuals in the normative sample (which is approximately equal to the positive square root of the residual mean squares). Third, the standardized residual of the patient is converted into a percentile value as based on the distribution of the standardized residuals in the normative sample. A percentile value below 5 is often considered to be indicative of a cognitive problem (because 95 % of the “cognitively healthy” people perform better).
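To make the three steps concrete, a minimal R sketch is given below. The data frame norm_df and all variable names are hypothetical; the code is merely one possible implementation of the approach, not the procedure used in any of the cited normative studies.

```r
# A minimal sketch of the regression-based normative approach for a single
# measurement occasion. The data frame 'norm_df' and the variable names
# (score, age, gender, educ) are hypothetical.
fit  <- lm(score ~ age + gender + educ, data = norm_df)
sd_e <- sd(residuals(fit))   # SD of the residuals (~ square root of the residual mean squares)

# Step 1: expected score for a patient with a given demographic profile
patient  <- data.frame(age = 70, gender = 1, educ = "average", score = 25)
expected <- predict(fit, newdata = patient)

# Step 2: standardized residual
z <- (patient$score - expected) / sd_e

# Step 3: percentile value (assuming approximately normal standardized residuals)
percentile <- 100 * pnorm(z)
```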

An important assumption of the classical linear regression model is that \( \sigma^2\{\varepsilon\} = \sigma^2 I \) (with I = an n × n identity matrix). Thus, it is assumed that the residuals (or, equivalently, the responses) are uncorrelated. This assumption is not realistic in serial cognitive-testing situations, because the cognitive test scores at subsequent measurement occasions tend to be highly correlated within individuals (Dikmen et al., 1999; Lezak et al., 2004; Temkin et al., 1999; Van der Elst et al., 2008). One possible solution for dealing with this problem is to summarize the repeated measurements of each person into a change score (change = endpoint score − baseline score) and, subsequently, to regress these change scores on the demographic covariates of interest in the normative sample (the regression-based change approach). Alternatively, the dependence issue can be solved by fitting a model in which the endpoint scores are regressed on the baseline scores and the demographic covariates in the normative sample (the ANCOVA approach).
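For illustration, the two approaches could be implemented as follows (a sketch; the wide-format data frame norm_wide and the variable names are assumptions):

```r
# Hypothetical sketch of the two existing approaches, assuming a wide-format
# normative data frame 'norm_wide' with a baseline and an endpoint test score.

# Regression-based change approach: regress the change score on the demographics
fit_change <- lm(I(endpoint - baseline) ~ age + gender + educ, data = norm_wide)

# ANCOVA approach: regress the endpoint score on the baseline score and the demographics
fit_ancova <- lm(endpoint ~ baseline + age + gender + educ, data = norm_wide)

# Note that lm() silently drops participants with a missing baseline or endpoint
# score (complete-case analysis), one of the limitations discussed below.
```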

Motivating example

To illustrate the problems with the existing normative methods and to exemplify the newly proposed methods (see below), data from the Maastricht Aging Study (MAAS) are used. The MAAS is a longitudinal research project into the determinants of cognitive aging (Jolles, Houx, Van Boxtel, & Ponds, 1995). The MAAS baseline measurement took place between 1993 and 1996, and three follow-up measurements were conducted (3, 6, and 12 years after baseline). All participants were thoroughly screened for medical pathology that could interfere with normal cognition, such as dementia or cerebrovascular disease.

The MAAS participants were administered an extensive battery of cognitive and medical tests. In the present article, we will focus on the data of the Stroop Color Word Test (SCWT; Stroop, 1935). The SCWT is a well-known cognitive paradigm that is used to assess inhibition and other components of executive functioning (Lezak et al., 2004; Moering, Schinka, Mortimer, & Graves, 2004). The test consists of three subtasks. The first subtask shows color words in random order (red, blue, yellow, green) that are printed in black ink. The second subtask displays solid color patches in one of these four basic colors. The third subtask contains color words that are printed in an incongruous ink color (e.g., the word “red” printed in yellow ink). The participants are instructed to read the words, name the colors, and name the ink color of the printed words as quickly and as accurately as possible in the three subsequent subtasks. The SCWT outcome variable of interest is the difference between the time that is needed to complete subtask 3 and the average time that is needed to complete the first two subtasks [i.e., SCWT score = time in seconds for subtask 3 − (time in seconds for subtask 1 + time in seconds for subtask 2) / 2]. Higher SCWT scores are thus indicative of worse test performance.

In the MAAS, the SCWT was administered to N = 887, N = 696, N = 614, and N = 454 participants at the subsequent measurement occasions. Missingness in the responses was thus substantial. Basic demographic data for the sample at baseline and at the three follow-up measurement occasions are provided in Table 1. Level of Education (LE) was categorized into three levels using a classification scheme that is often used in the Netherlands (De Bie, 1987), with low = at most primary education, average = at most junior vocational training, and high = senior vocational or academic training. More details regarding the SCWT and the sample frame, participant recruitment, stratification criteria, and other aspects of the MAAS can be found elsewhere (Jolles et al., 1995; Van der Elst, 2006).

Table 1 Demographic characteristics of the participants who were administered the Stroop Color Word Test at the different measurement occasions

Limitations of the existing normative methods

Suppose that the regression-based change method or the ANCOVA approach were used to establish normative data for serial SCWT administration (as based on the MAAS data). This would have two major drawbacks.

First, the ANCOVA and the regression-based change approaches cannot handle missing data appropriately. Both methods simply discard incomplete cases from the analyses, but a complete case analysis is unbiased only when the responses are Missing Completely At Random (MCAR; Little & Rubin, 1987; Rubin, 1976), and even then it is usually inefficient (Verbeke & Molenberghs, 2000). MCAR means that the probability of an observation being missing is independent of both the observed and the unobserved responses. The MCAR assumption is not realistic in most serial testing settings. For example, the probability that a participant drops out of the MAAS is strongly affected by his or her baseline cognitive test score (Van Beijsterveldt et al., 2002), and thus the MCAR assumption is not valid.

Second, the regression-based change and the ANCOVA methods can use the data of at most two measurement occasions. In the MAAS, the SCWT was administered four times. The application of the regression-based change or the ANCOVA approach would thus result in a substantial loss of information and, consequently, a lowered precision of the parameter estimates and a loss of power (Verbeke & Molenberghs, 2000). It might be argued that, in the ANCOVA method, the endpoint score could be regressed on the test scores of multiple earlier testing occasions (rather than on a single one), but this is generally not feasible because the test scores at subsequent measurement occasions are highly correlated and, thus, collinearity issues would arise.

Alternative normative methods: Multivariate regression, the standard linear mixed model, and the linear mixed model combined with multiple imputation

As was noted in the previous sections, normative data for serial cognitive assessment should take the testing history and the demographic characteristics of a patient into account, but it is not clear which statistical method is optimal to achieve this aim. The existing methods (i.e., the regression-based change and the ANCOVA methods) are fundamentally flawed. Applying these methods to the SCWT data (from the MAAS) would lead to a substantial loss of information and biased results. What we need are (1) methods that can deal with two or more correlated responses (within individuals) and (2) methods that can handle missing data appropriately.

On the basis of these criteria, the use of the multivariate regression model, the standard linear mixed model, and the linear mixed model with multiple imputation is proposed in the present article. These methods are described in the next sections.

The multivariate regression model

The multivariate regression model assumes that \( Y_i = X_i\beta + \varepsilon_i \), with Y i = the vector of the repeated measurements for subject i (1 ≤ i ≤ N, with N = the number of subjects), X i = the design matrix, β = the vector of the regression parameters, and ε i = the vector of the error components. It is assumed that \( \varepsilon_i \sim N(0, \Sigma) \), with 0 = a zero vector and Σ = a general (unstructured) variance–covariance matrix of the residuals (for details on the model, see Johnson & Wichern, 2007).

In contrast to the classical (or univariate) linear regression model, the multivariate regression model can handle vectors of repeated observations for individuals. The parameter estimates in the multivariate regression model are based on likelihood methods, which allow for using all available outcomes in the calculations (Molenberghs & Kenward, 2007). Moreover, the use of likelihood-based methods has the advantage that inferences can be based on the observed likelihood given a model that does not include a distribution for the missing data mechanism (Little & Zhang, 2011; Molenberghs & Verbeke, 2005; Verbeke & Molenberghs, 2000). These so-called ignorable analyses require that the missingness mechanism is Missing At Random (MAR; i.e., the probability of an observation being missing is independent of the unobserved outcomes conditional on the observed data) or MCAR (as defined above) when likelihood or Bayesian inferences are chosen, though this assumption can be relaxed in the context of normative analyses (see the Discussion section). Note that the parameter estimates in a multivariate regression model can also be based on ordinary least squares methods (rather than on likelihood-based methods), but this situation will not be considered here because it largely suffers from the same drawbacks as the regression-based change and the ANCOVA methods.
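Although the analyses in this article were run in R and SAS, no code is shown. One possible way to fit such a likelihood-based multivariate regression model (unstructured residual covariance, long-format data) is sketched below with nlme::gls; the data frame norm_long and its column names are assumptions, and the mean structure is deliberately simplified.

```r
library(nlme)

# Hypothetical sketch: a multivariate regression model with an unstructured residual
# variance-covariance matrix, fitted by maximum likelihood on long-format data
# (one row per subject x occasion). 'occasion' is the measurement occasion (1-4).
norm_long$occ_f <- factor(norm_long$occasion)

mvreg <- gls(log_scwt ~ age_c + I(age_c^2) + gender + le_low + le_high + occ_f,
             data        = norm_long,
             correlation = corSymm(form = ~ occasion | id),   # unstructured correlations
             weights     = varIdent(form = ~ 1 | occ_f),      # occasion-specific variances
             method      = "ML",
             na.action   = na.omit)   # rows with a missing outcome are dropped, but all
                                      # remaining occasions of each subject are still used
summary(mvreg)
```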

The standard linear mixed model

The random-effects approach toward extending the classical linear regression model to a longitudinal setting is based on the assumption that the responses of a participant can be appropriately modeled on the basis of a linear regression model in which subject-specific regression coefficients are used (Verbeke & Molenberghs, 2000). In particular, the standard Linear Mixed Model (LMM) assumes that \( Y_i = X_i\beta + Z_i b_i + \varepsilon_i \), with Y i = the vector of the repeated measurements for subject i (1 ≤ i ≤ N, with N = the number of subjects), X i = the design matrix for the fixed effects (i.e., the population-averaged parameters), β = the vector of regression coefficients, Z i = the design matrix for the subject-specific effects (capturing how individuals deviate from the population average, where the population is understood as any subject with the same fixed-effect design), b i = the vector of the random effects, and ε i = the vector of the residual components. Because the participants in a study are a random sample of a larger population, it is natural to assume that b i ~ N(0, D), where 0 is a zero vector and D is a general variance–covariance matrix. It is furthermore assumed that \( \varepsilon_i \sim N(0, \Sigma_i) \), where Σ i is a variance–covariance matrix (which was chosen to be equal to \( \sigma^2 I_{n_i} \) in the present study, with \( I_{n_i} \) = an identity matrix of dimension n i). The random components b 1, …, b N, ε 1, …, ε N are assumed to be independent. For more details on estimation and inference for the marginal model and the variance components in an LMM, the reader is referred to Verbeke and Molenberghs (2000).

As compared with the multivariate regression model, the standard LMM has the additional advantage that both fixed and random effects can be included in the model. Random effects are not of substantive interest in normative studies (the focus is on the marginal evolutions—i.e., on the fixed effects), but it is nevertheless useful to model the covariance structure adequately, because this generally leads to more efficient inferences for the fixed effects (i.e., smaller standard errors; Verbeke & Molenberghs, 2000). This is particularly important in the case of unbalanced data—that is, when different subjects provide different numbers of outcome values (either by design or because of missingness in the data). As was also the case for the multivariate regression model, the standard LMM has the advantage that it can take the uncertainty of dealing with missing values into account in the analyses (Verbeke & Molenberghs, 2000).
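A minimal sketch of such a model with nlme::lme is given below (a random intercept and random slopes for centered time and time², with residual covariance σ²I as described above); all object and variable names are assumptions rather than the authors' actual code.

```r
library(nlme)

# Hypothetical sketch of a standard LMM: a subject-specific random intercept and
# random slopes for (centered) time and time^2, with residual covariance sigma^2 * I.
# Data frame and variable names are assumptions.
lmm <- lme(fixed     = log_scwt ~ age_c + I(age_c^2) + gender + le_low + le_high + time_c,
           random    = ~ time_c + I(time_c^2) | id,   # b_i ~ N(0, D), D unstructured
           data      = norm_long,
           method    = "ML",
           na.action = na.omit)   # all available measurements of each subject are used
summary(lmm)
fixef(lmm)   # the fixed effects (the quantities of interest for normative data)
```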

The linear mixed model combined with multiple imputation

In the LMM combined with Multiple Imputation (LMM with MI) approach, the MI algorithm is first applied to fill in the missing observations in the data set. The key idea of MI is to replace each missing value with M plausible values (Rubin, 1996). Each value can be seen as a Bayesian draw from the conditional distribution of the unobserved responses given the observed ones (Beunckens, Molenberghs, & Kenward, 2005; Little & Rubin, 1987).

To fix ideas on this method, let us consider a problem in which we have two unknown parameters γ1 and γ2 and a data set y. In a Bayesian context, these have a joint posterior distribution \( f(\gamma_1, \gamma_2 \mid y) \). Suppose that parameter γ2 is of interest and that γ1 is a nuisance parameter (i.e., a parameter that is not of substantive interest but that has to be accounted for in the analysis). The posterior can be partitioned as \( f(\gamma_1, \gamma_2 \mid y) = f(\gamma_1 \mid y)\, f(\gamma_2 \mid \gamma_1, y) \), so it follows that the marginal posterior for γ2 can be expressed as \( f(\gamma_2 \mid y) = E_{\gamma_1}\left[ f(\gamma_2 \mid \gamma_1, y) \right] \), with \( E_{\gamma_1} \) = the expectation over the distribution of γ1. The posterior mean and variance for γ2 equal \( E(\gamma_2 \mid y) = E_{\gamma_1}\left[ E_{\gamma_2}(\gamma_2 \mid \gamma_1, y) \right] \) and \( \operatorname{var}(\gamma_2 \mid y) = E_{\gamma_1}\left[ \operatorname{var}_{\gamma_2}(\gamma_2 \mid \gamma_1, y) \right] + \operatorname{var}_{\gamma_1}\left[ E_{\gamma_2}(\gamma_2 \mid \gamma_1, y) \right] \), with \( \operatorname{var}_{\gamma_2} \) = the variance computed over the distribution of γ2. These quantities can be approximated by way of empirical moments. Let \( \gamma_1^m \) be draws from the marginal posterior distribution of γ1, for m = 1, . . . , M. It holds that \( E(\gamma_2 \mid y) \cong \frac{1}{M}\sum\nolimits_{m=1}^{M} E_{\gamma_2}(\gamma_2 \mid \gamma_1^m, y) \). Defining the right-hand side of this equation as \( \tilde{\gamma}_2 \), it furthermore holds that \( \operatorname{var}(\gamma_2 \mid y) \cong \frac{1}{M}\sum\nolimits_{m=1}^{M} \operatorname{var}_{\gamma_2}(\gamma_2 \mid \gamma_1^m, y) + \frac{1}{M-1}\sum\nolimits_{m=1}^{M}\left( E_{\gamma_2}(\gamma_2 \mid \gamma_1^m, y) - \tilde{\gamma}_2 \right)^2 \). In the MI procedure, these formulas are generalized to vector-valued parameters, with γ2 representing the parameters of the substantive model and γ1 representing the missing data. In sufficiently large samples, the conditional posterior moments for γ2 can be approximated by maximum likelihood estimators from the completed data set. The MI estimates of the parameters of interest and their variance approximate the first two moments of the posterior distribution in a fully Bayesian analysis (Kenward & Carpenter, 2007).

After imputing the missing values M times using Bayesian draws, an LMM analysis is conducted on each of the completed data sets (or any analysis of interest in a context other than the one considered here), and the different inferences are subsequently combined into a single one (for more details on MI, see chap. 9 of Molenberghs & Kenward, 2007).
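The sketch below illustrates this pipeline in R. Note that the application section of this article used a Markov chain Monte Carlo imputation method; here the mice package (chained equations with Bayesian linear regression imputation) is substituted purely for illustration, and the data frame norm_wide and all column names are assumptions.

```r
library(mice)
library(nlme)

# Hypothetical sketch: impute the missing log(SCWT) scores in wide format, fit an
# LMM on each completed data set, and pool the fixed effects with Rubin's rules.
M    <- 10
pred <- make.predictorMatrix(norm_wide)
pred[, "id"] <- 0                        # do not use the subject identifier as a predictor
imp  <- mice(norm_wide, m = M, method = "norm", predictorMatrix = pred, seed = 123)

fits <- vector("list", M)
for (m in seq_len(M)) {
  wide_m <- complete(imp, m)
  long_m <- reshape(wide_m, direction = "long",
                    varying = c("log_scwt0", "log_scwt3", "log_scwt6", "log_scwt12"),
                    v.names = "log_scwt", timevar = "time",
                    times = c(0, 3, 6, 12), idvar = "id")
  long_m$time_c <- long_m$time - 5.25
  fits[[m]] <- lme(log_scwt ~ age_c + I(age_c^2) + gender + le_low + le_high +
                     time_c + age_c:time_c,
                   random = ~ time_c + I(time_c^2) | id,
                   data = long_m, method = "ML")
}

# Rubin's rules for the fixed effects
est  <- sapply(fits, fixef)                      # p x M matrix of estimates
U    <- sapply(fits, function(f) diag(vcov(f)))  # within-imputation variances
Qbar <- rowMeans(est)                            # pooled estimates
Ubar <- rowMeans(U)                              # average within-imputation variance
B    <- apply(est, 1, var)                       # between-imputation variance
Tvar <- Ubar + (1 + 1/M) * B                     # total variance (Rubin, 1996)
```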

As was also the case with the multivariate regression model and the standard LMM, the LMM with MI takes the uncertainty of dealing with missing values into account in the analyses (Rubin, 1996; Verbeke & Molenberghs, 2000). This is to be contrasted with so-called simple imputation methods, in which each missing response is substituted by a single value (Molenberghs & Verbeke, 2005). The standard LMM and the LMM with MI methods are largely equivalent (provided that the imputation model includes all relationships that will be considered in the analyses and the inference tasks; Molenberghs & Kenward, 2007), but the MI method allows for some additional flexibility in dealing with complex data sets. That is, MI can be used to deal with missing covariate values and missing responses at the same time (Molenberghs & Kenward, 2007; Molenberghs & Verbeke, 2005).

Application to the motivating example

In this section, we will illustrate the use of the multivariate regression, standard LMM, and LMM with MI methods to establish norms for serial SCWT administration (as based on the MAAS data described earlier). All analyses were conducted with R 2.14.0 for OS X and SAS v9.2 for Windows. An α-level of .05 was used.

The multivariate regression model

The initial multivariate regression model included the vector of the log(SCWT) scores as the outcome and age, age², gender, LE low, LE high, time, and time² as the covariates. The SCWT score was log-transformed because preliminary analyses showed that the residuals were positively skewed. Age was centered (age = calendar age in years − 65) prior to the computation of the quadratic age effect (to avoid multicollinearity; Kutner et al., 2005). Gender was coded as 1 = male and 0 = female. The three levels of education (LEs) were coded with two dummies—that is, LE low, 1 = at most primary education and 0 = otherwise; and LE high, 1 = senior vocational or academic training and 0 = otherwise. Time was dummy coded using three dummies, with the baseline measurement as the reference category. In addition to the main effects, the age × time, age × time², age² × time, age² × time², LE low × time, LE high × time, LE low × time², and LE high × time² interaction terms were included in the mean structure of the initial model. This was done because previous studies have suggested that older age and lower LEs are associated with a more pronounced cognitive decline over time (see, e.g., Salthouse, 1996; Schmand, Smit, Geerlings, & Lindeboom, 1997; Stern, 2003; Van der Elst et al., 2006d).

The initial model had a −2 log-likelihood (−2ℓ) value that equaled 648.2 (see model 1 in Table 2). To obtain the most parsimonious model, it was first evaluated whether the mean structure of the initial model could be simplified by removing interaction and main-effect terms. Likelihood ratio tests suggested that the model fit did not significantly deteriorate when the LE × time and the age² × time interaction terms were removed from the model (all ps > .05; see models 2 and 3 in Table 2). Next, it was evaluated whether the mean structure could be simplified by assuming linear and quadratic effects of time (instead of using dummies to model the effects of time). In these models, time was centered (time = time since baseline in years − 5.25) prior to the computation of the quadratic terms (to avoid multicollinearity). The likelihood ratio tests suggested that the models in which linear (model 4) and quadratic (model 5) time effects were assumed both adequately fitted the data (using model 3 as the comparison model). The linear, rather than the quadratic, model was retained because it is more parsimonious.
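In R, such nested, ML-fitted models can be compared with a likelihood ratio test via anova(); the sketch below assumes that gls fits named model3 (dummy-coded time) and model4 (linear time trend) are available.

```r
# Hypothetical sketch: likelihood ratio test for two nested, ML-fitted models.
# The comparison is only valid when both models are fitted with method = "ML"
# (REML fits cannot be compared when their mean structures differ).
anova(model3, model4)   # reports the log-likelihoods, the LR statistic, and its p value
```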

Table 2 Likelihood ratio tests to evaluate the fit of a series of nested multivariate regression models

Next, it was evaluated whether age group-, gender-, or LE-specific covariance structures were needed. Age group was constructed on the basis of a median split of the continuous variable age (i.e., younger = ≤65 years at baseline, older = >65 years at baseline). Smoothed (loess) average trends of the squared ordinary least squares residuals \( \left[\sigma^2(t) = E\left\{\left(Y(t) - \widehat{\mu}(t)\right)^2\right\}\right] \) were plotted for the different subgroups (using age, age², gender, LE low, LE high, time, and age × time as covariates in the model). As is shown in Fig. 1, the residual variances for older people tended to be higher than those for younger people at most of the measurement moments. A separate residual variance–covariance matrix was thus fitted for older and younger people (model 6), and this model indeed had a significantly better fit to the data than did model 4 (see Table 2). The variance functions for males and females and for people with a low, average, and high LE were similar (figures not shown), suggesting that gender- and LE-specific covariance structures are not needed.

Fig. 1 Scatterplots of the squared residuals and smoothed variance functions for (a) younger participants (≤65 years at baseline) and (b) older participants (>65 years at baseline). Note that only data points with squared residual values below 0.5 are presented (to depict the variance functions more clearly)
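A diagnostic of this kind could be produced as follows (a sketch; the OLS mean model follows the covariates listed above, and the data frame and variable names are assumptions):

```r
library(ggplot2)

# Hypothetical sketch: squared OLS residuals versus time, with a loess smoother,
# plotted separately for the younger and older age groups.
ols <- lm(log_scwt ~ age_c + I(age_c^2) + gender + le_low + le_high + time + age_c:time,
          data = norm_long, na.action = na.exclude)
norm_long$res2 <- residuals(ols)^2   # NA for missing occasions (na.exclude keeps alignment)

ggplot(subset(norm_long, !is.na(res2) & res2 < 0.5),
       aes(x = time, y = res2)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ age_group) +
  labs(x = "Time since baseline (years)", y = "Squared OLS residual")
```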

The most parsimonious multivariate regression model that still adequately fitted the data was thus model 6. The parameter estimates for this model are provided in Table 3a. As is shown, males and lower-educated participants had significantly higher log(SCWT) scores at all measurement moments. There was a significant time × age interaction, which suggested that the increase in the log(SCWT) scores over time was more pronounced for people who were older at baseline. The interaction is graphically depicted in Fig. 2a for 50-, 65-, and 80-year-old females with an average LE (note that the shape of these plots is identical for males and for people with a low or a high educational level—i.e., the predicted log(SCWT) values are the same up to a constant). There was also a small (but significant) effect of age².

Table 3 The final multivariate regression model (a), standard linear mixed model (b), and linear mixed model with multiple imputation (c)
Fig. 2 Predicted marginal evolutions of the log(SCWT) values over time as based on (a) the multivariate regression model, (b) the standard LMM, and (c) the LMM with MI. The plots show the predicted values for 50-, 65-, and 80-year-old females (at baseline) with an average level of education. SCWT = Stroop Color Word Test, LMM = linear mixed model, and MI = multiple imputation

The standard linear mixed model

The preliminary mean structure of the initial standard LMM was identical to the mean structure of the initial multivariate regression model (see above). A random intercept and two random slopes (for time and time²) were included in the preliminary covariance structure (unstructured type). We first evaluated whether the random effects were all needed in the model, by removing one random effect after the other in a hierarchical way. Note that these tests cannot be conducted by using classical likelihood ratio procedures. Instead, a mixture of two χ² distributions should be used (with equal weights of 0.5; Verbeke & Molenberghs, 2000). The p values for all of the differences in −2 log-likelihood were significant (all ps < .05; data not shown), indicating that the covariance structure could not be simplified by deleting random effects from the model.
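For instance, the p value for dropping the random time² slope (going from three to two correlated random effects) could be computed as sketched below; lmm_full and lmm_reduced are assumed to be ML fits of the models with and without that random effect.

```r
# Hypothetical sketch: mixture chi-square p value for testing a random effect.
# Dropping the random time^2 slope removes one variance and two covariances, so the
# reference distribution is a 50:50 mixture of chi-square(2) and chi-square(3).
lrt   <- as.numeric(2 * (logLik(lmm_full) - logLik(lmm_reduced)))  # difference in -2 log-likelihood
p_mix <- 0.5 * pchisq(lrt, df = 2, lower.tail = FALSE) +
         0.5 * pchisq(lrt, df = 3, lower.tail = FALSE)
p_mix
```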

Next, the nonsignificant fixed-effect terms were removed from the model (one after the other, in a hierarchical way) to obtain a more parsimonious mean structure. This procedure yielded a model that included age, age², gender, LE low, LE high, time, and the age × time interaction as the covariates (see models 2 to 7 in Table 4). Finally, age group-specific (model 8) and age group × time-specific (model 9) covariance structures were fitted. The difference between models 8 and 9 is that model 8 assumes that the differences in the estimated residual variances for the two age groups remain constant as a function of time (i.e., parallel variance functions for older and younger people), while model 9 does not make this assumption. As is shown in Table 4, model 9 had the best fit to the data.

Table 4 Likelihood ratio tests to evaluate the fit of a series of nested standard linear mixed models

The parameter estimates for the final standard LMM (model 9) are presented in Table 3b. In agreement with the results of the multivariate regression model, being male and having a lower LE were associated with higher log(SCWT) scores at all measurement moments. There was again a significant age × time interaction, which suggested that the increase in the log(SCWT) scores over time was more pronounced for people who were older at baseline (see Fig. 2b). The effect of age² was small but significant.

The linear mixed model with multiple imputation

The Markov chain Monte Carlo method was used in the MI process to replace the missing values by 10 different imputations (Little & Rubin, 1987; Molenberghs & Verbeke, 2005; Rubin, 1996). The imputation model included the log(SCWT) scores at the different measurement moments and the covariates (i.e., age, gender, LE low, and LE high). The final standard LMM (see Table 3b) was fitted in each of the 10 “complete” data sets, and the 10 inferences were combined into a single one. The r statistic was computed to quantify the portion of the uncertainty that stems from the incompleteness of the data—that is, \( r = \frac{(1 + M^{-1})B}{\overline{U}} \) (where M = the number of imputations, B = the between-imputation variance, and \( \overline{U} \) = the within-imputation variance; Schafer, 1999). The final LMM with MI model is presented in Table 3c. The age × time and time parameters had the highest r values (i.e., 2.28 and 1.28, respectively). The r values for the other covariates were substantially lower and ranged from 0.14 to 0.48. In agreement with the results of the multivariate regression model and the standard LMM, there was a significant age × time interaction, which suggested that the increase in the log(SCWT) scores over time was more pronounced for people who were older at baseline (see Fig. 2c). Being male and having a lower LE were associated with higher log(SCWT) scores at all measurement moments. The effect of age² was again small but significant.
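Continuing the pooling sketch given earlier (where Ubar, B, and M denoted the within-imputation variances, the between-imputation variances, and the number of imputations), the r statistic could be computed per fixed effect as follows:

```r
# Hypothetical sketch: relative increase in variance due to missingness, per fixed effect
r <- (1 + 1/M) * B / Ubar
round(r, 2)
```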

Obtaining normative data

Analogously to the classical regression-based normative approach that is used in nonserial (i.e., single-measurement) testing situations (see the Introduction), three steps are needed to convert a future patient’s log(SCWT) scores into percentile values. First, the expected log(SCWT) scores of patient j at time t (i.e., \( \widehat{Y}_{tj} \)) are computed. Time t refers to the number of years since baseline. These calculations are based on the parameter estimates of the fixed effects that were provided in Table 3.

Second, the differences between the actually observed log(SCWT) scores of patient j at time t and the corresponding expected test scores are computed [i.e., \( e_{tj} = -\left(Y_{tj} - \widehat{Y}_{tj}\right) \)] and standardized [i.e., \( z_{tj} = e_{tj} / SD(e_{tg}) \)]. Note that the sign of the residuals is reversed here because a higher SCWT score is indicative of worse test performance. The \( SD(e_{tg}) \) values are the standard deviations of the residuals at time t for a person of age group g (younger, ≤65 years at baseline; older, >65 years at baseline) in the normative sample. These values are presented in Table 5.

Table 5 SD(e) values at time t (in years since baseline) for a person of age group g (younger, ≤65 years at baseline; older, >65 years at baseline), as based on the final multivariate regression model (left), the final standard linear mixed model (middle), and the final linear mixed model with multiple imputation (right)

Third, the standardized residuals (i.e., z tj ) are converted into percentile values. Histograms and QQ-plots suggested that the standardized residuals for the different models at all measurement moments were normally distributed in the MAAS (figures not shown), and Kolmogorov–Smirnov tests supported this conclusion (all p-values > .098). The standardized residuals can thus be converted into percentile values by means of the standard normal distribution.

An example

Suppose that a 75-year-old average educated woman with mild cognitive impairment is monitored over time. The patient was administered the SCWT at a baseline moment and 3, 6, and 12 years later. At the subsequent measurement occasions, she obtained SCWT test scores that equalled 80, 85, 90, and 100. The patient’s log(SCWT0), log(SCWT3), log(SCWT6), and log(SCWT12) scores thus equalled 4.382, 4.442, 4.500, and 4.605, respectively.

The clinician uses the LMM with MI approach to evaluate the patient's test performance. This requires three steps. First, the expected log(SCWT0) test score is computed as based on Table 3c—that is, 4.0405 [= 3.878 + (10 × 0.026) + (10² × 0.0006) + (−5.25 × 0.019) + ((10 × −5.25) × 0.0011)]. Second, the standardized residual is computed (as based on Table 5)—that is, −1.035 [= −(4.382 − 4.0405)/0.33]. Third, the standardized residual is converted into a percentile value by means of the standard normal distribution. A standardized residual that equals −1.035 corresponds to a percentile value of 15. Thus, 15 % of the population of 75-year-old cognitively healthy females with an average level of education obtain a log(SCWT0) score that is equal to or higher than the score that was obtained by this woman. Using the same three-step procedure, the patient's log(SCWT3), log(SCWT6), and log(SCWT12) scores were normed. This yielded percentile values of 20, 24, and 32, respectively. Thus, the SCWT test performance of the patient is within normal limits at all the measurement moments.
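The arithmetic of this example can be checked with a few lines of R, using the (rounded) coefficient and SD(e) values quoted above:

```r
# Reproducing the worked example: 75-year-old, average-educated woman, baseline SCWT = 80
age_c  <- 75 - 65      # centered age
time_c <- 0 - 5.25     # centered time at baseline

expected <- 3.878 + age_c * 0.026 + age_c^2 * 0.0006 +
            time_c * 0.019 + age_c * time_c * 0.0011   # 4.0405
z <- -(log(80) - expected) / 0.33                      # -1.035 (sign reversed: higher = worse)
100 * pnorm(z)                                         # ~ 15th percentile
```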

User-friendly normative tables

A clinician can norm the test scores of a patient by performing the required computations by hand (as was illustrated in the previous paragraph), but this procedure is time-consuming and error-prone. To increase the user-friendliness of the normative data for clinical use, we established normative tables that present the raw SCWT scores that correspond to percentiles 5, 10, 25, 50, 75, 90, and 95, stratified by age (50, 55, . . . , 80 years), gender, and LE (the normative tables can be downloaded at http://home.deds.nl/~wimvde/). The use of the normative tables is straightforward. For example, Table 1 in the online document immediately shows that the SCWT0 score of 80 that was obtained by the 75-year-old average-educated woman of the previous example corresponds to a percentile value between 10 and 25. Note that the normative tables are based on the LMM with MI approach, because this method has some advantages over the other methods (see the Introduction and the Discussion section).

An automatic scoring program

The normative tables are easy to use but lack some accuracy, because (1) the tested patient’s age has to be rounded off if he or she is not exactly 50, 55, . . . , 80 years old and (2) only a limited number of percentile values can be presented in the normative tables (to limit their size to a convenient format). To maximize both the user-friendliness and the accuracy of the normative data, the normative conversion procedure was implemented in an Excel worksheet (which can be downloaded at http://home.deds.nl/~wimvde/). The use of the worksheet is straightforward: the clinician simply types in the age, gender, and LE of the tested patient, together with his or her raw SCWT scores at the different measurement moments, and the worksheet automatically computes the corresponding percentile values (on the basis of the LMM with MI approach).

Discussion

The multivariate regression model or the LMM?

The multivariate regression method is primarily useful when normative data have to be established for balanced data structures in which a relatively small number of repeated measurements are considered. Such data structures may arise in a longitudinal context, but they also occur in more general serial testing situations. For example, Rey’s Verbal Learning Test (Rey, 1958; Van der Elst et al., 2005) is a cognitive paradigm in which a sequence of 15 words is repeatedly presented to a participant on five subsequent learning trials. Suppose that we were interested in establishing normative data for these learning trial scores. In this situation, an (almost) perfectly balanced data structure would arise in the normative sample (i.e., fixed time points are used, and missingness would be minimal), and thus the multivariate regression method would provide an adequate tool for establishing the normative data.

In situations where normative data have to be established for highly unbalanced data structures, the LMM approach is the preferred method. For example, a limitation of the present study is that the normative SCWT data can be used only to evaluate the test performance of future patients who are administered the SCWT using (approximately) the same time intervals as the ones that were used in the MAAS. In many clinical settings, variable test–retest intervals are used. For example, a first patient may be tested today, 6 months later, and 5 years later, while a second patient may be tested today, 2 weeks later, and 6 weeks later (depending on the clinical profile of the patient). Suppose that we were interested in establishing a single set of normative data that allows for taking such variable test–retest intervals into account. This would, of course, require a normative sample in which variable test–retest intervals are used. For example, a study design may be used in which the lengths of the subsequent test–retest intervals are randomized per person (using some upper and lower limits of clinically relevant test–retest intervals—say, e.g., a value between 0 and 3 years). Thus, the first person in the normative sample may be tested at, for example, time points 0, 0.5, 4, 4.8, 6, 8.3, 10, and 12 years, the second person at time points 0, 2.8, 3, 4.2, 5, 6, 7.8, 9, 9.3, 10, and 11.2 years, and so on. The data structure in the normative sample would thus be highly unbalanced, but this is not a problem when the LMM is used. Only one minor modification to the normative procedure would be required. Indeed, the residual standard deviation values could no longer be computed for each time point separately but would have to be modeled as a continuous function of time (e.g., by taking the square root of the estimated variance \( \widehat{V}_i = Z_t\widehat{D}Z_t' + \widehat{\sigma}^2 \); Verbeke & Molenberghs, 2000).
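A sketch of this modification is given below, assuming an lme fit (named lmm) with a random intercept, random slopes for time and time², and a homoscedastic residual variance; all object and variable names are illustrative.

```r
library(nlme)

# Hypothetical sketch: model-implied SD of the residuals at an arbitrary time t,
# computed as sqrt(Z_t D_hat Z_t' + sigma_hat^2).
D_hat      <- as.matrix(getVarCov(lmm))   # estimated random-effects covariance matrix D
sigma2_hat <- lmm$sigma^2                 # estimated residual variance

sd_at_time <- function(t) {
  z_t <- c(1, t, t^2)                     # design row for random intercept, time, time^2
  sqrt(as.numeric(t(z_t) %*% D_hat %*% z_t) + sigma2_hat)
}
sd_at_time(2.5)   # e.g., the SD(e) value 2.5 years after baseline
                  # (use the centered value of t if time was centered in the model)
```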

The standard LMM or the LMM combined with MI?

The standard LMM and the LMM with MI methods are largely equivalent, provided that the imputation model includes all relationships that are considered in the analyses and the inference tasks (Molenberghs & Kenward, 2007; Molenberghs & Verbeke, 2005). Both methods have their advantages and disadvantages. The standard LMM has the advantage that it is easier to conduct and does not require a Monte Carlo component, but it has the disadvantage that it cannot handle missing covariate values. The LMM with MI, on the other hand, has the advantage that it can handle missing covariate values and missing responses simultaneously, but it has the disadvantage that it is more difficult to conduct and requires a Monte Carlo component.

In the present study, none of the covariate values were missing (because only easy-to-measure demographic covariates were considered), but several normative settings are conceivable in which substantial missingness in the covariate values could arise. For example, suppose that one were interested in establishing IQ-corrected (rather than demographically corrected) normative data for serial test administration (see, e.g., Rentz et al., 2004). Especially in older people, it is not always straightforward to obtain IQ estimates (because of the lengthy and cognitively demanding test procedures that are typically used in IQ tests; La Rue, 1992). Missingness would thus arise in both the covariate values (i.e., the IQ scores) and the responses (i.e., the scores on the cognitive test of interest). In such situations, the use of the LMM with MI method would have the substantial advantage that it allows for using all available data in the normative analyses (whereas in the standard LMM, the data for people who have missing covariate values would be discarded from the analyses).

Note that the same argument applies in the context of establishing normative data for nonserial testing situations; that is, MI can be used to deal with the missing covariate values, after which classical (i.e., univariate) regression analyses can be conducted on the different completed data sets (and the inferences are combined into a single one).

Is the missingness mechanism relevant when likelihood-based methods are used?

As was noted in the Introduction, ignorable methods assume that the missingness process that generates the missing responses is MCAR or MAR (when likelihood or Bayesian inferences are chosen). In the MAAS and in most other cognitive aging studies, the MCAR assumption is not valid (Van Beijsterveldt et al., 2002). Thus, the missingness mechanism is either MAR or MNAR (Missing Not At Random; i.e., the missingness depends on unobserved data). A definitive test of MAR versus MNAR is not possible (because every MNAR model can be exactly reproduced by a MAR counterpart; Molenberghs, Beunckens, Sotto, & Kenward, 2008), but Verbeke, Molenberghs, and Rizopoulos (2010) argued that ignorable analyses provide reasonably stable results even when the MAR assumption is violated. The reason for this is that such analyses constrain the behavior of the unobserved data to be similar to the behavior of the observed data (Verbeke et al., 2010), and this is exactly what we want in the context of normative analyses. For example, suppose that a MAAS participant dropped out of the study at the second follow-up measurement occasion because he or she developed dementia. The missingness would clearly be associated with the unobserved log(SCWT6) score (i.e., it would be MNAR), but this is not a problem, because the unknown log(SCWT6) score of the demented patient is not of interest. Indeed, in normative studies, we are interested only in the test scores of cognitively healthy participants. When likelihood-based methods are used, the “unobserved” log(SCWT6) and log(SCWT12) scores of the demented patient are modeled on the basis of the observed data of the patient at the previous measurement moments (at which the patient was still cognitively healthy) and on the basis of the observed data at all measurement moments in the normative sample. Since the observed data include only “cognitively healthy” individuals, appropriate estimates are obtained.

So, in the specific case of normative studies, the missingness mechanism is of less importance—at least, when appropriate likelihood-based methods are used. As was noted in the introduction, this is not the case when the regression-based change or the ANCOVA methods are used (i.e., the MCAR assumption is critical for obtaining unbiased norms when these methods are used).

No Reliable Change Indices?

Early attempts to deal with practice effects and establish norms for serial testing situations consisted of computing so-called Reliable Change Indices (RCIs) with correction for practice (Chelune, Naugle, Lüders, Sedlak, & Awad, 1993; see also Jacobson & Truax, 1991). The RCI method uses the overall mean change score and the overall SD(change score) in a normative sample to establish confidence intervals for change scores. By comparing the change score of a patient with these upper and lower boundaries, it can be evaluated whether the patient’s performance has changed significantly (i.e., declined or improved) over time.

We did not consider the RCI method in the present study, because it is merely a special case of the regression-based change method. Indeed, when the change score is not affected by any of the demographic covariates (in the normative sample), the final regression-based change model will include only the intercept (i.e., the overall mean change score), and the SD(e) value will be equal to the overall SD(change score). Thus, apart from the general problems that hamper the validity of the regression-based change method (see the Introduction), the RCI method has the additional limitation that it makes the (unrealistic) assumption that the change scores are not affected by any of the demographic covariates.

Some limitations of the proposed methods

The multivariate regression, LMM, and LMM with MI approaches have some substantial advantages over the regression-based change and the ANCOVA methods (see above), but they also require some considerations that are not needed when those simpler methods are used.

Some assumptions of the models

The (maximum-likelihood-based) multivariate regression and the LMM assume multivariate normality of the residuals. In addition, the LMM assumes normally distributed random effects. No straightforward procedures exist to formally test these assumptions (see Johnson & Wichern, 2007; Verbeke & Molenberghs, 2000), but this is not a severe problem. Indeed, it has been shown that the maximum likelihood estimators for the fixed effects (that are obtained under the assumption of normally distributed random effects) are consistent and asymptotically normally distributed even when the distributions of the random effects are not normal (Verbeke & Lesaffre, 1997). Similarly, the multivariate regression and LMM were shown to be robust against violations of the residual normality assumption (Jacqmin-Gadda, Sibillot, Proust, Molina, & Thiébaut, 2007).

The LMM also assumes that the variance–covariance matrix is correctly specified (to obtain unbiased estimates for the standard errors of \( \widehat{\beta} \)). For (sufficiently large) balanced and complete data sets, the problem of correctly specifying the covariance structure is less stringent because a robust variance estimator can be used (Liang & Zeger, 1986; Verbeke & Molenberghs, 2000). Such an estimator is consistent as long as the mean structure is correctly specified. In settings where there are missing data (which is almost always the case in serial cognitive-testing situations), a more careful reflection on the covariance structure may be warranted. Indeed, too simple a covariance structure (e.g., first-order autoregressive) could lead to bias in the mean model parameters, whereas a too complex covariance structure (e.g., unstructured) may lead to a loss of power (Molenberghs & Kenward, 2007). These issues mainly apply to situations where the sample size is small, because the loss in power that results from using an unstructured covariance matrix is negligible when the sample size is moderate to large. Since normative studies typically include moderate-to-large sample sizes (say, at least 200 people), an unstructured covariance matrix is the first choice. In normative studies with fewer participants, the specification of the covariance structure (and its impact on the established norms) should be evaluated more carefully.

Modeling time trends

The appropriateness of normative data that are established using the multivariate regression, LMM, and LMM with MI approaches depends—among other things—on the assumption that the evolution of the test scores over time is correctly modeled. Indeed, a misspecification of the model in terms of the assumed time effect (e.g., assuming a linear time trend while the true time evolution is quadratic or cubic) has an impact on the predicted test scores and the SD(e) values (both of which are used in the normative conversion procedure).

It is thus important to evaluate whether the assumed time effect corresponds sufficiently well to the actual time evolution. When only a small number of repeated measurements are collected, a straightforward approach is to compare the fit of a model in which time is dummy-coded (thus, a model in which no particular assumptions regarding the time evolution of the outcome are made) with the fit of a model in which a more specific time trend is assumed (e.g., a linear or a quadratic time effect). Since these models are nested, their relative fit can be formally compared by means of likelihood ratio tests (as we also did in the present study; see Table 2). When a larger number of repeated measurements are considered (which are possibly taken at different measurement occasions), it is often no longer feasible (or sensible) to dummy-code time. In this situation, the relative fit of a model in which the effect of time is captured by means of a high-degree polynomial can be compared with the fit of a simpler model in which a lower-degree polynomial is used. Again, a likelihood ratio test can be used to formally compare the relative fits of the different models.

It may also be useful to take a more practical perspective and informally evaluate the extent to which the established normative data are affected by the assumed time trend (as a form of sensitivity analysis). By means of illustration, we first converted the raw test scores of the participants in the normative sample into percentile values on the basis of the final LMM that was presented in Table 3b (in which a linear time effect was assumed). Next, the participants’ raw test scores were converted into percentile values on the basis of a second LMM that contained the same covariates as the first model but in which time was dummy-coded. Thus, two percentile values were obtained for each participant at each measurement moment (on the basis of models that made different assumptions regarding the time evolution). The results indicated that the maximum absolute differences between the percentile values that were obtained with both models equalled 1, 3, 1, and 1 units (at the subsequent measurement moments). Such small differences are probably not of clinical relevance, but if these differences are deemed to be too large to be acceptable in a clinical setting, one can still retain the model in which time was dummy-coded as the final model. Thus, considerations of a statistical (e.g., the results of likelihood ratio tests), as well as a practical (e.g., differences in the established norms), nature can be taken into account in the decision process on how the time trend should be optimally modeled.

General conclusion

At present, mainly the regression-based change and the ANCOVA approaches are used to establish normative data for serial cognitive assessment. These methods have the advantage that they are based on the classical (or univariate) linear regression model (which is well known to most behavioral researchers and straightforward to perform), but they have some major disadvantages (i.e., they can use the data of at most two measurement occasions, and they cannot deal with missing values in an appropriate way).

The multivariate regression, standard LMM, and LMM with MI approaches are not hampered by these problems. The multivariate regression model is primarily applicable when a small number of repeated measurements are taken at fixed time points. As compared with the multivariate regression model, the standard LMM and the LMM with MI approaches allow for a more adequate modeling of the covariance structure. The standard LMM and the LMM with MI are largely equivalent, because they are valid under the same assumptions and neither artificially decrease nor increase the amount of information available. The advantage of the standard LMM is that it is easier to conduct and that it does not require a Monte Carlo component. On the other hand, the LMM with MI has the advantage that it can flexibly deal with missing responses and missing covariates at the same time. When MI is used, it is important that all relationships between the covariates and responses to be studied in the scientific model of interest are included in the imputation model (to avoid “imputing under the null”; Molenberghs & Verbeke, 2005).

The different normative methods were applied to the SCWT data from the MAAS. The results showed that the log(SCWT) scores were significantly affected by age, age², time, gender, and LE. These covariates should thus be taken into account in the construction of the normative data. There was also a significant time × age interaction, which suggested that the increase in the log(SCWT) scores over time was more pronounced for older people (as compared with their younger counterparts). These results are in line with previous findings in the cognitive aging literature (Salthouse, 1996; Schmand et al., 1997; Stern, 2003; Van der Elst, 2006; Van der Elst et al., 2006d). To increase the user-friendliness of the normative SCWT data, normative tables and an automatic scoring program were provided (based on the results of the LMM with MI approach).