Introduction

Neuropsychological assessment guides diagnostics and treatment in a wide range of clinical conditions (e.g., traumatic brain injury, epilepsy, functional neurological disorder, attention deficit hyperactivity disorder, multiple sclerosis, or mild cognitive impairment). It is therefore important that neuropsychological test results accurately represent a patient's actual cognitive abilities. However, personal factors such as a lack of task engagement or malingering can invalidate a patient's test performance (Schroeder & Martin, 2022). When invalid performance is not properly identified, clinicians risk attributing abnormally low scores to cognitive impairment, potentially leading to misdiagnosis and ineffective or even harmful treatments (e.g., Roor et al., 2016; van der Heide et al., 2020). Consequently, performance invalidity is relevant not only to diagnostics but also to treatment efficacy (Roor et al., 2022).

Various tests are available for determining invalid performance on cognitive tests (for an overview, see Soble et al., 2022). Performance validity tests (PVTs) can be specifically designed to measure performance validity (i.e., stand-alone PVTs) or empirically derived from standard cognitive tests (i.e., embedded indicators). Overall, the psychometric properties of stand-alone PVTs have been found to be superior to those of embedded PVTs (Miele et al., 2012; Soble et al., 2022). For well-researched stand-alone PVTs, the meta-analysis of Sollman and Berry (2011) found an aggregated mean specificity of 0.90 and a mean sensitivity of 0.69. This finding is typical for stand-alone PVTs, for which empirical cutoff scores are chosen at a specificity of ≥ 0.90 to minimize the misclassification of valid cognitive test performance as invalid (i.e., a maximum 10% false-positive rate).

Importantly, sensitivity and specificity should never be interpreted in isolation from other clinical metrics such as base rates (Lange & Lippa, 2017). To determine the positive and negative predictive value of a PVT score, the base rate of the condition (here: performance invalidity) needs to be considered (Richards et al., 2015). Using Bayes' rule, the likelihood that PVT failure is indeed indicative of performance invalidity can be calculated from: 1) the base rate of invalid performance in the specific population of that individual; 2) the PVT score; 3) the sensitivity; and 4) the specificity of the utilized PVT (Dandachi-FitzGerald & Martin, 2022; Tiemens et al., 2020). Ignoring Bayes' rule potentially leads to overdiagnosis of invalid performance when the base rate of invalidity is low and to underdiagnosis when the base rate is high. Therefore, it is essential that base rate information is available for each PVT in a specific clinical context and, ideally, for specific clinical patient groups (Schroeder et al., 2021a).
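In formula form, with base rate b, sensitivity se, and specificity sp (a standard restatement of Bayes' rule; the notation here is ours rather than that of the cited sources):

```latex
\mathrm{PPV} = \frac{b \cdot se}{b \cdot se + (1 - b)(1 - sp)},
\qquad
\mathrm{NPV} = \frac{(1 - b) \cdot sp}{(1 - b) \cdot sp + b\,(1 - se)}
```

The same failed PVT thus carries different evidential weight in populations with different base rates of invalid performance.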

Early surveys amongst non-forensic clinical neuropsychologists reported an expectation that only 8% of general clinical referrals would produce invalid test results (Mittenberg et al., 2002). Over the last two decades, research on validity issues in clinical practice has increased significantly, and neuropsychologists have become more aware of the need to identify invalid test performance (Merten & Dandachi-FitzGerald, 2022; Sweet et al., 2021). These factors probably contributed to the nearly doubled median reported base rate of 15% across clinical contexts and settings in a more recent survey (Martin & Schroeder, 2020). However, research examining empirically derived base rates of invalidity in clinical settings has lagged behind.

To address this issue, McWhirter et al. (2020) undertook a systematic review to examine PVT failure in clinical populations. Their main finding was that PVT failure was common, with failure rates exceeding 25% for some PVTs and clinical groups. However, their study has been criticized on several grounds (Kemp & Kapur, 2020). First, McWhirter et al. (2020) did not distinguish between stand-alone and (psychometrically inferior) embedded PVTs. Second, they included studies that examined PVT failure in patients with dementia and intellectual disabilities, two groups in which PVT use is strongly discouraged due to unacceptably high false-positive rates at standard cutoffs (Larrabee et al., 2020; Lippa, 2018; Merten et al., 2007). Third, although studies in which ≥ 50% of the patient sample was involved in litigation or seeking welfare benefits were excluded, other types and lower rates of external gain incentives were not characterized. Therefore, external incentives that increase PVT failure rates in patients engaged in standard clinical evaluations (Schroeder et al., 2021b) may have contributed to the reported PVT failure rates. Finally, McWhirter et al. (2020) summarized data on PVT failure based upon a literature search and data extraction performed by a single author, without considering the quality of the included studies or calculating a weighted average to obtain a more precise estimate.

The current meta-analysis is designed to address these gaps and thereby improve the quality of reported PVT failure findings in clinical patient groups. The main aim of the present study is to provide comprehensive information regarding the base rate of PVT failure in order to facilitate its interpretation in clinical practice. We calculated pooled estimates of the base rate of PVT failure across clinical contexts, distinct clinical patient groups, and levels of potential external incentive, as well as per PVT.

Methods

Search Strategy

This meta-analysis was conducted in accordance with the updated Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines (PRISMA; Page et al., 2021). A review protocol was registered at inception on PROSPERO (ID: CRD42020164128). The protocol was slightly modified to further improve the quality of included studies: only stand-alone PVTs meeting the restrictive selection criteria of Sollman and Berry (2011) were included, one additional database was searched, and we chose to focus solely on the base rate of PVT failure without also addressing its impact on treatment outcome. Electronic databases (PubMed/MEDLINE, Web of Science, and PsycINFO) were comprehensively searched using multiple terms for performance validity and neuropsychological assessment (see Online Resource 1 for detailed search strategies). The final search was conducted on November 5, 2021.

Study Selection

All studies included in this systematic review and meta-analysis were performed in a clinical evaluation context with adult patients (18+ years of age), using standard (per-manual) administration procedures and cutoffs for the five stand-alone performance validity tests (PVTs) from Sollman and Berry (2011). These five PVTs are: the Word Memory Test (WMT; Green, 2003), the Medical Symptom Validity Test (MSVT; Green, 2004), the Test of Memory Malingering (TOMM; Tombaugh, 1996), the Victoria Symptom Validity Test (VSVT; Slick et al., 1997), and the Letter Memory Test (LMT; Inman et al., 1998). Based upon Grote et al. (2000), a higher cutoff was used for the hard items of the VSVT in patients with medically intractable epilepsy. All studies were original, peer-reviewed, and published in English. Studies were excluded if they examined PVT failure rates in a non-clinical context (i.e., a forensic/medico-legal context, or data generated for research purposes). Studies that only addressed PVT failure in a sample already selected upon initially passing/failing a PVT (typically a known-groups design) were likewise excluded. Studies performed on patients diagnosed with intellectual disability or dementia were excluded, as were studies with a small (sub)sample size (N < 20). Finally, we chose to exclude studies of Veterans/military personnel, since the distinction between clinical and forensic evaluations is difficult to make within the context of the Veterans Affairs (VA) system (Armistead-Jehle & Buican, 2012).

Unique patient samples were ensured by carefully screening for similar samples used across studies. When multiple studies examined the same patient sample, the study with the largest sample size was included or, when sample sizes were equal, the most recent paper.

Data Collection and Extraction

References resulting from the searches in PubMed/MEDLINE, Web of Science, and PsycINFO were imported into a reference manager (EndNote X8). After automatic duplicate removal, one of the investigators (JR) manually removed the remaining duplicate references. First, a single rater (JR) screened all titles and abstracts for broad suitability and eligibility. Doubtful references were discussed with a second rater (MP); if doubts remained, references were included for full-text scrutiny. Second, two independent raters (MP and JR) reviewed the remaining full texts against the inclusion and exclusion criteria described above, using the online systematic review tool Rayyan (Ouzzani et al., 2016). Interrater reliability was substantial (Cohen's κ = 0.63), with 89.83% agreement. A sizable number of studies failed to clearly state information needed to determine eligibility, which contributed to the suboptimal agreement between the two independent raters. Therefore, corresponding authors were contacted when additional information was required (e.g., regarding clinical context, utilized PVT cutoff, the number of subjects who were administered and failed a PVT, or the language/version of the utilized PVT). Non-responders were reminded twice, and if no author response was elicited, studies were excluded. Discrepancies were resolved by discussion with a third and fourth reviewer (BD and RP). Finally, one investigator (JR) extracted relevant information from the included full-text articles, such as setting, sample size, mean age, and utilized PVT(s), according to a standardized data collection form (see Online Resource 2).
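For reference, Cohen's κ corrects raw agreement for the agreement expected by chance, which is why substantial raw agreement (here 89.83%) can coexist with a more modest κ. A minimal sketch with hypothetical counts (not the actual rating table from this study):

```python
def cohens_kappa(both_include, both_exclude, only_rater_a, only_rater_b):
    """Cohen's kappa for two raters making include/exclude decisions."""
    total = both_include + both_exclude + only_rater_a + only_rater_b
    observed = (both_include + both_exclude) / total
    # Chance agreement from the raters' marginal include/exclude rates
    a_inc = (both_include + only_rater_a) / total
    b_inc = (both_include + only_rater_b) / total
    expected = a_inc * b_inc + (1 - a_inc) * (1 - b_inc)
    return (observed - expected) / (1 - expected)

# Hypothetical: 47 joint inclusions, 218 joint exclusions, 30 disagreements
print(f"{cohens_kappa(47, 218, 15, 15):.2f}")  # ~0.69 despite ~90% raw agreement
```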

Statistical Analyses

Statistical analysis was performed using MetaXL version 5.3 (www.epigear.com), a freely available add-in for meta-analysis in Microsoft Excel. Independence of effect sizes, a critical assumption in random-effects meta-analyses, was examined by checking how many studies administered multiple, potentially inter-correlated, PVTs to the same patient sample (Cheung, 2019). The frequencies of PVT failure from the individual studies were pooled using a double-arcsine transformation, and back-transformation was performed to report the pooled prevalence rates. This transformation was chosen to stabilize the variance in the analysis; the double-arcsine transformation has been shown to be preferable to the logit transformation or to no transformation when calculating pooled prevalence rates (Barendregt et al., 2013). All analyses were performed using the random-effects model, since it allows for between-study variation in PVT failure. Forest plots were used to visualize the pooled prevalence of PVT failure with 95% confidence intervals [CIs]. Where possible, subgroup analyses were performed to examine whether the base rate of PVT failure was related to specific clinical contexts, distinct patient groups, the utilized PVT, and the consideration of potential external gain. To further establish the generalizability of our findings, consistency across the included studies was assessed using Cochran's Q-test (Higgins et al., 2003), for which a p-value < 0.10 was considered to indicate statistically significant heterogeneity between studies. Because the number of included studies impacts the Q-test, we additionally evaluated the inconsistency index I² (Higgins & Thompson, 2002); an I² value over 75% was tentatively classified as a "high" degree of between-study variance (Higgins et al., 2003). Since I² is a relative measure of heterogeneity whose value depends on the precision of the included studies, we also calculated tau squared (τ²), which quantifies the variance of the true effect sizes underlying our data, with larger values suggesting greater between-study variance (Borenstein et al., 2017).
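The analyses themselves were run in MetaXL. Purely as an illustration of the computational steps described above, the following is a minimal Python sketch of double-arcsine (Freeman-Tukey) pooling under a DerSimonian-Laird random-effects model, including Cochran's Q, I², and τ². The function name and the simplified sin²(t/2) back-transformation are our own; MetaXL's exact implementation may differ.

```python
import numpy as np

def pooled_prevalence(events, n):
    """Pool proportions (e.g., PVT failures per study) with the
    Freeman-Tukey double-arcsine transform and DerSimonian-Laird
    random effects. Returns estimate, 95% CI, Q, I^2 (%), tau^2."""
    events, n = np.asarray(events, float), np.asarray(n, float)
    # Double-arcsine transform of each study proportion and its variance
    t = np.arcsin(np.sqrt(events / (n + 1))) + np.arcsin(np.sqrt((events + 1) / (n + 1)))
    v = 1.0 / (n + 0.5)
    # Fixed-effect weights give Cochran's Q, I^2, and the DL tau^2
    w = 1.0 / v
    t_fixed = np.sum(w * t) / np.sum(w)
    q = np.sum(w * (t - t_fixed) ** 2)
    df = len(t) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    # Random-effects pooling with tau^2 added to each study's variance
    w_re = 1.0 / (v + tau2)
    t_re = np.sum(w_re * t) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    # Approximate back-transformation to the proportion scale
    back = lambda x: np.sin(np.clip(x, 0.0, np.pi) / 2.0) ** 2
    return back(t_re), (back(t_re - 1.96 * se), back(t_re + 1.96 * se)), q, i2, tau2

# Hypothetical example: three studies with 12/80, 30/150, and 5/60 PVT failures
est, ci, q, i2, tau2 = pooled_prevalence([12, 30, 5], [80, 150, 60])
```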

Study Quality

An adapted version of the Prevalence Critical Appraisal Tool of the Joanna Briggs Institute (Munn et al., 2015) was used to rate the quality of all included studies. Amongst currently available tools, it addresses the most important items related to methodological quality when determining prevalence (Migliavaca et al., 2020). Three study quality domains were assessed: selection bias (items 1, 2, and 4), sample size/statistics (items 3 and 5), and attrition bias (item 6; see Online Resource 3 for a detailed description).

The Doi plot and LFK index are relatively new graphical and quantitative methods that were used to detect publication bias (Furuya-Kanamori et al., 2018); these analyses were also implemented in MetaXL. In contrast to the precision scatter plot used in the standard funnel plot, the Doi plot uses a quantile plot, which provides a better visual representation of normality (Wilk & Gnanadesikan, 1968). If the studies are unaffected by publication bias, the plot forms a symmetric inverted funnel with the Z-score closest to zero at its tip. The LFK index then quantifies the two areas under the Doi plot, and its interpretation is based on the a priori concern about positive or negative publication bias. Since we were concerned about possible positive publication bias, an LFK index > 1 was taken as consistent with positive publication bias. Even when the number of included studies is limited, the LFK index has better sensitivity than the more standard Egger's test (Furuya-Kanamori et al., 2018).

Results

Literature Search

Figure 1 gives an overview of the search and selection process. Of the 13,587 identified abstracts, 457 (3.4%) were included for full-text scrutiny. We contacted the first author of 37 studies for additional information, and 30 authors responded. This resulted in 47 observational studies of PVT failure in the clinical context, with a total sample size of n = 6,484.

Fig. 1
figure 1

PRISMA flow chart of study selection

Characterization of Included Studies

Table 1 reports study characteristics, including clinical context, clinical patient group, and sample size. Most studies were performed in a medical hospital (k = 25), with others in an epilepsy clinic (k = 7), psychiatric institute (k = 6), rehabilitation clinic (k = 4), and private practice (k = 2). Three studies (6.4%) did not specify the clinical context. In 15/47 (31.9%) of the studies, the prevalence of PVT failure was reported for heterogeneous patient samples. The majority of the studies (32/47; 68.1%) reported PVT failure rates for one or multiple diagnostic subgroups. The diagnostic (sub)groups consisted of patients with traumatic brain injury (TBI) in most studies (k = 10), followed by patients with epilepsy (k = 9), patients with psychogenic non-epileptic seizures (PNES; k = 5), patients seen for attention deficit hyperactivity disorder (ADHD) assessment (k = 4), patients with mild cognitive impairment (MCI; k = 4), patients with multiple sclerosis (MS; k = 2), and patients with Parkinson's disease (k = 2). The severity of TBI was not always specified or was poorly defined. The remaining diagnostic (sub)groups (i.e., sickle cell disease, Huntington's disease, substance use-related disorders (SUD), inpatients with depression, and memory complaints) were each examined in a single study. In more than half of the included studies (25/47; 53.2%), the language (proficiency) of the included patient sample was not reported. Potential external gain was not mentioned in 12/47 (25.5%) studies, and the remaining studies varied greatly in how they addressed its presence. Of these remaining 35 studies, only seven (i.e., 20%) specified how external gain was examined (Domen et al., 2020; Eichstaedt et al., 2014; Galioto et al., 2020; Grote et al., 2000; Rhoads et al., 2021a; Williamson et al., 2012; Wodushek & Domen, 2020). In the other 28/35 (80%) studies, how the authors examined this variable (e.g., by checking the patient's medical record, or querying patients about potential incentives present during the assessment procedure) was not specified. Moreover, in only 4/35 (11.4%) studies were subjects excluded when external gain incentives (e.g., a workers' compensation claim) were present (Dandachi-FitzGerald et al., 2020; Davis & Millis, 2014; Merten et al., 2007; Wodushek & Domen, 2020).

Table 1 Summary details for individual studies that reported the prevalence of PVT failure in clinical patients

The TOMM was the most frequently administered PVT (k = 18), followed by the WMT (k = 17), the MSVT (k = 9), the VSVT (k = 6), and the LMT (k = 1). Only 4/47 (8.5%) studies employed two PVTs (none used more than two PVTs that fulfilled the inclusion and exclusion criteria); the other 43/47 (91.5%) studies used one PVT. In two of the four studies reporting two PVTs, the same PVTs were not administered to all participants: Harrison et al. (2021) administered the MSVT to 648 patients and the WMT to 1810 patients, and Krishnan and Donders (2011) administered the TOMM to 39 patients and the WMT to 81 patients. Inclusion in these studies was based, amongst other criteria, upon failing one PVT. Furthermore, these studies did not report the number of subjects who were administered both PVTs. Therefore, it is unclear to what extent the reported PVT failure rates in these studies are influenced by potential dependence. In the two other studies reporting two PVTs (i.e., Cragar et al., 2006; Merten et al., 2007), all patients were administered both PVTs. The total number of subjects in these two studies that reported two likely dependent effect sizes was n = 76, or 1.2% of the total of n = 6,484 patients from all 47 studies. We therefore argue that the reported effect sizes from the 47 included studies are (largely) independent.

Methodological Quality Assessment

A summary of the methodological quality of the included studies for determining prevalence is provided in Online Resource 4. No study was rated as having high quality; all had limitations in at least one of the three prespecified domains (selection bias, attrition bias, and sample size/statistics). Most studies had a study sample that addressed the target population (k = 41, 87.2%), whereas only a minority described relevant assessment and patient characteristics (k = 15, 31.9%). The majority of the included studies failed to clearly state how patients were recruited (k = 27, 57.4%). Eleven studies (23.4%) had an inadequate response rate. The majority of the studies used appropriate statistical analyses (k = 41, 87.2%) but had inappropriate sample sizes (k = 39, 83.0%).

The shape of the Doi plot showed slight asymmetry (see Online Resource 5), and the LFK index (1.09) likewise indicated minor asymmetry, consistent with potential positive publication bias.

Base Rate of PVT Failure in Clinical Patients

The pooled prevalence of PVT failure across all included studies (k = 47) was 16%, 95% CI [14, 19]. Significant between-study heterogeneity and high between-study variability existed (Cochran's Q = 697.97, p < 0.001; I² = 91%; τ² = 0.08), as also reflected in the wide study-level 95% CIs (see Fig. 2). The high I² statistic indicates that the variation in reported PVT failure rates is likely a result of true heterogeneity rather than chance.

Fig. 2
figure 2

Forest plot of the 47 included studies estimating the pooled prevalence of PVT failure in the clinical setting. CI = confidence interval. Note: weights are from random-effects analysis

Subgroup Analyses based upon Clinically Relevant Characteristics

To facilitate the interpretation of PVT failure in clinical practice, subgroup analyses were performed for clinically relevant characteristics associated with performance validity (Table 2). It is important to emphasize that some of these findings are based upon relatively small numbers of studies (i.e., k = 2 or 4), potentially impacting the stability of the reported estimates.

Table 2 Pooled prevalence of PVT failure in clinical patients, stratified by false-positive scrutinization, clinical context, external gain incentives, clinical diagnosis, and PVT

False-Positive Scrutinization

Although we a priori excluded studies that examined PVT failure rates in patients with dementia or intellectual disability, the included studies might still comprise patient samples with other conditions, or combinations of characteristics, that make them highly susceptible to false-positive PVT failure classification. Therefore, and in line with clinical guidelines (Sweet et al., 2021), we first examined the included studies for the risk of unacceptably low specificity when applying standard PVT cutoffs, and two such patient samples were identified. First, PVT performance in the subsample of severely ill, mostly inpatient, schizophrenia spectrum patients from Gorissen et al. (2005) was significantly correlated with negative symptoms and general psychopathology. Second, the MCI subjects from Martins and Martins (2010) were of advanced age, Spanish speaking, and had the lowest formal schooling of all included studies (i.e., 71.4% had less than 6 years of formal education). These cultural/language factors in combination with low formal schooling are associated with unacceptably low specificity when applying standard PVT cutoffs (Robles et al., 2015; Ruiz et al., 2020). Exclusion of the schizophrenia subsample from Gorissen et al. (2005) and of the Martins and Martins (2010) study led to a pooled prevalence of PVT failure of 15% (95% CI [13, 18]; Cochran's Q = 573.73, p < 0.01; I² = 89%; τ² = 0.07). However, after exclusion of these patient samples, between-study heterogeneity and between-study variability remained high, as indicated by a significant Cochran's Q statistic and a high I² statistic. Further subgroup analyses were performed on the remaining studies (k = 46; see Table 1).

Clinical Context

The pooled prevalence of PVT failure was highest in the context of a private practice (27%, 95% CI [15, 40]; Cochran's Q = 2.98, p = 0.08; I² = 66%; τ² = 0.03), followed by the epilepsy clinic (19%, 95% CI [10, 29]; Cochran's Q = 128.07, p < 0.001; I² = 91%; τ² = 0.17), the psychiatric institute (15%, 95% CI [10, 21]; Cochran's Q = 92.30, p < 0.001; I² = 92%; τ² = 0.04), the rehabilitation clinic (13%, 95% CI [4, 25]; Cochran's Q = 31.07, p < 0.001; I² = 88%; τ² = 0.10), and the medical hospital (12%, 95% CI [10, 15]; Cochran's Q = 160.51, p < 0.001; I² = 81%; τ² = 0.05). As can be seen, heterogeneity of pooled PVT failure rates was significant, and between-study variability was moderately high to high, for all types of clinical context.

Clinical Diagnoses

The pooled prevalence of PVT failure was highest for patients with PNES (33%, 95% CI [24, 43]; Cochran's Q = 10.65, p = 0.06; I² = 53%; τ² = 0.03), followed by subjects seen for ADHD assessment (17%, 95% CI [11, 23]; Cochran's Q = 68.80, p < 0.01; I² = 94%; τ² = 0.03), (m)TBI (17%, 95% CI [10, 25]; Cochran's Q = 89.57, p < 0.01; I² = 89%; τ² = 0.09), MS (13%, 95% CI [9, 18]; Cochran's Q = 0.32, p = 0.57; I² = 0%; τ² = 0.00), epilepsy (11%, 95% CI [6, 16]; Cochran's Q = 42.21, p < 0.001; I² = 79%; τ² = 0.05), MCI (9%, 95% CI [4, 16]; Cochran's Q = 0.11, p = 0.95; I² = 0%; τ² = 0.00), and Parkinson's disease (6%, 95% CI [1, 15]; Cochran's Q = 1.81, p = 0.18; I² = 45%; τ² = 0.02). Based upon Cochran's Q, heterogeneity of pooled PVT failure rates was significant in patients with PNES, (m)TBI, and epilepsy, and in subjects seen for ADHD assessment, whereas heterogeneity was non-significant in patients with Parkinson's disease, MCI, and MS. Based upon the I² statistic, variability of base rate estimates of PVT failure was low in patients with MCI and MS and (moderately) high for the other diagnostic patient groups. This suggests that for studies in patients with MCI and MS, the pooled PVT failure rates are more homogeneous. However, since these calculations are based upon small numbers of studies, the findings should be interpreted with caution (von Hippel, 2015).

External Gain Incentives

In the four studies that excluded patients with potential external gain incentives from analysis, the pooled prevalence of PVT failure was as low as 10% (95% CI [5, 15]; Cochran's Q = 9.17, p = 0.10; I² = 45%; τ² = 0.02). For the 42 remaining studies that did not report actively excluding clinical patients with potential external gain incentives, the pooled prevalence of PVT failure was 16% (95% CI [13, 19]; Cochran's Q = 560.93, p < 0.001; I² = 90%; τ² = 0.07). Although heterogeneity of pooled PVT failure rates was present in both groups, inconsistency was lower in the studies that excluded patients with external gain incentives from analysis.

PVT

The pooled prevalence of PVT failure was highest for patients examined with the WMT (25%, 95% CI [19, 32]; Cochran's Q = 253.52, p < 0.001; I² = 93%; τ² = 0.10), followed by the (nv)MSVT (18%, 95% CI [13, 23]; Cochran's Q = 55.04, p < 0.001; I² = 85%; τ² = 0.03), the TOMM (9%, 95% CI [6, 12]; Cochran's Q = 103.35, p < 0.001; I² = 80%; τ² = 0.05), and the hard items of the VSVT (9%, 95% CI [7, 12]; Cochran's Q = 24.97, p < 0.001; I² = 64%; τ² = 0.01). Heterogeneity of pooled PVT failure rates was significant across studies examining the same PVT; between-study variability was moderately high for studies using the VSVT and high for studies using the other PVTs.

Discussion

This systematic review and meta-analysis examined the prevalence of PVT failure in the context of routine clinical care. Based on data extracted from 47 studies involving 6,484 patients seen for clinical assessment, the pooled prevalence of PVT failure was 16%, 95% CI [14, 19]. Excluding two patient samples in which application of standard PVT cutoffs would probably lead to false-positive classification resulted in a pooled PVT failure rate of 15%, 95% CI [13, 18]. This number corresponds with the median estimated base rate of invalid performance in clinical settings reported in a recent survey amongst 178 adult-focused neuropsychologists (Martin & Schroeder, 2020). Our empirical findings confirm PVT failure in a sizeable minority of patients seen for clinical neuropsychological assessment.

Another key finding is that reported PVT failure rates varied considerably amongst the included studies (i.e., 0–52.2%). This variability is likely due to (1) sample characteristics, such as clinical setting, clinical diagnosis, and potential external incentives, and (2) the sensitivity and specificity of the PVT used. Pooled PVT failure was highest (i.e., 27%, 95% CI [15, 40]) in patients seen in private practice, whereas pooled PVT failure rates for the other settings (i.e., epilepsy clinic, psychiatric institute, medical hospital, and rehabilitation clinic) varied between 12 and 19%. The Sabelli et al. (2021) study had the largest private practice sample (N = 326), consisting of relatively young mTBI patients referred for neuropsychological evaluation. Since only 2 of the 47 included studies were conducted in a private practice setting, the Sabelli et al. (2021) study, with a PVT failure rate of 31.9%, was a major contributor to the higher pooled PVT failure rate in this setting. Of interest, potential external incentives were not mentioned in that study. Therefore, potential external gain incentives, rather than the assessment context per se, may have driven the relatively high PVT failure rate.

Unsurprisingly, but now clearly objectified, studies that excluded patients with potential external gain incentives had a significantly lower pooled PVT failure rate than studies in which these subjects (potentially) remained in the analysis (i.e., 10%, 95% CI [5, 15] versus 16%, 95% CI [13, 19], respectively). However, although the presence of an external incentive is known to link directly to PVT failure in clinical evaluations (e.g., Schroeder et al., 2021b), a little over a quarter of the included studies failed to mention external gain incentives at all. Moreover, even when external gain incentives were known to be present, only a minority of studies excluded these subjects from further analyses.

Pooled PVT failure rates were highest for patients diagnosed with PNES (i.e., 33%, 95% CI [24, 43]), patients seen for ADHD assessment (i.e., 17%, 95% CI [11, 23]), and (m)TBI (i.e., 17%, 95% CI [10, 25]), with pooled PVT failure rates ranging between 6 and 13% for the other diagnostic groups (i.e., MS, epilepsy, MCI, and Parkinson's disease). These findings contrast with McWhirter et al. (2020), who reported that PVT failure rates in subjects with functional neurological disorders (such as PNES) were no higher than in MCI or epilepsy. Likely, our strict inclusion and exclusion criteria, the inclusion of only well-validated stand-alone PVTs, and the application of meta-analytic methods led to a more precise estimate of PVT failure across diagnostic groups. Our findings also indicate that pooled PVT failure rates for the MCI, MS, and Parkinson's disease diagnostic groups are more homogeneous than those for PNES, (m)TBI, and patients seen for ADHD assessment. The higher heterogeneity in the latter groups could indicate the presence of other factors that likely impact PVT failure, such as external gain incentives, variation in diagnostic criteria, and bias in patient selection. Finally, pooled failure rates varied across the utilized PVTs in line with their respective sensitivity/specificity ratios for correctly identifying invalid performance. The WMT is known for its relatively high sensitivity (Sollman & Berry, 2011), which likely explains its having the highest pooled failure rate amongst the examined stand-alone PVTs, whereas the lowest pooled failure rate, for the TOMM, is probably related to its high specificity (Martin et al., 2020).

Our findings indicate that, in addition to PVT psychometric properties (i.e., sensitivity and specificity), the clinical setting, the presence of external gain incentives, and the clinical diagnosis impact pooled PVT failure rates. Clinicians should therefore consider these factors when interpreting PVT results. Consider, for example, a well-researched stand-alone PVT with a sensitivity of 0.69 and a specificity of 0.90, administered to two different clinical patients. The first patient is diagnosed with epilepsy and wants to be approved to return to work (i.e., no external gain incentive for invalid performance). If this patient failed the PVT (base rate of PVT failure of 10%, see Table 2), the likelihood that the failure is a true positive (i.e., the positive predictive value, PPV) would be 43%. The second patient is also diagnosed with epilepsy but has a pending disability application because he or she does not feel able to return to work (i.e., a potential external gain incentive for invalid performance). If the same PVT were failed by this patient (base rate of PVT failure of 16%, see Table 2), the PPV would be 57%.
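As a check on these two figures, the PPV calculation can be scripted directly (a minimal Python sketch; the 10% and 16% base rates are the pooled estimates from Table 2, and the function is ours for illustration):

```python
def ppv(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value of a failed PVT via Bayes' rule."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Epilepsy patient without external gain incentives (base rate 10%)
print(f"{ppv(0.10, 0.69, 0.90):.2f}")  # 0.43
# Epilepsy patient with a potential external gain incentive (base rate 16%)
print(f"{ppv(0.16, 0.69, 0.90):.2f}")  # 0.57
```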

Of importance, although a PPV increase from 0.43 to 0.57 is substantial, the latter is still not sufficient to establish invalid performance. Therefore, in line with general consensus, multiple independent validity tests should be employed (Sherman et al., 2020; Sweet et al., 2021). By chaining the positive likelihood ratios (LRs) of multiple failed PVTs, the diagnostic probability of invalid performance (i.e., the PPV) is increased and the diagnostic error is decreased (for an explanation of how to chain likelihood ratios, see Larrabee, 2014; Larrabee, 2022). Note that while considerable weight should be placed on the psychometric evaluation of performance validity, the clinician should also include other test and extra-test information (e.g., the degree of PVT failure, (in)consistency of the clinical presentation) to draw conclusions about the validity of an individual patient's neuropsychological assessment (Dandachi-FitzGerald & Martin, 2022; Larrabee, 2022; Sherman et al., 2020).
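To illustrate the chaining logic, here is a minimal, hypothetical sketch assuming two statistically independent PVTs, each with the sensitivity (0.69) and specificity (0.90) from the example above (the function is ours; see Larrabee, 2014 for the method itself):

```python
def chained_ppv(base_rate: float, lrs) -> float:
    """Convert the base rate to prior odds, multiply by the positive
    likelihood ratio (LR+) of each failed PVT, and convert back."""
    odds = base_rate / (1 - base_rate)
    for lr in lrs:
        odds *= lr
    return odds / (1 + odds)

lr_plus = 0.69 / (1 - 0.90)  # LR+ = sensitivity / (1 - specificity) = 6.9
print(f"{chained_ppv(0.16, [lr_plus]):.2f}")      # 0.57 after one failed PVT
print(f"{chained_ppv(0.16, [lr_plus] * 2):.2f}")  # 0.90 after two failed PVTs
```

On these assumptions, failing two such PVTs raises the PPV from 0.57 to roughly 0.90; note that the multiplication is only valid to the extent that the PVTs are truly independent.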

A strength of the present study is its strict inclusion/exclusion criteria, ensuring accurate PVT results. Unfortunately, none of the studies fulfilled all components of the three predefined quality criteria (selection bias, attrition bias, and adequate sample size/statistics) for determining prevalence. Although we excluded studies with fewer than 20 subjects, most of the remaining studies still had relatively small sample sizes, increasing the likelihood of sampling bias and heterogeneity. Moreover, only 20/47 studies reported an appropriate recruitment method (e.g., consecutive referrals or a good census) necessary for determining the base rate of PVT failure. Additionally, diagnostic criteria varied across studies, limiting the generalizability of the calculated PVT failure base rates. The way potential external gain incentives were examined and defined also varied significantly; indeed, in just over one quarter of the included studies, potential external gain incentives were not mentioned at all, even though they may have been present. Finally, although language (proficiency) and cultural factors relate to PVT failure (Robles et al., 2015; Ruiz et al., 2020), these factors were not mentioned in more than half of the included studies in our meta-analysis.

Additional empirical research is necessary to advance knowledge of performance validity test failure in clinical populations. An important first step for future research is to provide comprehensive details regarding study design, such as recruitment procedures, clinical setting, and demographic/descriptive information (e.g., cultural factors, age, language and language proficiency, level of education). A second improvement would be to form comparable and homogeneous patient samples by specifying diagnostic criteria and providing a detailed specification of how external gain incentives were examined (e.g., querying the patient about potential external gain incentives, such as pending litigation or disability procedures; Schroeder et al., 2021a). Since administration of multiple PVTs is recommended (Sweet et al., 2021), future studies, and specifically meta-analyses, should consider using advanced statistical techniques (e.g., three-level meta-analysis) to handle non-independent effect sizes (Cheung, 2019).

In conclusion, the current meta-analysis demonstrates that PVT failure occurs in a substantial minority of patients seen for routine clinical care. The type of clinical context, patient characteristics, the presence of external gain incentives, and the psychometric properties of the utilized PVT were found to impact the rate of PVT failure. Our findings can be used to calculate clinically applied statistics (i.e., PPV/NPV and LRs) in everyday practice to increase the diagnostic accuracy of performance validity determination. Future studies providing detailed recruitment procedures and sample characteristics, such as external gain incentives and language (proficiency), are needed to further refine knowledge about the base rates of PVT failure in clinical assessments.