Abstract
Generalizability considerations are widely discussed and a core foundation for understanding when and why treatment effects will replicate across sample demographics. However, guidelines on assessing and reporting generalizability-related factors differ across fields and are inconsistently applied. This paper synthesizes obstacles and best practices to apply recent work on measurement and sample diversity. We present a brief history of how knowledge in psychology has been constructed, with implications for who has been historically prioritized in research. We then review how generalizability remains a contemporary threat to neuropsychological assessment and outline best practices for researchers and clinical neuropsychologists. In doing so, we provide concrete tools to evaluate whether a given assessment is generalizable across populations and assist researchers in effectively testing and reporting treatment differences across sample demographics.
Despite an increasing push for consideration of participant demographics, generalizability remains a contemporary threat to psychological research and the validity of neuropsychological assessment. The National Institutes of Health (NIH, 2022) has strengthened requirements to disaggregate effects by race/ethnicity and gender. Still, recent randomized control trials (RCTs) indicate a persistent lack of demographic reporting, with White participants making up 75% or more of the reported samples (De Jesús-Romero et al., 2022, Preprint). That being said, recent studies suggest improvements in the reporting of sample characteristics within both neuropsychology (Medina et al., 2021) and pediatric psychology (Raad et al., 2008). Given changes in recommendations and norms for reporting sample demographics, an important consideration is how this information can be used to understand and further test the generalizability of findings and assessments in pediatric neuropsychology.
Discussions of generalizability in psychology, and neuropsychology specifically, are not new (Drotar, 1994; Okazaki & Sue, 1998; Sears, 1986). Yet, as evident from the need for related initiatives taken by journals such as Neuropsychology (Yeates, 2022), this issue continues to be a contemporary threat that warrants attention. Indeed, the agenda of the Relevance 2050 Initiative by the American Academy of Clinical Neuropsychology (AACN) estimates that 60% of the American population will be “un-testable” by 2050 using current neuropsychological assessment strategies. Thus, we aim to discuss the role of historical influences on generalizability in neuropsychological assessment and synthesize best practices for researchers and clinicians to consider generalizability, including how to overcome several barriers toward advancing generalizability in the field of pediatric neuropsychological assessment.
We define generalizability as the extent to which outcomes of research with certain samples are applicable to the general human population, or across multiple populations and contexts. To concretely evaluate generalizability, samples should include sufficient representation of the population of interest (e.g., avoiding too narrow of a data base, Sears, 1986). In the context of neuropsychological testing, if a test does not specify sample demographics, the clinician is left to wonder whether the test is a good fit for their patient. A lack of generalizability in assessment research can also result in a lack of test fairness or diagnostic error for clients who are underrepresented in research, contributing to compounded stigmatization in the healthcare system. We situate generalizability as a consideration that is inherently intertwined with the scientific method: one that applies at multiple points in the research process and aids in understanding a phenomenon more broadly.
In the following sections, we provide a summary of past discussions of generalizability in order to situate our proposed recommendations for pediatric neuropsychology within the larger field of psychology. We begin with a broad overview of generalizability in psychology, then move to evidence suggesting that generalizability remains a contemporary threat to pediatric neuropsychology research, and, finally, provide strategies to overcome common obstacles to advancing generalizability.
Historical Considerations for Psychology and Neuropsychology
Neuropsychology and psychology in general are never wholly detached from historical trends within the field. Thus, evaluating generalizability and sample demographics must integrate knowledge of how the field has approached these issues in the past. Psychology has been built from a narrow knowledge base driving both active and passive misrepresentations and mistreatment of several stigmatized groups (Guthrie, 2004; Henry, 2008; Sears, 1986).
Notably, stigmatized groups who have been passively and actively harmed by assessment or research are more likely to have medical mistrust that reflects these injustices (Jaiswal & Halkitis, 2019) and reduced cultural safety (Gale et al., 2022). These relationships with the field feed into treatment-seeking barriers and treatment disparities (Haag et al., 2019; Rivera-Mindt et al., 2010). For example, this exclusive knowledge base has contributed to gaps in understanding and treating Black women’s mental health (see Spates, 2012 for a review). More broadly, foundational “facts” developed by a narrow knowledge base may not generalize, evidenced by misdiagnosis disparities across race (e.g., Gara et al., 2012; Kiselica et al., 2021). Thus, it is crucial to recognize environmental and cultural factors that may influence test outcomes in diverse clients (Ardila, 2005).
The early development and application of intelligence testing is a prime example of the active historical harms of assessments and the repercussions of failing to account for generalizability. As outlined in Boake (2002) and Wasserman (2018), French psychologists Alfred Binet and Theodore Simon developed early intelligence tests to identify children whose scores fell below those of an average child of their age so that they could receive special educational support. The test was translated and adapted for use in the USA by psychologist Henry Herbert Goddard, gaining immense popularity and wide usage in schools and other settings, including influencing immigration decisions and policy. Goddard’s conclusions quickly contributed to the use of intelligence assessments to support eugenics, a movement that advocated preventing certain populations from having children through institutional isolation or sexual sterilization. As a result, families who had a child with low intelligence scores were stigmatized and deemed genetically flawed. Furthermore, in 1913, the intelligence test was administered to immigrants at Ellis Island, where scores that largely reflected English proficiency were labeled as intellectual functioning, without considering that individuals lacked formal education and/or any exposure to the English language.
Similarly, in the early 1900s, intelligence tests developed by the American Psychological Association Committee on Psychological Examination of Recruits highlighted racial disparities in test scores. These disparities were used to bolster the claim that intelligence is determined not only by genetics but also by race, fueling immigration and sterilization laws that marginalized Black Americans and immigrant populations in the USA. For example, in 1924, the state of Virginia allowed compulsory sterilization of people with low intelligence, and North Carolina established an official Eugenics Board with racialized sterilization decisions well into the 1970s (Schoen, 2005). The harm of eugenics was thus legitimized by early intelligence testing and psychological science. Indeed, this history informs how individuals interact with neuropsychological testing today and has shaped how intelligence testing has evolved over time.
Today, intelligence tests such as the Wechsler Intelligence Scale for Children (WISC-V; Wechsler, 2014) and the Woodcock-Johnson Tests of Cognitive Abilities (WJ-IV; Schrank et al., 2014) are frequently used in both research and clinical endeavors (e.g., to assess learning disabilities and special education service eligibility, Kamphaus et al., 2000; Kranzler et al., 2020). These tests can be administered alone but are often part of a comprehensive neuropsychological battery, including to determine the functional consequences of a traumatic brain injury (Kinsella et al., 1995). Although intelligence tests have utility in providing numeric values for a construct, it is necessary to consider the historical harms of the test and account for a client’s socio-historical background when interpreting results.
The mistrust and harm done in the name of science have lasting implications: people of color are less willing than European Americans to participate in developmental research due to mistrust (e.g., Freimuth et al., 2001). In the context of historical harms, we turn to reviewing literature on when and why generalizability is an important consideration with a focus on how a lack of attention to these factors threatens neuropsychology research and practice.
When and Why Does Generalizability Matter?
One implication of psychology’s exclusionary knowledge base is that many foundational studies are rooted in understanding processes in Western, educated, industrialized, rich, and democratic (WEIRD) participants (Arnett, 2009; Henrich et al., 2010). Henrich et al. (2010) reviewed scientific research across behavioral sciences to assess whether researchers can reasonably generalize from studies using WEIRD samples. They found that addressing research questions using data drawn from a narrow sub-population significantly undermines generalizability. Following from this, an important question is when and why a lack of generalizability is an issue.
Notably, there may be some situations for which generalizability is not a primary concern for a study. If the research objective is limited to interpretation of only the sample population, then generalizability may not be relevant and/or the research question may be best answered using WEIRD samples, which can require fewer resources and traditional data collection methods (see Mook, 1983). For instance, in qualitative research or case study analysis, the inference process is not applicable, and generalizability is not a relevant goal (Myers, 2000). In pediatric neuropsychology, this may also occur with highly specialized assessments (e.g., ADHD assessments in 8th grade children in the USA). However, even with this prior example focusing on a specific cultural context and age group, racial/ethnic disparities in test sensitivity are present, making generalizability a valid concern (Morgan et al., 2013). When considering the field of pediatric neuropsychology, researchers and clinicians typically seek to make inferences from a sample about a broader population. Thus, a vital question is which populations are represented in a sample and whether this aligns with the claims about whom the findings will extend to.
We argue that universalism—the idea that a basic psychological process should apply across populations—is often implied or assumed, even when a test development study may include narrow samples. For example, in examining 5000 published articles across psychology more broadly, titles rarely mentioned regional details when samples were from the USA relative to other WEIRD and non-WEIRD regions (Cheon et al., 2020). In addition, researchers in the USA were more likely to qualify race or ethnicity in titles for research with minoritized samples (e.g., Asian Americans) compared to those with White samples (Cheon et al., 2020; Roberts & Mortenson, 2022). Failing to specify these characteristics in published work may reflect an overall trend of researcher assumptions, one potentially shared by reviewers and editors: effects found in White U.S. populations will generalize to humankind more broadly and testing generalizability in these findings is thus seen as low priority. Illuminating the focus on WEIRD, U.S.-centric samples has led to an expanding body of work that tests the generalizability of what were assumed to be universal truths. Below, we outline the benefits of considering generalizability and review some key approaches to issues of generalizability in neuropsychology.
Threats and Promises of Generalizability in Neuropsychology
As mentioned previously, drawing inferences is an important goal of researchers. A finding that is not generalizable (e.g., neural patterns in attentional control; Hedden et al., 2008) could suggest that the effect applies to only specific populations or groups. And yet, these constraints on generalizability are not always discussed. There are two major implications when a field’s effects or assessments are representative of only a small number of people, particularly without acknowledgment.
First, a lack of generalizability operates as a threat to understanding humankind more broadly. A clear limitation in this regard is that failing to test the generalizability of a given finding severely constrains knowledge of psychological phenomena. For instance, consider an example outlined by Haeffel and Cobb (2022) in testing the cognitive vulnerability to depression hypothesis across five populations. Despite existing theorizing on depression that centers negative cognitions as a potential driver of depressive symptomology (e.g., Alloy et al., 1999), they did not find evidence that the association between cognitive vulnerability and depressive symptoms generalized to Honduran and Nepali participants. They suggest that cognitive vulnerability may confer depression risk in some countries, but not others. Without such an examination, an understanding of how cognitive factors are related to depression may remain incomplete. Relatedly, ungeneralizable assessments and measurements impact researchers’ and clinicians’ ability to capture the construct in question. An inability to measure a construct in different populations serves as a direct challenge to adequately research the construct.
Second, a lack of generalizability is a threat from a practical standpoint. People have a right to valid, reliable assessment, and an ungeneralizable field restricts access to this right. When treatments or assessments lack applicability, populations who are underrepresented in research are more likely to be deprived of adequate services. Carrying forward the example of negative cognitions, a treatment that focuses on reframing negative cognitions (e.g., as in CBT, Kaufman et al., 2005) may not be suitable across cultures. Applying ungeneralizable assessments could thus lead to erroneous conclusions or diagnoses. These conclusions and treatment disparities are exacerbated by limited sampling of racial minority populations during test norming, as acknowledged by the American Academy of Clinical Neuropsychology (AACN, 2021). Consequently, generalizability becomes a social justice issue: access to tools for efficacious treatment must serve all groups, not just WEIRD samples. In this way, a lack of generalizability operates as a concrete threat to the field of neuropsychology.
In contrast, knowing and appropriately considering the generalizability of findings serve to benefit neuropsychology. For instance, knowledge of whether a finding generalizes can improve cultural competency in assessment and practice, subsequently directing resources toward treatments that are highly efficacious across multiple populations. This illustrates the promise of considering generalizability. Testing generalizability can advance both theory and practice if a finding generalizes or fails to generalize. For instance, the lack of generalizability regarding negative cognitions illustrates a potential gap to be explored. From a scientific viewpoint, greater consideration of whether effects generalize can help us understand the why, or mechanisms, behind a phenomenon.
Approaches to Addressing Generalizability in Neuropsychological Assessment
One strategy to address the lack of representation and subsequent potential lack of generalizability in neuropsychological research has been the implementation of demographic norm adjustments. At face value, applying different norms for different populations appears to address the fact that assessments may differentially capture constructs across groups, thus improving diagnostic accuracy. However, there are several major concerns with this approach.
The American Academy of Clinical Neuropsychology’s (AACN) 2021 position statement highlights severe shortcomings in how neuropsychology has historically attempted to address issues of generalizability using race-stratified norms. Notably, the use of race norming erases the diverse drivers of existing disparities and reinforces myths of innate biological differences across people of different races. In reality, there are no discrete differences in genetic variation across racial groups, and conceptualizations of race vary widely across regions, pointing to race as a social category (Smedley & Smedley, 2005). Despite this, race norming was often used to further marginalize Black people in the USA. For instance, in the case of the National Football League (NFL), race norms for dementia systematically placed Black players at a disadvantage for warranted compensation (Possin et al., 2021). Therefore, recommendations by the AACN emphasize accounting for environmental factors such as socio-economic status, access to education, disparities in nutrition, the historical impact of racism and discrimination, and cultural differences during test interpretation and overall assessment. Further, mistrust is a predictor of cognitive test performance and other neuropsychological testing outcomes (Terrell et al., 1981). Yet, race norming fails to explicitly account for the fact that medical mistrust is more likely for minoritized groups, such as Black Americans, who have experienced stigma and harm from the medical system.
Past offensive applications of race norming illuminate flaws in demographically stratified norms more broadly. Drawing from intersectional and historical perspectives, issues of generalizability do not simply arise from a need to adapt a measure from one population to another. Rather, it is important to consider socio-contextual factors that impact the measure. For example, in the case of race-stratified norms, AACN outlined how the use of race in norming reinforces a notion of biological differences across racial groups. Yet, the observed differences are not due to internal factors or the demographic category. Rather, there is more evidence that factors such as who developed the measure, which populations it was validated on, and what cultural knowledge it draws from can cause variation in outcomes (see also Gould, 1996 for a similar perspective on intelligence testing). In other words, demographically stratified norms imply that differences exist due to group categories (e.g., race or gender), promoting a narrow view of what causes assessment differences within that domain. One consequence of this practice is reduced attention to why these differences exist and in what ways the assessment fails to generalize.
We propose two important considerations: understanding not only whether a construct generalizes but also how it generalizes. From an inferential statistics perspective, testing generalizability is similar to asking whether a significant effect is observed across multiple samples and thus generalizes to a broader population. The assumption that the sample is representative of the population is therefore ingrained in statistical inference. In practice, however, multiple samples are needed to provide cumulative support for generalizability (see also Bonett, 2012). This testing process is further complicated by a frequent lack of statistical power and small relative group sizes for minoritized samples. Thus, another equally important question is “how”: Is the strength of the effect equivalent across samples? Does the measurement capture a construct in the same way across samples? The latter questions are less frequently examined but remain important components of generalizability. Notably, psychological measurement is rooted in the idea that we can adequately measure internal or behavioral processes. If measurement differs across samples and the ability to detect a significant effect is weakened for some group comparisons, this can impede sufficient tests of generalizability and reduce understanding of the population-level construct of interest.
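The “how” question can be made concrete with a moderation test: fitting a model with a group-by-predictor interaction asks whether an effect’s strength is equivalent across samples. The sketch below uses simulated data and illustrative variable names (a standardized “risk” predictor and a symptom outcome); it is not tied to any particular assessment or dataset.

```python
# Sketch: testing whether an effect generalizes across two groups via a
# group-by-predictor interaction. All data are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 2, n)        # 0 = sample A, 1 = sample B
risk = rng.normal(size=n)            # standardized predictor
# Simulate a weaker effect in group B (true slopes: 0.5 vs 0.2)
slope = np.where(group == 0, 0.5, 0.2)
symptoms = slope * risk + rng.normal(scale=1.0, size=n)

df = pd.DataFrame({"symptoms": symptoms, "risk": risk, "group": group})
model = smf.ols("symptoms ~ risk * C(group)", data=df).fit()

# The risk:C(group) coefficient estimates the difference in slopes;
# a reliable nonzero value indicates the effect differs across samples.
print(model.params["risk:C(group)[T.1]"])
print(model.pvalues["risk:C(group)[T.1]"])
```

A significant interaction flags a constraint on generalizability, but with small minoritized subsamples this test is often underpowered, so a non-significant interaction should not be read as evidence that the effect is equivalent across groups.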
The Beck Depression Inventory (BDI) provides a clear illustration of why generalizability matters in these two ways. While developed primarily for clinical use, the BDI is widely used across populations. A consequence of its wide use and established reliability is that empirical work using the BDI rarely reports sample-specific reliability information (e.g., only 7.5% of 1200 articles, Yin & Fan, 2000). And yet, the reliability of BDI scores is meaningfully impacted by who is responding to the measure; Yin and Fan (2000) found that BDI reliability was substantially lower among participants who abused substances than among non-clinical participants. Lower reliability can lead to an incomplete understanding of the antecedents and treatment efficacy of depression. For instance, if people who abuse substances tend to show lower reliability in depression scores when using the BDI, it would be more difficult to meaningfully interpret pre- and post-test scores following depression interventions. A substantial change in an individual’s score from pre- to post-test could represent true change or chance change due to measurement error. Similarly, null findings could reflect measurement noise or an inability to capture depressive symptomology. Thus, inconsistent reliability estimates across samples can substantially complicate the conclusions of research that tests whether the same factors are predictive of depression across populations.
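One low-cost remedy for the reporting gap noted above is to compute and report reliability separately for each sample. A minimal sketch of per-group Cronbach’s alpha follows; the item data are simulated (the 21-item structure echoes the BDI, but all numbers are illustrative, not real norms).

```python
# Sketch: sample-specific Cronbach's alpha as a check on whether
# scale reliability generalizes. Item responses are simulated.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents, n_items) matrix of item scores."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

rng = np.random.default_rng(1)
n, k = 500, 21
# Group A: items load strongly on a shared factor (high reliability)
factor_a = rng.normal(size=(n, 1))
group_a = 0.8 * factor_a + 0.4 * rng.normal(size=(n, k))
# Group B: weaker shared factor, more item-level noise (lower reliability)
factor_b = rng.normal(size=(n, 1))
group_b = 0.4 * factor_b + 0.8 * rng.normal(size=(n, k))

print(round(cronbach_alpha(group_a), 2))  # higher
print(round(cronbach_alpha(group_b), 2))  # lower
```

Reporting these per-sample estimates alongside omnibus reliability makes it visible when a scale measures a construct less consistently in one population, rather than letting a single pooled coefficient mask the difference.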
In sum, generalizability feeds into neuropsychological assessment and practice at multiple levels: from the development and validation of assessment and client–clinician interactions to diagnosis and treatment. Yet, there are scattered guidelines with which to apply knowledge of these issues and several barriers to applying recommended best practices to advance generalizability of the field. With this in mind, we turn to synthesizing recommendations and best practices at each of these levels. We focus on methodological and statistical considerations in assessment research and then move to discuss how clinicians can integrate these into assessment and treatment. In contrast to strategies that focus on only one level of this process—such as adapting assessment scoring using demographic-stratified norms—we advocate for a multipronged approach that integrates generalizability into all stages of the research process.
Advancing Generalizability in Pediatric Neuropsychology
Moving toward a more generalizable field necessitates a shift in research methods and values, additional resources to recruit underrepresented populations or increase access for rural or low-income sample pools, and counteracting mistrust created by the prior harms of research on minoritized communities (i.e., avoiding transactional research and cultural incompetence when working with diverse samples and clients). In the context of pediatric research, several additional challenges come into play: recruitment often requires additional resources relative to adult samples, and participation barriers include dual consent/assent in addition to family resources (e.g., circumstances such as limited morning availability of a caregiver to accompany children to a lab study). Recent advances in methodological approaches, combined with work on cultural competency in neuropsychology practice, provide several avenues through which generalizability can be integrated and evaluated.
In this section, we review concrete recommendations to assist in effectively testing and reporting treatment differences across sample demographics and in evaluating whether a given assessment is generalizable across populations. Given the unique barriers to these various recommendations, incorporating every recommendation may be an unfeasible goal. Rather, we aim to provide researchers and clinicians with several strategies to overcome some barriers and increase generalizability, with the aim of cumulative advances in the field of pediatric neuropsychology. To synthesize these best practices, we reviewed literature on generalizability considerations in neuropsychology and developmental and clinical psychology, with a focus on novel approaches to these long-standing problems. Table 1 provides an overview of recommendations and best practices for researchers and clinicians in three main areas of pediatric neuropsychological assessment:
1. Research design: reporting, measuring, and evaluating sample demographics appropriately; strategies for increasing sample and researcher diversity
2. Analytic approaches: best practices for testing group differences within and across samples; statistical considerations for reliability and effect size
3. Evaluation and application of findings: cultural competency in applying assessment; evaluating the generalizability of a scale; ways to look beyond numerical scores
Research Design
At the research design stage, several practices can increase the information available for evaluating generalizability in neuropsychological assessment. Notably, scholars have long called for population-based and theory-driven research to advance generalizability (Drotar, 1994). That is, prior to collecting data, researchers should actively consider the population of interest and design studies with those considerations in mind. In addition, theory testing relies on clearly specifying such populations, including whether a given theory is assumed to generalize to other groups. Later in the research design stage, decisions about which sample demographics to recruit are closely tied to generalizability. Sample demographics are typically defined as quantitative assessments of sample characteristics including, but not limited to, age, gender/sex, race/ethnicity, and socioeconomic status. Thus, our first two recommendations in Table 1 concern increasing sample (and researcher) diversity along with measuring and reporting sample demographics.
First, in terms of increasing sample and researcher diversity in published literature, we recommend that research on pediatric neuropsychological assessment continue to expand recruitment strategies. Expanding recruitment strategies involves intentionality of sample selection (i.e., evaluating who the target sample is and whether that aligns with the question or is a sample of convenience) and intentionality of recruitment outlets (i.e., multimethod recruitment strategies that reach multiple communities/labs). Related to issues of sample diversity, one commonly cited barrier to representative samples is that underrepresented populations are “hard to reach.” However, one reframe of this barrier is that underrepresented populations are labeled as hard to reach because research with these samples requires a shift from traditional methods used with Western or White samples (e.g., convenience sampling, Sears, 1986; see also Lange, 2019). This is supported by a review of top psychology journals: White first authors were less likely than first authors of color to publish research with participants of color (Roberts et al., 2020). In addition, psychology often has a practice of excluding theoretical approaches that challenge dominant norms about how research should be conducted (Buchanan et al., 2021; Settles et al., 2020). Thus, White researchers and other researchers of privilege likely continue to employ research practices that exclude underrepresented samples in the field of neuropsychology. Greater variation in research and recruitment methods can move pediatric neuropsychology toward a more generalizable science. We review three commonly proposed strategies—remote platforms, team science, and community-based research—and discuss their feasibility for use in pediatric neuropsychology.
Online crowdsourcing platforms (e.g., Amazon’s Mechanical Turk, Prolific) have been proposed as one way to increase the generalizability of non-clinical samples relative to student samples (see Henrich et al., 2010; Ledgerwood et al., 2017). These are not without flaws, however, and the age and abilities of pediatric samples restrict researchers’ options. Most online recruitment platforms exclude participants under 18 years of age, and many young participants may be unable to read written instructions or follow complex, text-based stimuli. The advent of the COVID-19 pandemic did bring several remote platforms geared toward pediatric samples. These platforms have the potential to reduce barriers to recruitment of underrepresented populations (e.g., Lookit, Nelson & Oakes, 2021; Scott & Schulz, 2017, and TheChildLab, Sheskin & Keil, 2018). While early work points to some validity of online protocols for neuropsychological testing/teleneuropsychology in adults (e.g., Requena-Komuro et al., 2022; Saban & Ivry, 2021) and pediatric samples (Hamner et al., 2022; Salinas et al., 2020), additional work is needed to better validate and establish protocols for online testing.
Drawing from discussions on the promises of team science (Hall et al., 2018) and increased funding for interdisciplinary collaborations (e.g., NIH BRAIN Initiative, Insel et al., 2013; Koroshetz et al., 2018), we believe increased collaborations and large-scale projects on understanding generalizability of assessments provide a promising path forward to increasing both sample and researcher diversity. Notably, greater reliance on team science would enhance investigations that allow for large-scale, in-person data collection with more representative samples. Multiple labs can participate and diversify the use of community samples to levels that would be unfeasible for one lab to reach.
For instance, developmental psychologists have called for the creation of organizations like the Psychological Science Accelerator (PSA; Moshontz et al., 2018) aimed at consolidating resources, establishing shared infrastructure, and increasing research outreach by providing participants with internet access (see Sheskin et al., 2020). The ManyBabies Consortium provides additional support for the validity of this approach to advancing generalizability. Past work demonstrates that this consortium increases the diversity of infant research with data that reflect the population of interest for most infant studies (“human infants”; Byers-Heinlein et al., 2020; Visser et al., 2022). Furthermore, it provides support for research across underfunded labs, thus enhancing the diversity of both teams and samples. In addition to increasing sample diversity, team-based science thus has the potential to counteract uniformity biases by incorporating more perspectives, which feed into greater diversity in what research questions are pursued and how constructs are measured. To our knowledge, there are no existing organizations that serve this function within pediatric neuropsychology.
A third promising method is community-based participatory research (CBPR). This technique involves collaboration with people who have lived experience recruited via outreach (community events, marketing, and word of mouth). Participants’ input is not only incorporated, but actively drives the path of the research. Although this approach has been used with adults (O’Bryant et al., 2021), it has not been discussed at length in the context of pediatric neuropsychology. We recommend clinicians and researchers consider the value of adapting this approach to pediatric neuropsychology, informed by best practices in the health field more broadly (see Israel et al., 2005 for a review). Parents play an important role in encouraging their child’s participation, and thus, CBPR could dismantle mistrust issues and offer logistical flexibility. Trust in clinicians, parent satisfaction with pediatric neuropsychological evaluations, and greater parental implementation of report recommendations have been noted as pressing issues for the field and vary across demographic factors including income (Elias et al., 2021; Gross et al., 2001; for systematic reviews see Fisher et al., 2022, Spano et al., 2021). While reducing the barriers that may present in participation of diverse groups requires stepping out of the comfort of convenience sampling, CBPR provides a well-supported option.
With respect to our second recommendation in Table 1, recent reports suggest greater attention to participant demographics in neuropsychology research (e.g., Raad et al., 2008), but reporting of demographics can continue to improve. Several current best practices exist in reporting and measuring participant demographics. The Journal Article Reporting Standards recommend reporting demographic groups in both the abstract and methods, selection methods, and whether exclusions may inadvertently restrict demographic characteristics (Appelbaum et al., 2018). As an example of current recommended best practices in measurement, Call et al. (2022) outline a social justice approach to measuring and using demographic data, which includes important recommendations such as avoiding the term “Other” in measurement (e.g., using “Not listed” or open-ended data when multiple-choice responses are inadequate, Cameron & Stinson, 2019) and informing participants of the importance of their demographic data and the way in which it will be used. When possible, it is also a best practice to use terms that align with how individuals describe themselves (e.g., adults with autism in English-speaking countries generally prefer identity-first language, whereas adults with autism in Dutch-speaking countries prefer person-first language, Buijsman et al., 2022; see also Monk et al., 2022). In the context of pediatric samples, careful attention should be given to how children and adolescents describe themselves, relative to caregiver-ascribed labels.
Because social categories differ across cultures, demographic measurement should be considered at the design stage to determine what is appropriate for the context. And, there may be instances where collection of demographic information is limited (e.g., when using data from medical records or brief intake information). Thus, our goal is not to establish a one-size-fits-all recommendation for how to measure demographics; there are multiple inclusive measures, and the best choice will depend on the context and groups being assessed. Rather, we raise these examples to demonstrate the breadth of research in this area. A general approach to assessing participant demographics is, first, to consider the context or region of the sample; second, to identify the relevant demographics based on past research; and third, to develop items that appropriately assess these demographics while accounting for cultural norms and changes in language over time.
Going a step further on this issue of sample demographics, reporting sample demographics is often understood as including participant characteristics within a method section, leaving out a vital connection: contextualizing findings within sample demographics. It has been proposed that publications include sections to outline constraints on generalizability in the context of sample characteristics (e.g., see Simons et al., 2017). Two areas in which contextualization of samples is severely lacking in psychology more generally are in labeling the roles of whiteness and U.S. centrism in research findings. As discussed previously, findings with White participants in the USA are often discussed with generic language that implies White = humankind (Roberts & Mortenson, 2022). As another example, Salter and Adams (2013) outline how APA’s bias-free language excludes discussion of terminology for White people. These types of exclusions reinforce the idea that race and racism are marginal constructs that do not apply to White Americans, leading to greater invisibility of the implications of white-, Western-centered neuropsychology. Reporting racial demographics in titles and abstracts even when the sample is primarily White helps dismantle this. Yet, this strategy alone also falls short (see also Remedios, 2022). A more generalizable science necessitates interrogating the positionality of power among both researchers and samples. Said another way, research conducted on White or Western samples, by White or Western researchers, can enhance generalizability by not only reporting sample information but actively making the implications of this visible (e.g., including discussions of positionality or a section on the constraints of generalizability). Within pediatric neuropsychology, these additions can provide vital information to clinicians about potential limits on the applicability of findings.
In sum, the inclusion of diverse sampling and research teams, as well as thorough reporting of sample demographics, has the potential to increase generalizability and direct attention to the constraints of WEIRD samples. We discussed several potential obstacles to increasing sample diversity and strategies to overcome these obstacles in the context of research design and reporting—notably, team science collaborations and the use of computer-based implementations. We recommend that journal editors, grant reviewers, and journal reviewers encourage researchers to report their sample demographics and limits of generalizability. In addition, clinicians should conscientiously assess sample demographics to calibrate their confidence in generalizing the findings of research. We next turn to recommendations for analytic approaches seeking to illuminate group differences or constraints on generalizability.
Analytic Approaches
We recommend two broad analytic best practices for advancing generalizability in neuropsychological research. First, as summarized in Table 1, researchers should consider testing and reporting effects across potentially informative sample demographics. Measurement invariance can be used to aid in this practice and assess the extent to which a construct is psychometrically equivalent across different contexts such as between-group or across-time comparisons (see Putnick & Bornstein, 2016 for a review of measurement invariance in psychology). Researchers typically seek to establish measurement invariance prior to testing for group comparisons to rule out the possibility that differences are due to psychometric differences (e.g., groups interpreting the items in different ways). Thus, establishing invariance is a way to increase confidence in the validity of interpreting group differences.
Measurement invariance is often tested using multigroup confirmatory factor analysis. Multiple levels of invariance can be met, ranging from configural invariance (whether the model or basic factor structure of the scale holds across conditions), metric invariance (whether the items contribute to the construct in the same way across conditions or show equivalent factor loadings), scalar invariance (equivalence of item intercepts, meaning that the overall levels of items are the same across conditions), to strict invariance (the items’ errors or residuals are equivalent). Most scholars argue for meeting scalar invariance at a minimum to compare mean level differences across groups, although strict invariance may be necessary for comparing observed composite scores (Cheung & Rensvold, 2002; Steinmetz, 2013; van de Schoot et al., 2012).
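To make the stakes of scalar invariance concrete, the following sketch (in Python, with hypothetical numbers; multigroup CFA itself would typically be run in dedicated software such as lavaan) simulates two groups whose latent construct is identical, but where one item's intercept is shifted in the second group. The observed composite means then differ even though the groups do not differ on the construct:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # large n so sampling noise is negligible

# Identical latent construct in both groups (same mean, same SD)
latent_a = rng.normal(0.0, 1.0, n)
latent_b = rng.normal(0.0, 1.0, n)

def observed_items(latent, intercepts, loadings):
    """Generate three observed items from a latent score."""
    noise = rng.normal(0.0, 0.5, (3, latent.size))
    return intercepts[:, None] + loadings[:, None] * latent + noise

loadings = np.array([0.8, 0.7, 0.9])        # equal across groups: metric invariance holds
intercepts_a = np.array([0.0, 0.0, 0.0])
intercepts_b = np.array([0.0, 0.5, 0.0])    # item 2 shifted: scalar invariance violated

composite_a = observed_items(latent_a, intercepts_a, loadings).mean()
composite_b = observed_items(latent_b, intercepts_b, loadings).mean()

# The composites differ by about 0.5 / 3, purely an artifact of the
# shifted intercept -- not a true difference on the construct.
spurious_gap = composite_b - composite_a
```

Under scalar non-invariance of this kind, a composite mean comparison would misattribute an item-level artifact to a group difference on the construct, which is why scalar invariance is typically required before interpreting mean differences.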
Measurement invariance provides a pathway to testing generalizability across populations. For instance, Avila et al. (2020) had over 6000 participants complete a neuropsychological battery to compare testing invariance across ethnicity/race and sex/gender. Their results suggest full invariance for most tests, except for language domain tests (see also Tuokko et al., 2009, for similar findings regarding language function tests such as verbal ability). Although invariance is typically discussed in neuropsychology publications focused on scale psychometrics, increased reporting regardless of the aims is advised. At an empirical level, increasing reporting of measurement invariance would aid in understanding the extent to which certain scales may or may not generalize across populations, as well as determine whether testing for between-group differences is appropriate. Structural equation modeling approaches comparing goodness-of-fit indicators are most common for testing invariance, but item response theory is another strategy that is increasingly employed (Millsap, 2010; Reise et al., 1993). Researchers looking to conduct invariance testing and clinicians looking to evaluate invariance of scales can refer to van de Schoot et al. (2012) for a practical checklist. Other scholars have developed comprehensive tutorials for researchers in the freely available statistical program, R (Fischer & Karl, 2019; Hirschfeld & Von Brachel, 2014).
However, an important limitation in measurement invariance and testing differences across groups is that group differences across demographics are generally used as a proxy for something (e.g., discrimination or cultural values). Thus, testing group differences across demographics can overlook what the demographic characteristics are being used as a proxy for and whether that construct should be measured instead. Notably, proper invariance testing approaches may also necessitate greater use of supplementary materials (to report invariance information) and/or open data practices (to provide the data for others to conduct invariance testing) in journals.
Another key obstacle to properly addressing and testing group differences is that research on underrepresented populations tends to use smaller sample sizes due to decreased infrastructure in recruiting these populations. This can easily result in an erroneous conclusion: observing a significant association in one group, but not in another, and concluding this is evidence for a lack of generalizability. In reality, one possibility is that the overrepresented group provides sufficient statistical power to detect the association, while the smaller sample size of the other group leaves the same test underpowered within that group. For instance, in one prior reference to unpublished work, Haeffel and Cobb (2022) suggest evidence of a lack of generalizability for some cognitive theories of depression. To illustrate when evidence for this type of claim would be invalid, we walk through one analytic approach: this unpublished work could have sampled across multiple countries and tested the association between cognitive vulnerability and depression within each subsample, finding a significant association in most countries but no significant association among Nepali and Honduran participants. If data collection in Nepal and Honduras included far fewer participants than in the other countries tested, significance in one group and a lack of significance in another may reflect not a failure of the effect to replicate but the limitations of null hypothesis testing with unequal sample sizes.
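The sample-size pitfall can be shown with a back-of-the-envelope calculation. In this sketch, the numbers are hypothetical and the Fisher z transform stands in as an approximation for an exact correlation test:

```python
from math import atanh, erfc, sqrt

def p_from_r(r, n):
    """Approximate two-tailed p-value for a Pearson correlation r
    with n observations, via the Fisher z transform."""
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))

# The identical population effect (r = .25) tested in a well-powered
# group and in a smaller, under-recruited group:
p_large = p_from_r(0.25, 400)   # well below .05 -- "significant"
p_small = p_from_r(0.25, 50)    # around .08    -- "not significant"
```

Same effect size, different verdicts: the disagreement reflects statistical power, not a failure of the effect to generalize.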
When invariance testing or sufficient sample sizes are unfeasible, researchers who employ this type of comparison should at a minimum examine effect sizes (e.g., r²) and provide additional information about the equivalence of the two effects. As an example of best practice in this area, McCaffrey et al. (2023) report effect sizes alongside p-values, examine mean score equivalency, and take into account the role of lower statistical power in their comparisons. We thus advocate for reporting effect sizes for all comparisons regardless of significance, as listed in Table 1 (see also Schatz et al., 2005). This practice would aid in a more meaningful understanding of comparisons beyond significance and provide information for future sample size determinations. Lab collaborations, a strategy discussed earlier, would also help increase the feasibility of testing for group differences with appropriate sample sizes.
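One way to provide additional information about the equivalence of two effects is to test the difference between the correlations directly, rather than comparing their p-values. Below is a minimal sketch using Fisher's z test for two independent correlations (all numbers are hypothetical):

```python
from math import atanh, erfc, sqrt

def compare_independent_rs(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent
    Pearson correlations; returns (z statistic, two-tailed p)."""
    z_diff = atanh(r1) - atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = z_diff / se
    return z, erfc(abs(z) / sqrt(2))

# r = .30 in a large group vs. r = .25 in a small group: the second
# effect is non-significant on its own, yet the two effects do not
# differ reliably from one another.
z_stat, p_diff = compare_independent_rs(0.30, 400, 0.25, 50)
```

Reporting this direct comparison alongside effect sizes guards against reading "significant here, not there" as evidence of a group difference.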
In our second analytic recommendation, we advocate that reliability information should always be examined at a sample level, rather than presumed based on previous work. As illustrated with the BDI, the reliability of assessments is a sample-specific property and can vary according to the individuals being assessed. While psychometric-focused studies can provide valuable information regarding assessment constraints, this is also something that can be emphasized in empirical work. Considering the reliability of tests in a sample is important because lower reliability can result in lower statistical power. In addition, comparing the effect sizes of two identical studies has little meaning when the reliability of measures differs (Parsons et al., 2019). In assessing the generalizability of effects, reliability can be a component of interest (i.e., how a scale generalizes across populations) or a confounder (i.e., impairing comparisons of group scores).
When assessments are developed and validated with specific populations, reliability of these assessments may be lower for the global majority. This can in turn exacerbate the potential of null findings due to lower statistical power, compounded with typically smaller samples for minoritized populations. Said another way, simply comparing the significance of hypothesis tests across groups is insufficient and may reflect method-specific variation (in terms of differences in reliability or group sizes). At an empirical level, researchers should thus calculate and report appropriate reliability information for all primary scales (e.g., Nazaribadie et al., 2021, report prior reliability and sample-specific reliability information for the Wisconsin Card Sorting Test).
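Sample-specific reliability is straightforward to compute and report. The sketch below (Python, with simulated data) computes Cronbach's alpha for the same hypothetical five-item scale administered to a low-noise and a high-noise sample, and includes Spearman's attenuation formula, which shows how lower reliability shrinks observable effects:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances / total_variance)

rng = np.random.default_rng(7)
trait = rng.normal(size=500)

def administer(noise_sd):
    """Five items tapping the same trait; noise level is sample-specific."""
    return trait[:, None] + rng.normal(0.0, noise_sd, (500, 5))

alpha_low_noise = cronbach_alpha(administer(0.7))   # high alpha in this sample
alpha_high_noise = cronbach_alpha(administer(2.0))  # same scale, noisier sample

# Spearman's attenuation formula: the observed correlation equals the
# true correlation shrunk by the square root of the two reliabilities.
def attenuated_r(true_r, reliability_x, reliability_y):
    return true_r * (reliability_x * reliability_y) ** 0.5
```

For instance, a true correlation of .40 measured with two scales of reliability .90 is observed at about .36, but drops toward .22 when reliability falls to .55, which is one reason identical effects can look different across samples with different measurement precision.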
In sum, we have put forth two analytic approaches that would aid in generalizability considerations across neuropsychological assessment: further analyses that test for group differences, when feasible given the sample, and testing reliability of tests in all empirical work. Yet, the statistical validity discussed above is not an endpoint; results of assessments, even if internally valid and reliable, should consider the clients’ larger context (e.g., see Gale et al., 2022). Our final set of recommendations focuses on applying knowledge of generalizability in clinical practice and evaluating research.
Evaluation and Application of Findings
Thus far, our reviewed recommendations have focused on best practices to improve information to advance or test for generalizability (e.g., reporting reliability/demographic information, testing for group effects, increased use of novel methods and collaborative data), with few implications for evaluating generalizability from an applied perspective. Now, we discuss some ways in which clinicians and reviewers/editors of research can aid in generalizability evaluation and properly apply such findings.
In evaluating the generalizability of assessments, systematic reviews and meta-analyses can provide valuable information to clinicians, to the extent that null findings are included. Kopp et al. (2019), for example, provide evidence that split-half reliability for the Wisconsin Card Sorting Test (WCST) had likely been underestimated in previous work with non-clinical adult samples. Their findings with neurological inpatients demonstrated high split-half reliability for the WCST, suggesting that this assessment maintains reliability with clinical samples but should be used with caution in non-clinical samples. This has clear implications for clinical use, and a researcher could integrate this knowledge by selecting alternative measures of cognitive flexibility for a non-clinical sample. In terms of pediatric fields, reviews focused on particular cognitive domains, such as executive functioning in children, can also be fruitful. For example, Souissi et al. (2022) comprehensively reviewed the leading theories of executive functioning and how it is measured in children. They further make recommendations for researchers and clinicians on how to better assess this domain using a multifaceted approach (performance-based tests along with ecologically valid measurement) and accounting for contextual (e.g., SES) as well as cultural (e.g., bilingualism) factors. Systematic reviews can thus provide a wealth of information for clinicians to integrate and present a snapshot of which tests might be most generalizable or appropriate for a given population. Further, the review above (Souissi et al., 2022) suggests tangible practices for applying this knowledge to assessment. Thus, large-scale reviews can both synthesize research gaps and inform clinical practice.
From a research point of view, improving generalizability is more likely when there are structural aids to achieve and incentivize the strategies outlined in the sections above. Neuropsychology shows promise here: reports from neuropsychological test batteries are often published regardless of statistical significance or novelty (Lange, 2019), which could mean there is minimal publication bias in scale evaluation literature. And, before using neuropsychological assessments with pediatric samples, assessments are often carefully evaluated across age and validated in target age groups (e.g., Pearson, 2018). However, across psychology, publishing is a persistent pressure. And, while many of the strategies to increase sample size or expand sample diversity address cost limitations, they do not address the resource of time. Using these methods may take greater time and effort on the part of researchers, and more structures that incentivize the time to make this shift are needed (e.g., special issues on generalizable methods). A lack of incentives can thus limit the availability of information from which to evaluate generalizability.
Another important consideration with structural incentives is that novel and null findings should be equally publishable in both high-impact and specialty journals to maintain structures that incentivize reporting null findings and extensions. Pre-registration and registered reports are two approaches aimed at minimizing bias against null findings (Gonzales & Cunningham, 2015). Although much discussion has focused on how individual researchers can implement pre-registration, these approaches can be incentivized at the journal level to reduce systemic or peer review bias against null findings and toward novelty (e.g., special registered report submission tracks). Similarly, reviewers can place more weight on the importance of sample diversity, regardless of the research question’s novelty, or encourage sample size justification for WEIRD samples. That is, generalizability considerations could receive more value if they are viewed not just as an incremental contribution but instead as a novel approach, even if the research question has been asked with narrow samples before.
From a clinician point of view, applying these findings into practice requires a holistic approach to assessment. Our final two recommendations in Table 1 focus on the applicability of research on generalizability. During client and collateral interviews, clinicians should strive to paint a comprehensive picture of a client’s socio-historical background with special attention to historical reasons for treatment-seeking gaps and mistrust compiled throughout the years. Cultural values and norms that are unique to the client should be thoroughly understood by the clinician to account for variability in test scores attributed to ethnic-cultural and linguistic diversity. Information regarding the socio-cultural and historical background of a child is of paramount importance not only for better case conceptualization but also for evaluation of assessment fit in terms of both cultural loading and linguistic demands (for recommendations, see Pedraza et al., 2020, and the ECLECTIC approach; Fujii, 2018). This information should drive the clinician’s judgment of choosing assessments that would be representative of the client’s neuropsychological functioning.
In addition to evaluating psychometrics and invariance testing, clinicians should thus pay close attention to the sampling population of the test and stay informed on recent literature in line with Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014). Failing to acknowledge factors such as cultural loading and using inappropriate tests introduces systematic biases, including barriers in test takers’ abilities to demonstrate their true level of functioning. Misrepresenting current neuropsychological function can further lead to misdiagnosis and call into question the overall validity of neuropsychological assessment among children and families. This underscores the ultimate takeaway for clinicians, which is to understand the numbers from the tests in a multilayered context of symptoms, cultural heritage, linguistic diversity, and broader socio-economic factors.
One area that we identify as particularly important in the coming years is the treatment of linguistically diverse individuals (see also Manly, 2008). Many of the papers reviewed above showed limitations in the generalizability of verbal or language-related tests and instruction interpretation. When working with linguistically diverse children and adolescents, clinicians are advised to use interpreters who are well trained in cultural influences on neuropsychological assessment as well as in bilingual psychometrics. Whenever possible, testing should be performed in the client’s dominant language with correspondingly translated tests and norms (e.g., with Spanish-speaking clients, use of the WISC-V Spanish version is appropriate).
Another recommendation for clinicians is to evaluate the psychometric properties of the assessments they use and consider the assessment context on an individual basis. In particular, the testing context can evoke stress and enhanced vigilance, particularly among members of stigmatized groups (e.g., due to the potential for discrimination and threat). Stress and cognitive reserve are independently associated with neuropsychological performance in healthy elderly adults (Cabral et al., 2016). And, while past research shows stress can, unintuitively, reduce Stroop interference through reduced cue utilization (Booth, 2019), Richeson and Trawalter (2005) further demonstrate that interracial contact can lead to greater Stroop interference due to self-regulatory demands. This has implications for record keeping during assessment and suggests that it is important to consider whether the test administrator shares social identities with the child and whether those dynamics may have influenced performance. This also makes the case for repeat testing when resources allow and for expanding the diversity of test administrators.
Evaluating an Assessment: the Test of Memory Malingering
One of the most well-validated and frequently used performance validity tests (PVTs) is the Test of Memory Malingering (TOMM; Schroeder et al., 2016; Tombaugh, 1996), used by 75–78% of North American neuropsychologists (Martin et al., 2015). The normative data for the TOMM were taken from a community sample of English speakers aged 16–84 years, with education levels of 8–21 years, in addition to a clinical sample. Similarly, Rees et al. (1998) conducted a series of experiments to validate the TOMM in several groups (college students, people with traumatic brain injury, and cognitively impaired older adults), which broadens the bandwidth of generalizing the TOMM. However, these early validation studies failed to report any ethnicity-related information for their samples, which significantly compromised a recent meta-analysis of the TOMM (Martin et al., 2019).
Given the widespread use of the TOMM, it has now been cross-culturally validated in diverse samples (Grabyan et al., 2017; Wisdom et al., 2012)—for example, in Romanian (Crisan & Erdodi, 2022) and Singaporean clinical samples (Chan et al., 2021). However, there are some mixed results regarding the influence of demographics on TOMM performance. Nijdam-Jones et al. (2019) analyzed TOMM performance in a sample from Colombia whose participants varied in years of education and age (ranging from 18 to 84 years). Participants who lacked formal education scored significantly lower than participants with formal education (12 years or more) and showed invalid performance on TOMM Trial 1. In addition to education, age also influenced TOMM performance: younger participants had significantly higher and more valid TOMM scores than older participants. And, in another study in which participants came from seven different countries in Latin America, age (18–95 years) and education continued to influence TOMM performance (Rivera et al., 2015).
While the TOMM has been examined in pediatric samples, there has been limited targeted work examining demographic differences (e.g., language fluency) in younger samples. Some work with pediatric samples suggests a lack of validity with children younger than 5 years of age with ADHD, particularly on the Retention trial (Schneider et al., 2014). And, although this study showed that performance on the TOMM was affected by age (4–7 years old), 93% of the sample reached performance comparable to adults by Trial 2, which is often used for interpretation. In terms of other age effects, children aged 5–7 performed better than those aged 4 on Trials 2 and 3; however, there was no difference between these age groups for Trial 1. Similarly, a recent study examining associations between TOMM scores and demographics found no evidence for differential effects across age, gender, ethnicity, and English proficiency, among the many other demographics assessed (Ku et al., 2020). Moreover, a cross-cultural study (USA and Cyprus) has also validated the robustness of the TOMM relative to variability in demographics (Constantinou & McCaffrey, 2003). However, the influence of demographics on TOMM performance in a pediatric sample was not the primary research question in these articles; hence, further research on the validity of TOMM use is warranted.
The purpose of reviewing this literature is to accentuate the importance of reporting the full extent of the sample demographics used to validate a test, so that clinicians can evaluate the applicability of the test to the client. While TOMM validity has been extensively researched across samples, we see two paths through which generalizability can be improved. First, we highly recommend diversifying samples. According to the American Academy of Clinical Neuropsychology’s Relevance 2050 Initiative, testers will be unequipped to serve 60% of the population due to an increase in non-primary English speakers and non-European Americans. Hence, it is imperative to increase sample diversity to broaden the generalizability horizon and facilitate health equity. Additionally, the TOMM as a PVT taps into non-credible effort. Past work suggests that effort can vary as a function of how cultures are valued (or devalued) in the context of the Westernization of neuropsychological assessment. For instance, in interviews with Māori participants in New Zealand, two prevalent themes were the experience of cultural invisibility and a cultural divide between the client and neuropsychologists (Dudley et al., 2014). Clients who feel low cultural safety may feel less motivated to put in credible effort, therefore showing poorer performance on the TOMM and increasing the likelihood of adverse conclusions (e.g., malingering). Thus, although it is important to know whether non-credible effort is an issue, it is also practically meaningful to know why it might be an issue. One recent study aimed at bridging this gap in the literature showed that the Pediatric PVT Suite (PdPVTS; McCaffrey et al., 2020) pass/fail classifications were not correlated with gender or race/ethnicity (McCaffrey et al., 2023).
This might be a fruitful area for future research, which would provide avenues to improve neuropsychological testing instruction and preparation as well as integrate cultural variation in values and counteract medical distrust.
Conclusion
The generalizability of findings in neuropsychology has implications both for adequate theory building (e.g., understanding of human behavior) and practice (e.g., appropriate diagnostic information). Despite increasing calls for cultural competence in neuropsychology, additional resources are needed to expand the research available to inform culturally competent practice. One piece of this involves the generalizability of assessments and research findings. To improve the extent to which the field incorporates multiple perspectives and samples, several practices are needed from research design to analysis to clinical practice and diagnosis.
To aid in this goal, we reviewed a set of concrete recommendations that we believe would enhance the generalizability of neuropsychology, with a focus on assessment and measurement. Importantly, our recommendations are far from exhaustive and, as mentioned previously, incorporating every recommendation may be an unfeasible goal. Furthermore, generalizability goals are not a static endpoint, but rather a continuous process: the practices a field engages in must continue to develop in line with a continually changing society. As an example, at present, there is virtually no known data to address the normative inclusion of variations in gender identity; instead, cisgender is the default. What is unclear is whether this would affect the utility of testing with individuals who are not cisgender, including differences in mean scores, variability in scores, measurement properties, equivalence in content, or comfort with the testing environment. While this is a growing area (e.g., one recent study found neuropsychologists tend to underutilize affirming practices for LGBTQ + clients; Correro et al., 2022), research will need to examine these and many other issues. As our understanding and practices evolve, clinicians and researchers will have to consider many previously neglected factors that may prove critical to the generalizability of any given psychological construct. We aimed to provide a foundational knowledge base and toolbox from which researchers and clinicians can draw, subsequently improving the field’s approach to these issues.
Notes
In assessing approaches to and evaluations of generalizability in psychology, we conducted a literature review using the following primary keywords: generalizability, external validity, and sample diversity. We further aimed to include perspectives that approached generalizability from an assessment standpoint and searched the following terms: generalizability in (pediatric) neuropsychology, cultural competency, and assessment generalizability. Citation decisions for our review were based on establishing an approximately equal balance between highly cited works and newer, recent works. The first author has experience in assessment and cultural competency research, and the second author has expertise in generalizability-related research and sample demographics in psychology more broadly. These two perspectives informed both our review of the literature and our synthesis of recommendations/best practices.
References
Alloy, L. B., Abramson, L. Y., & Francis, E. L. (1999). Do negative cognitive styles confer vulnerability to depression? Current Directions in Psychological Science, 8(4), 128–132. https://doi.org/10.1111/1467-8721.00030
American Association for Clinical Neuropsychology (AACN). (2021). Position statement on use of race as a factor in neuropsychological test norming and performance prediction. Retrieved April 26, 2023 from: https://theaacn.org/wp-content/uploads/2021/11/AACN-Position-Statement-on-Race-Norms.pdf
American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). Standards for educational and psychological testing. Washington, DC: AERA
Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board task force report. American Psychologist, 73(1), 3–25. https://doi.org/10.1037/amp0000191
Ardila, A. (2005). Cultural values underlying psychometric cognitive testing. Neuropsychology Review, 15(4), 185–195. https://doi.org/10.1007/s11065-005-9180-y
Arnett, J. J. (2009). The neglected 95%, a challenge to psychology’s philosophy of science. American Psychologist, 64(6), 571–574. https://doi.org/10.1037/a0016723
Avila, J. F., Rentería, M. A., Witkiewitz, K., Verney, S. P., Vonk, J. M. J., & Manly, J. J. (2020). Measurement invariance of neuropsychological measures of cognitive aging across race/ethnicity by sex/gender groups. Neuropsychology, 34(1), 3–14. https://doi.org/10.1037/neu0000584
Boake, C. (2002). From the Binet-Simon to the Wechsler–Bellevue: Tracing the history of intelligence testing. Journal of Clinical and Experimental Neuropsychology, 24(3), 383–405. https://doi.org/10.1076/jcen.24.3.383.981
Bonett, D. G. (2012). Replication-extension studies. Current Directions in Psychological Science, 21(6), 409–412. https://doi.org/10.1177/0963721412459512
Booth, R. W. (2019). Reduced Stroop interference under stress: Decreased cue utilisation, not increased executive control. Quarterly Journal of Experimental Psychology, 72(6), 1522–1529. https://doi.org/10.1177/1747021818809368
Buchanan, N. T., Perez, M., Prinstein, M. J., & Thurston, I. B. (2021). Upending racism in psychological science: Strategies to change how science is conducted, reported, reviewed, and disseminated. American Psychologist, 76(7), 1097–1112. https://doi.org/10.1037/amp0000905
Buijsman, R., Begeer, S., & Scheeren, A. M. (2022). ‘Autistic person’ or ‘person with autism’? Person-first language preference in Dutch adults with autism and parents. Autism. [advance online publication] https://doi.org/10.1177/13623613221117914
Byers-Heinlein, K., Bergmann, C., Davies, C., Frank, M. C., Hamlin, J. K., Kline, M., Kominsky, J. F., Kosie, J. E., Lew-Williams, C., Liu, L., Mastroberardino, M., Singh, L., Waddell, C. P. G., Zettersten, M., & Soderstrom, M. (2020). Building a collaborative psychological science: Lessons learned from ManyBabies 1. Canadian Psychology / Psychologie Canadienne, 61(4), 349–363. https://doi.org/10.1037/cap0000216
Cabral, J. C. C., Veleda, G. W., Mazzoleni, M., Colares, E. P., Neiva-Silva, L., & Neves, V. T. D. (2016). Stress and cognitive reserve as independent factors of neuropsychological performance in healthy elderly. Ciencia & Saude Coletiva, 21, 3499–3508. https://doi.org/10.1590/1413-812320152111.17452015
Call, C., Eckstrand, K., Kasparek, S. W., Boness, C. L., Blatt, L. R., Jamal-Orozco, N. P., ... & Foti, D. (2022). An ethics and social justice approach to collecting and using demographic data for psychological researchers. Perspectives on Psychological Science [advance online publication]. https://doi.org/10.1177/174569162211373
Cameron, J. J., & Stinson, D. A. (2019). Gender (mis)measurement: Guidelines for respecting gender diversity in psychological research. Social and Personality Psychology Compass, 13(11), e12506. https://doi.org/10.1111/spc3.12506
Chan, I., Shu Min Ong, I., & Gwee Dpsych, K. (2021). Validation of the Test of Memory Malingering in a clinical population from Singapore. International Journal of Forensic Mental Health, 20(1), 1–16. https://doi.org/10.1080/14999013.2020.1789244
Cheon, B. K., Melani, I., & Hong, Y. (2020). How USA-centric is psychology? An archival study of implicit assumptions of generalizability of findings to human nature based on origins of study samples. Social Psychological and Personality Science, 11(7), 928–937. https://doi.org/10.1177/1948550620927269
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9(2), 233–255.
Constantinou, M., & McCaffrey, R. J. (2003). Using the TOMM for evaluating children’s effort to perform optimally on neuropsychological measures. Child Neuropsychology, 9(2), 81–90. https://doi.org/10.1076/chin.9.2.81.14505
Correro, A. N., Hinrichs, K. L., Krishnan, M. C., Cottingham, M. E., Trittschuh, E. H., Parmenter, B. A., ... & Stelmokas, J. (2022). Neuropsychological assessment with lesbian, gay, bisexual, transgender, and queer/questioning (LGBTQ+) individuals: Practice, education, and training survey. The Clinical Neuropsychologist, 1–43. https://doi.org/10.1080/13854046.2022.2148379
Crişan, I., & Erdodi, L. (2022). Examining the cross-cultural validity of the Test of Memory Malingering and the Rey 15-item test. Applied Neuropsychology: Adult, 1–11. https://doi.org/10.1080/23279095.2022.2064753
De Jesús-Romero, R., Amani Holder-Dixon, A., & Lorenzo-Luaces, L. (2022). Reporting and representation of racial and ethnic diversity in randomized controlled trials of internet-based cognitive-behavioral therapy (iCBT) for depression. Retrieved from https://doi.org/10.31234/osf.io/kfnhm
Drotar, D. (1994). Psychological research with pediatric conditions: If we specialize, can we generalize? Journal of Pediatric Psychology, 19(4), 403–414. https://doi.org/10.1093/jpepsy/19.4.403
Dudley, M., Wilson, D., & Barker-Collo, S. (2014). Cultural invisibility: Māori people with traumatic brain injury and their experiences of neuropsychological assessments. New Zealand Journal of Psychology, 43(3), 14–21.
Elias, J., Zimak, E., Sherwood, A., MacDonald, B., Lozano, N., Long, J., & Larsen, A. D. (2021). Do parents implement pediatric neuropsychological report recommendations? The Clinical Neuropsychologist, 35(6), 1117–1133. https://doi.org/10.1080/13854046.2020.1720298
Fischer, R., & Karl, J. A. (2019). A primer to (cross-cultural) multi-group invariance testing possibilities in R. Frontiers in Psychology, 1507. https://doi.org/10.3389/fpsyg.2019.01507
Fisher, E. L., Zimak, E., Sherwood, A. R., & Elias, J. (2022). Outcomes of pediatric neuropsychological services: A systematic review. The Clinical Neuropsychologist, 36(6), 1265–1289. https://doi.org/10.1080/13854046.2020.1853812
Freimuth, V. S., Quinn, S. C., Thomas, S. B., Cole, G., Zook, E., & Duncan, T. (2001). African Americans’ views on research and the Tuskegee Syphilis Study. Social Science and Medicine, 52, 797–808. https://doi.org/10.1016/S0277-9536(00)00178-7
Fujii, D. E. M. (2018). Developing a cultural context for conducting a neuropsychological evaluation with a culturally diverse client: The ECLECTIC framework. The Clinical Neuropsychologist, 32(8), 1356–1392. https://doi.org/10.1080/13854046.2018.1435826
Gale, R., Fellner, K., Tomlinson, G., & Danto, D. (2022). Beyond appropriate norms: Cultural safety with indigenous people in Canadian neuropsychology. Journal of Concurrent Disorders. https://doi.org/10.54127/WOUM1515
Gara, M. A., Vega, W. A., Arndt, S., Escamilla, M., Fleck, D. E., Lawson, W. B., Lesser, I., Neighbors, H. W., Wilson, D. R., Arnold, L. M., & Strakowski, S. M. (2012). Influence of patient race and ethnicity on clinical assessment in patients with affective disorders. Archives of General Psychiatry, 69(6), 593–600. https://doi.org/10.1001/archgenpsychiatry.2011.2040
Gonzales, J. E., & Cunningham, C. A. (2015). The promise of pre-registration in psychological research. Psychological Science Agenda, 29(8), 2014–2017. Retrieved April 26, 2023 from: https://www.apa.org/science/about/psa/2015/08/pre-registration
Gould, S. J. (1996). The mismeasure of man. W. W. Norton & Company.
Grabyan, J. M., Collins, R. L., Alverson, W. A., & Chen, D. K. (2017). Performance on the Test of Memory Malingering is predicted by the number of errors on its first 10 items on an inpatient Epilepsy Monitoring Unit. The Clinical Neuropsychologist, 32(3), 468–478. https://doi.org/10.1080/13854046.2017.1368715
Gross, D., Julion, W., & Fogg, L. (2001). What motivates participation and dropout among low-income urban families of color in a prevention intervention? Family Relations, 50, 246–254. https://doi.org/10.1111/j.1741-3729.2001.00246.x
Guthrie, R. V. (2004). Even the rat was white: A historical view of psychology (2nd ed.). London: Pearson Education.
Haag, H. L., Biscardi, M., Smith, N. N., MacGregor, N., & Colantonio, A. (2019). Traumatic brain injury and intimate partner violence: Addressing knowledge and service gaps among indigenous populations in Canada. Brain Impairment, 20(2), 197–210. https://doi.org/10.1017/BrImp.2019.16
Haeffel, G. J., & Cobb, W. R. (2022). Tests of generalizability can diversify psychology and improve theories. Nature Reviews Psychology, 1(4), 186–187. https://doi.org/10.1038/s44159-022-00039-x
Hall, K. L., Vogel, A. L., Huang, G. C., Serrano, K. J., Rice, E. L., Tsakraklides, S. P., & Fiore, S. M. (2018). The science of team science: A review of the empirical evidence and research gaps on collaboration in science. American Psychologist, 73(4), 532–548. https://doi.org/10.1037/amp0000319
Hamner, T., Salorio, C. F., Kalb, L., & Jacobson, L. A. (2022). Equivalency of in-person versus remote assessment: WISC-V and KTEA-3 performance in clinically referred children and adolescents. Journal of the International Neuropsychological Society, 28(8), 835–844. https://doi.org/10.1017/S1355617721001053
Hedden, T., Ketay, S., Aron, A., Markus, H. R., & Gabrieli, J. D. (2008). Cultural influences on neural substrates of attentional control. Psychological Science, 19(1), 12–17. https://doi.org/10.1111/j.1467-9280.2008.02038.x
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2–3), 61–83. https://doi.org/10.1017/s0140525x0999152x
Henry, P. J. (2008). College sophomores in the laboratory redux: Influences of a narrow data base on social psychology’s view of the nature of prejudice. Psychological Inquiry, 19(2), 49–71. https://doi.org/10.1080/10478400802049936
Hirschfeld, G., & Von Brachel, R. (2014). Improving multiple-group confirmatory factor analysis in R-A tutorial in measurement invariance with continuous and ordinal indicators. Practical Assessment, Research, and Evaluation, 19(1), 7. https://doi.org/10.7275/qazy-2946
Insel, T. R., Landis, S. C., & Collins, F. S. (2013). The NIH brain initiative. Science, 340(6133), 687–688. https://doi.org/10.1126/science.1239276
Israel B. A., Eng E., Schulz A. J., & Parker E. A. (2005). Introduction to methods in community-based participatory research for health. In Israel B. A., Eng E., Schulz A. J., & Parker E. A. (Eds.), Methods in community-based participatory research for health (Vol. 3, pp. 3–26). Hoboken: Jossey-Bass.
Jaiswal, J., & Halkitis, P. N. (2019). Towards a more inclusive and dynamic understanding of medical mistrust informed by science. Behavioral Medicine, 45(2), 79–85. https://doi.org/10.1080/08964289.2019.1619511
Kamphaus, R. W., Petoskey, M. D., & Rowe, E. W. (2000). Current trends in psychological testing of children. Professional Psychology: Research and Practice, 31(2), 155–164. https://doi.org/10.1037/0735-7028.31.2.155
Kaufman, N. K., Rohde, P., Seeley, J. R., Clarke, G. N., & Stice, E. (2005). Potential mediators of cognitive-behavioral therapy for adolescents with comorbid major depression and conduct disorder. Journal of Consulting and Clinical Psychology, 73(1), 38–46. https://doi.org/10.1037/0022-006X.73.1.38
Kinsella, G., Prior, M., Sawyer, M., Murtagh, D., Eisenmajer, R., Anderson, V., Bryan, D., & Klug, G. (1995). Neuropsychological deficit and academic performance in children and adolescents following traumatic brain injury. Journal of Pediatric Psychology, 20(6), 753–767. https://doi.org/10.1093/jpepsy/20.6.753
Kiselica, A. M., Johnson, E., Lewis, K. R., & Trout, K. (2021). Examining racial disparities in the diagnosis of mild cognitive impairment. Applied Neuropsychology: Adult, 1–8. Advance online publication. https://doi.org/10.1080/23279095.2021.1976778
Kopp, B., Lange, F., & Steinke, A. (2019). The reliability of the Wisconsin Card Sorting Test in clinical practice. Assessment, 28(1). https://doi.org/10.1177/107319111986625
Koroshetz, W., Gordon, J., Adams, A., Beckel-Mitchener, A., Churchill, J., Farber, G., ... & White, S. (2018). The state of the NIH BRAIN initiative. Journal of Neuroscience, 38(29), 6427–6438.https://doi.org/10.1523/JNEUROSCI.3174-17.2018
Kranzler, J. H., Maki, K. E., Benson, N. F., Eckert, T. L., Floyd, R. G., & Fefer, S. A. (2020). How do school psychologists interpret intelligence tests for the identification of specific learning disabilities? Contemporary School Psychology, 24(4), 445–456. https://doi.org/10.1007/s40688-020-00274-0
Ku, W. E., Oliveira, J. S., Cook, N. E., McCurdy, K., Kavanaugh, B., Cancilliere, M. K., & Holler, K. A. (2020). Assessing performance validity with the TOMM and automatized sequences task in a pediatric psychiatric inpatient setting. Child Neuropsychology., 26(6), 801–816. https://doi.org/10.1080/09297049.2020.1712345
Lange, F. (2019). Are difficult-to-study populations too difficult to study in a reliable way? European Psychologist. https://doi.org/10.1027/1016-9040/a000384
Ledgerwood, A., Soderberg, C., & Sparks, J. (2017). Designing a study to maximize informational value. In J. Plucker & M. Makel (Eds.), Toward a more perfect psychology: Improving trust, accuracy, and transparency in research (pp. 33–58). American Psychological Association.
Manly, J. J. (2008). Critical issues in cultural neuropsychology: Profit from diversity. Neuropsychology Review, 18, 179–183. https://doi.org/10.1007/s11065-008-9068-8
Martin, P. K., Schroeder, R. W., & Odland, A. P. (2015). Neuropsychologists’ validity testing beliefs and practices: A survey of North American professionals. The Clinical Neuropsychologist, 29(6), 741–776. https://doi.org/10.1080/13854046.2015.1087597
Martin, P. K., Schroeder, R. W., Olsen, D. H., Maloy, H., Boettcher, A., Ernst, N., & Okut, H. (2019). A systematic review and meta-analysis of the Test of Memory Malingering in adults: Two decades of deception detection. The Clinical Neuropsychologist, 34(1), 88–119. https://doi.org/10.1080/13854046.2019.1637027
McCaffrey, R. J., Reynolds, C. R., Lynch, J. K., et al. (2023). Assessment of cultural bias on the PdPVTS across gender and racial/ethnic groups. Journal of Pediatric Neuropsychology, 9, 18–28. https://doi.org/10.1007/s40817-022-00133-1
McCaffrey, R. J., Lynch, J. K., Leark, R. A., & Reynolds, C. R. (2020). Pediatric performance validity test suite: Technical manual. Multi-Health Systems Inc.
Medina, L. D., Torres, S., Gioia, A., Lopez, A. O., Wang, J., & Cirino, P. T. (2021). Reporting of demographic variables in neuropsychological research: An update of O’Bryant et al.’s trends in the current literature. Journal of the International Neuropsychological Society, 27(5), 497–507. https://doi.org/10.1017/S1355617720001083
Millsap, R. E. (2010). Testing measurement invariance using item response theory in longitudinal data: An introduction. Child Development Perspectives, 4(1), 5–9. https://doi.org/10.1111/j.1750-8606.2009.00109.x
Monk, R., Whitehouse, A. J., & Waddington, H. (2022). The use of language in autism research. Trends in Neurosciences, 45(11). https://doi.org/10.1016/j.tins.2022.08.009
Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38(4), 379–387. https://doi.org/10.1037/0003-066X.38.4.379
Morgan, P. L., Staff, J., Hillemeier, M. M., Farkas, G., & Maczuga, S. (2013). Racial and ethnic disparities in ADHD diagnosis from kindergarten to eighth grade. Pediatrics, 132(1), 85–93. https://doi.org/10.1542/peds.2012-2390
Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., ... & Chartier, C. R. (2018). The psychological science accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501-515. https://doi.org/10.1177/2515245918797607
Myers, M. (2000). Qualitative research and the generalizability question: Standing firm with Proteus. The Qualitative Report, 4(3). https://doi.org/10.46743/2160-3715/2000.2925
National Institutes of Health. (2022, October 11). Inclusion of women and minorities as participants in research involving human subjects. U.S. Department of Health and Human Services. Retrieved April 26, 2023 from: https://grants.nih.gov/policy/inclusion/women-and-minorities.htm
Nazaribadie, M., Ghaleiha, A., Ahmadpanah, M., Mazdeh, M., Matinnia, N., & Zarabian, M. K. (2021). Metacognitive model of mindfulness can improve executive function in multiple sclerosis patients. Pakistan Journal of Medical and Health Sciences, 15(1), 590–597.
Nelson, C. M., & Oakes, L. M. (2021). “May i grab your attention?”: An investigation into infants’ visual preferences for handled objects using Lookit as an online platform for data collection. Frontiers in Psychology, 12, 733218. https://doi.org/10.3389/fpsyg.2021.733218
Nijdam-Jones, A., Rivera, D., Rosenfeld, B., & Arango-Lasprilla, J. C. (2019). The effect of literacy and culture on cognitive effort test performance: An examination of the test of memory malingering in Colombia. Journal of Clinical and Experimental Neuropsychology, 41(10), 1015–1023. https://doi.org/10.1080/13803395.2019.1644294
O’Bryant, S. E., Johnson, L. A., Barber, R. C., Braskie, M. N., Christian, B., Hall, J. R., Hazra, N., King, K., Kothapalli, D., Large, S., Mason, D., Matsiyevskiy, E., McColl, R., Nandy, R., Palmer, R., Petersen, M., Philips, N., Rissman, R. A., Shi, Y., Toga, A. W., … HABLE Study Team (2021). The Health & Aging Brain among Latino elders (HABLE) study methods and participant characteristics. Alzheimer’s & dementia, 13(1), e12202. https://doi.org/10.1002/dad2.12202
Okazaki, S., & Sue, S. (1998). Methodological issues in assessment research with ethnic minorities. Readings in ethnic psychology, 26–40. https://doi.org/10.1037/1040-3590.7.3.367
Parsons, S., Kruijt, A. W., & Fox, E. (2019). Psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. Advances in Methods and Practices in Psychological Science, 2(4), 378–395. https://doi.org/10.1177/251524591987969
Pearson. (2018). WISC-V efficacy research report. Retrieved April 26, 2023 from: https://www.pearson.com/content/dam/one-dot-com/one-dot-com/global/Files/efficacy-and-research/reports/efficacy-assessment-reports/WISC-V-research-report.pdf
Pedraza, O., Edgar, V. B., & Romero, R. A. (2020). In Clinical cultural neuroscience: An integrative approach to cross-cultural neuropsychology (pp. 215–232). Oxford University Press.
Possin, K. L., Tsoy, E., & Windon, C. C. (2021). Perils of race-based norms in cognitive testing. JAMA Neurology, 78(4), 377. https://doi.org/10.1001/jamaneurol.2020.4763
Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90. https://doi.org/10.1016/j.dr.2016.06.004
Raad, J. M., Bellinger, S., McCormick, E., Roberts, M. C., & Steele, R. G. (2008). Brief report: Reporting practices of methodological information in four journals of pediatric and child psychology. Journal of Pediatric Psychology, 33(7), 688–693. https://doi.org/10.1093/jpepsy/jsm130
Rees, L. M., Tombaugh, T. N., Gansler, D. A., & Moczynski, N. P. (1998). Five validation experiments of the Test of Memory Malingering (TOMM). Psychological Assessment, 10(1), 10–20. https://doi.org/10.1037/1040-3590.10.1.10
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114(3), 552–566. https://doi.org/10.1037/0033-2909.114.3.552
Remedios, J. D. (2022). Psychology must grapple with Whiteness. Nat Rev Psychol, 1, 125–126. https://doi.org/10.1038/s44159-022-00024-4
Requena-Komuro, M. C., Jiang, J., Dobson, L., Benhamou, E., Russell, L., Bond, R. L., ... & Hardy, C. J. (2022). Remote versus face-to-face neuropsychological testing for dementia research: A comparative study in people with Alzheimer’s disease, frontotemporal dementia and healthy older individuals. BMJ open, 12(11), e064576. https://doi.org/10.1136/bmjopen-2022-064576
Richeson, J. A., & Trawalter, S. (2005). Why do interracial interactions impair executive function? A resource depletion account. Journal of Personality and Social Psychology, 88(6), 934–947. https://doi.org/10.1037/0022-3514.88.6.934
Rivera, D., Perrin, P. B., Weiler, G., Ocampo-Barba, N., Aliaga, A., Rodríguez, W., Rodríguez-Agudelo, Y., Aguayo, A., Longoni, M., Trapp, S., Esenarro, L., & Arango-Lasprilla, J. C. (2015). Test of Memory Malingering (TOMM): Normative data for the Latin American Spanish speaking adult population. NeuroRehabilitation, 37(4), 719–735. https://doi.org/10.3233/nre-151287
Rivera-Mindt, M., Byrd, D., Saez, P., & Manly, J. (2010). Increasing culturally competent neuropsychological services for ethnic minority populations: A call to action. The Clinical Neuropsychologist, 24(3), 429–453. https://doi.org/10.1080/13854040903058960
Roberts, S. O., Bareket-Shavit, C., Dollins, F. A., Goldie, P. D., & Mortenson, E. (2020). Racial inequality in psychological research: Trends of the past and recommendations for the future. Perspectives on Psychological Science, 15(6), 1295–1309. https://doi.org/10.1177/1745691620927709
Roberts, S. O., & Mortenson, E. (2022). Challenging the White= neutral framework in psychology. Perspectives on Psychological Science [advance online publication] https://doi.org/10.1177/17456916221077117
Saban, W., & Ivry, R. B. (2021). PONT: A protocol for online neuropsychological testing. Journal of Cognitive Neuroscience, 33(11), 2413–2425. https://doi.org/10.1162/jocn_a_01767
Salinas, C. M., Bordes Edgar, V., Berrios Siervo, G., & Bender, H. A. (2020). Transforming pediatric neuropsychology through video-based teleneuropsychology: An innovative private practice model pre-COVID-19. Archives of Clinical Neuropsychology, 35(8), 1189–1195. https://doi.org/10.1093/arclin/acaa101
Salter, P., & Adams, G. (2013). Toward a critical race psychology. Social and Personality Psychology Compass, 7(11), 781–793. https://doi.org/10.1111/spc3.12068
Schatz, P., Jay, K. A., McComb, J., & McLaughlin, J. R. (2005). Misuse of statistical tests in Archives of Clinical Neuropsychology publications. Archives of Clinical Neuropsychology, 20(8), 1053–1059. https://doi.org/10.1016/j.acn.2005.06.006
Schneider, H. E., Kirk, J. W., & Mahone, E. M. (2014). Utility of the Test of Memory Malingering (TOMM) in children ages 4–7 years with and without ADHD. The Clinical Neuropsychologist, 28(7), 1133–1145. https://doi.org/10.1080/13854046.2014.960004
Schoen, J. (2005). Choice and coercion: Birth control, sterilization, and abortion in public health and welfare. University of North Carolina Press.
Schrank, F. A., McGrew, K. S., & Mather, N. (2014). Woodcock-Johnson Tests of Cognitive Abilities (4th ed.). Riverside Publishing.
Schroeder, R. W., Martin, P. K., & Odland, A. P. (2016). Expert beliefs and practices regarding neuropsychological validity testing. The Clinical Neuropsychologist, 30(4), 515–535. https://doi.org/10.1080/13854046.2016.1177118
Scott, K., & Schulz, L. (2017). Lookit (part 1): A new online platform for developmental research. Open Mind, 1(1), 4–14. https://doi.org/10.1162/OPMI_a_00002
Sears, D. O. (1986). College sophomores in the laboratory: Influences of a narrow data base on social psychology’s view of human nature. Journal of Personality and Social Psychology, 51(3), 515. https://doi.org/10.1037/0022-3514.51.3.515
Settles, I. H., Warner, L. R., Buchanan, N. T., & Jones, M. K. (2020). Understanding psychology’s resistance to intersectionality theory using a framework of epistemic exclusion and invisibility. Journal of Social Issues, 76(4), 796–813. https://doi.org/10.1111/josi.12403
Sheskin, M., & Keil, F. (2018). TheChildLab.com a video chat platform for developmental research. PsyArXiv [Preprint]. https://doi.org/10.31234/osf.io/rn7w5
Sheskin, M., Scott, K., Mills, C. M., Bergelson, E., Bonawitz, E., Spelke, E. S., ... & Schulz, L. (2020). Online developmental science to foster innovation, access, and impact. Trends in Cognitive Sciences, 24(9), 675–678. https://doi.org/10.1016/j.tics.2020.06.004
Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12(6), 1123–1128. https://doi.org/10.1177/17456916177086
Smedley, A., & Smedley, B. D. (2005). Race as biology is fiction, racism as a social problem is real: Anthropological and historical perspectives on the social construction of race. American Psychologist, 60(1), 16–26. https://doi.org/10.1037/0003-066X.60.1.16
Souissi, S., Chamari, K., & Bellaj, T. (2022). Assessment of executive functions in school-aged children: A narrative review. Frontiers in Psychology, 13. https://doi.org/10.3389/fpsyg.2022.991699
Spano, P., Katz, N., DeLuco, T., Martin, C. O., Tam, H., Montalto, D., & Stein, C. R. (2021). Parent perceptions of pediatric neuropsychological evaluations: A systematic review. Child Neuropsychology, 27(7), 922–948. https://doi.org/10.1080/09297049.2021.1908980
Spates, K. (2012). “The missing link” the exclusion of Black women in psychological research and the implications for Black women’s mental health. SAGE Open, 2(3), 2158244012455179.
Steinmetz, H. (2013). Analyzing observed composite differences across groups: Is partial measurement invariance enough? Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 9(1), 1–12. https://doi.org/10.1027/1614-2241/a000049
Terrell, F., Terrell, S. L., & Taylor, J. (1981). Effects of race of examiner and cultural mistrust on the WAIS performance of Black students. Journal of Consulting and Clinical Psychology, 49(5), 750–751. https://doi.org/10.1037/0022-006X.49.5.750
Tombaugh, T. N. (1996). TOMM: Test of Memory Malingering. Multi-Health Systems.
Tuokko, H. A., Chou, P. H. B., Bowden, S. C., Simard, M., Ska, B., & Crossley, M. (2009). Partial measurement equivalence of French and English versions of the Canadian Study of Health and Aging neuropsychological battery. Journal of the International Neuropsychological Society, 15(3), 416–425. https://doi.org/10.1017/S1355617709090602
van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9(4), 486–492. https://doi.org/10.1080/17405629.2012.686740
Visser, I., Bergmann, C., Byers-Heinlein, K., Dal Ben, R., Duch, W., Forbes, S., ... Schuwerk, T., & Zettersten, M. (2022). Improving the generalizability of infant psychological research: The ManyBabies model. Behavioral and Brain Sciences, 45, e35. https://doi.org/10.1017/S0140525X21000455
Wasserman, J. D. (2018). A history of intelligence assessment: The unfinished tapestry. In D. P. Flanagan & E. M. McDonough (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 3–55). The Guilford Press.
Wechsler, D. (2014). WISC-V: Technical and interpretive manual. Bloomington, MN: Pearson.
Wisdom, N. M., Brown, W. L., Chen, D. K., & Collins, R. L. (2012). The use of all three Test of Memory Malingering trials in establishing the level of effort. Archives of Clinical Neuropsychology, 27(2), 208–212. https://doi.org/10.1093/arclin/acr107
Yeates, K. O. (2022). Toward a more open and diverse science in neuropsychology [Editorial]. Neuropsychology, 36(1), 1–3. https://doi.org/10.1037/neu0000790
Yin, P., & Fan, X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60(2), 201–223. https://doi.org/10.1177/00131640021970466
Ethics declarations
Competing Interests
The authors declare no competing interests.
Cite this article
Malik, H.B., Norman, J.B. Best Practices and Methodological Strategies for Addressing Generalizability in Neuropsychological Assessment. J Pediatr Neuropsychol 9, 47–63 (2023). https://doi.org/10.1007/s40817-023-00145-5