Online surveys represent one of the most widely used, efficient, and valuable methods for gathering psychological data across many different research contexts. Despite the many benefits afforded by surveys, research shows that a nontrivial number of people who complete surveys engage in careless responding (CR) such that their responses are neither accurate nor valid (e.g., Arthur et al., 2020; DeSimone & Harms, 2018; Meade & Craig, 2012). CR can distort research/survey findings, yield false theoretical conclusions, and lead to misinformed data-driven decisions within numerous applied settings. Consequently, failing to understand and account for the effects of CR can be harmful for researchers and practitioners alike.

With surveys, it is common to model responses with latent variable techniques—such as confirmatory factor analysis (CFA) and item response theory (IRT)—and use model fit indices to infer how well the statistical model represents/aligns with the underlying data. Unfortunately, the research to date that has been conducted on the relation between CR and fit has yielded inconsistent findings. For example, some research shows that CR deteriorates model fit (Arias et al., 2020; Huang et al., 2012; Woods, 2006), while other research shows that the effects of CR on fit are somewhat negligible (Kam, 2019; Kam & Meyer, 2015; Schneider et al., 2018; Steedle et al., 2019), and still other research shows that CR may sometimes improve fit (Goldammer et al., 2020; see Beck et al., 2019 and Liu et al., 2019 for further illustrations of these variable effects). It is difficult to draw a firm conclusion about the CR–fit linkage, however, given that existing research has employed different research designs, definitions of CR, model fit indices, and other study features (e.g., sample sizes, substantive scales). Despite some of these initial—albeit contradictory and highly variable—research findings, many important questions about the relation between CR and model fit remain unanswered: Does CR deteriorate, improve, or not impact model fit? If CR does impact fit, in which situations are these effects largest? Are such effects impacted by the nature of CR (e.g., the type of CR) or other study characteristics (e.g., sample size)? Moreover, does the impact of CR on fit depend on the latent variable method (e.g., CFA vs. IRT) or model fit index (e.g., χ2 vs. root mean square error of approximation [RMSEA] vs. standardized root mean square residual [SRMSR]) that is used? Is it possible for models with “good” fit (e.g., RMSEA < .08) to be laden with CR? If so, what implications does this have for model interpretation and validity?

To answer each of these important questions, a comprehensive simulation study was conducted. A total of 144 conditions (which varied sample size, number of items, CR prevalence, CR severity, and CR type), two latent variable models (IRT and CFA), and six model fit indices (χ2, RMSEA, and SRMSR [for CFA models] and M2, RMSEA, and SRMSR [for IRT models]) were examined. To embed realism into the simulation (see Harwell et al., 1996), an initial study (study 1) was conducted whereby participants’ response behaviors were experimentally shaped. These responses and associated parameters were then used to generate empirically informed data (i.e., responses) for portions of the main simulation study (study 2).

Defining careless responding

CR, also known as “insufficient effort responding” (IER) (e.g., Huang et al., 2012) or “random responding” (e.g., Credé, 2010), refers to when participants complete surveys with little accuracy or regard to item content such that their responses are not indicative of their true standing on the underlying latent construct being assessed, thus undermining the validity of the measure (Borsboom et al., 2004; Meade & Craig, 2012). Estimates of the prevalence of CR vary widely and may range anywhere from 2% to 50% of a sample (e.g., Hong et al., 2020; Meade & Craig, 2012) depending on how CR is measured and what thresholds are utilized (DeSimone & Harms, 2018). CR can be distinguished from other survey response behaviors such as satisficing, faking, and response styles. For instance, survey satisficing occurs when people use minimal, but still sufficient, effort to respond to survey items (Krosnick, 1991). Moreover, while faking (e.g., Birkeland et al., 2006) also yields inaccurate information about a respondent’s true standing on a construct, this type of responding is very effortful as respondents are attempting to convey a certain level of the construct. Lastly, response styles (e.g., Weijters et al., 2008) refer to a tendency for someone to endorse a specific scale range (e.g., tending to agree with items), and this may or may not be indicative of CR since people could still be responding effortfully.

To provide further definitional clarity and ensure conceptual rigor (see Podsakoff et al., 2016), CR can be contrasted with Tourangeau et al.’s (2000) cognitive theory of optimal survey responding. This prominent theory of survey responding states that respondents progress through a series of steps when responding to survey items. Respondents first comprehend the item (step 1), generate a retrieval strategy/extract the required information (step 2), map their judgment onto the response category based on the information extracted from the previous step (step 3), and then finally provide a response to the item (step 4). If all steps are sufficiently enacted, respondents are employing an optimizing strategy, whereas if steps are not sufficiently enacted (e.g., respondents progress through steps too rapidly or skip some steps entirely), participants are using a satisficing strategy. While responding in either of these ways will yield valid responses, validity is highest when all steps are enacted.

When respondents engage in CR, however, responses contain little to no validity. A precise specification of CR and how this may be conceptually integrated with Tourangeau et al.’s (2000) cognitive theory of survey response, survey satisficing, and response validity is presented in Fig. 1. CR occurs when too many cognitive steps are bypassed but respondents nonetheless provide responses to survey items. In the context of Tourangeau et al.’s (2000) theory, this could entail bypassing the comprehension step (step 1) altogether. This aligns with the view that CR occurs when people complete surveys without regard to item content (Meade & Craig, 2012). In such cases, responses likely maintain no degree of validity since they are not caused by the underlying construct (Borsboom et al., 2004). Furthermore, it may also be possible for respondents to tentatively enact step 1, but then for various reasons (e.g., boredom) leave this step early and respond carelessly. Such responses would possess very low validity since the cognitive steps were insufficiently enacted to produce a minimally valid response. These two possibilities are illustrated by the bottom (dashed) arrows in Fig. 1. Further support for these considerations can be derived from studies showing that careless responders often respond to surveys more quickly than careful responders (Bowling et al., 2021b; Wood et al., 2017). Presumably, neglecting the cognitive steps needed for optimal responding decreases the time that it takes to respond to survey items as fewer steps are being enacted. As noted above, respondents may proceed through all cognitive stages in a suboptimal manner or skip the latter stages altogether, which according to Krosnick (1991) would still represent a satisficing strategy since responses maintain minimal validity. Responses are most valid, however, when respondents proceed through all cognitive steps (as indicated by the arrows in the upper portion of Fig. 1).

Fig. 1 Conceptual integration of CR with Tourangeau et al.’s (2000) theory of survey responding. The solid arrows/paths specify either an optimizing or satisficing strategy while the dotted arrows/paths specify CR. Respondents may fully enact all steps (as indicated by the horizontal arrows/paths connecting the steps/boxes), exit any of the steps early (as indicated by the vertical arrows/paths that deviate from the steps), or skip steps altogether (as indicated by the various arrows/paths that skip steps 1–3 and go straight to step 4 [response])

Statistical and psychometric impact of careless responding

This sort of suboptimal (i.e., careless) responding can distort research results in numerous ways. For instance, various studies have demonstrated that, depending on certain features of a scale (e.g., the scale midpoint), small amounts of CR can either augment or attenuate the effect size of the relationship between substantive variables (e.g., Credé, 2010; Holden et al., 2019; Huang et al., 2015; Kam & Meyer, 2015; Maniaci & Rogge, 2014) as well as the relationship between surveys and objective measures (e.g., Huang & DeSimone, 2021). This occurs for correlation coefficients (Huang et al., 2015), regression coefficients (Maniaci & Rogge, 2014), and when relations are estimated via structural equation modeling frameworks (Kam & Meyer, 2015). This, in turn, can impact the significance of the relationship between variables and consequently increase or decrease type I and type II error rates. CR can also distort other psychometric results, such as factor loading estimates (Kam, 2019; Kam & Meyer, 2015), inter-item correlations (DeSimone et al., 2018), eigenvalues (DeSimone et al., 2018; Huang et al., 2012), and survey reliability estimates (e.g., Cronbach’s alpha; DeSimone et al., 2018).

Initial research also shows that CR affects model fit indices, though these effects are somewhat variable. For instance, some research has shown that when careless responders are screened and removed from a dataset, model fit improves (Arias et al., 2020; Huang et al., 2012). Within simulation contexts, studies have also shown that the presence of CR can deteriorate model fit under certain circumstances (Woods, 2006). Other studies, however, have found that the effects of CR on fit are negligible (Beck et al., 2019; Kam & Meyer, 2015; Kam, 2019; Schneider et al., 2018; Steedle et al., 2019), even when CR prevalence is high (e.g., Liu et al., 2019). In contrast, other research has found that when careless responders are compared to careful responders, model fit can be better for the careless responders (Goldammer et al., 2020). Because existing research has relied on different definitions of CR, CR indices, research designs, and fit indices, it is challenging to derive firm conclusions about the effects that CR has on fit.

The lack of knowledge about the CR–fit linkage is problematic given that model fit indices are crucial for evaluating the quality of latent variable models (e.g., CFA, IRT; Greiff & Heene, 2017; Hu & Bentler, 1999; Kline, 2016). For example, good-fitting models may contain invalid (i.e., careless) responses and poor-fitting models may contain valid (i.e., careful) responses. The former situation indicates that good fit is no guarantee that one’s results are meaningful. This is especially problematic when researchers and practitioners assume that good model fit is indicative of high-quality data, without considering the impact of CR. If model fit is positively related to CR (i.e., as CR increases, fit improves), this would make the model appear better than it really is, since fit would not account for the careless responses underlying the model. On the other hand, if CR and model fit are negatively related (i.e., as CR increases, fit deteriorates), this would suggest that CR contributes to poor fit, in addition to the typical reasons for poor fit (e.g., model misspecification), and would provide another compelling reason to address CR. Finally, CR and fit may be unrelated. If this is the case, this would suggest that fit indices are not a good proxy for CR, nor are CR indices good proxies for model fit.

Study overview

In summary, because a nontrivial number of respondents engage in CR and because model fit information is essential for making meaningful inferences about the appropriateness of latent variable models, knowledge of how CR affects fit is essential for both theory and practice. Accordingly, the primary goal of this study is to systematically examine how CR affects model fit via a realistic, rigorous, and comprehensive simulation study. A simulation method was used since this method is well-suited for situations when a high degree of control is needed, when the true properties of participants/measures are difficult to know (e.g., who is truly engaging in CR, which is never known with certainty), and when many manipulations are needed to understand the effects of a phenomenon (see Harwell et al., 1996). Another benefit of simulations is that replications can be used, which can avoid confounds that can impact single-study designs (e.g., sampling error). In the context of this study, a simulation method can be used to confidently specify the precise effects that CR has on model fit—something that would not be possible using a single-study design. To ensure the results of the present study are realistic and generalizable, an initial study (study 1), whereby participants’ response behaviors were experimentally shaped, was first conducted. The goal of this initial study was to embed the subsequent simulation study with a high degree of realism and ensure robust and generalizable findings that reflect commonly encountered survey situations.

Study 1

Participants

A total of 1346 participants from Amazon’s Mechanical Turk (MTurk) platform completed the study 1 survey and were compensated $0.75. Approximately 300 participants per condition were desired to properly conduct the analyses needed to inform the simulation study. Because it was anticipated that some participants would not adhere to the study instructions, the number of participants recruited for each condition was slightly oversampled (the manipulations and data cleaning procedures are described in detail below). Of these participants, 52% identified as men/male, 73% identified as white/Caucasian, 56% had a bachelor’s degree or higher, 70% were employed full-time, and the average age was 37 (SD = 11.26).

Design and procedures

Participants were randomly assigned to one of four conditions: (1) a careful responding condition, (2) a careless responding condition, (3) a random responding condition, or (4) a control condition. The first three conditions contained instructions to respond carefully, carelessly, or randomly, respectively; the control condition did not contain any instructions to respond in a particular manner. Separate conditions were included for random and careless responding to determine whether these different terms result in unique response behaviors. Similar to previous research examining strategies for reducing CR (e.g., Huang et al., 2012), participants in the careful condition were asked to respond as accurately as possible and warned that sophisticated statistical methods would be used to flag low-quality data. Participants were also reminded of the importance of having high-quality data for this study and asked to check a box that stated, “I agree to complete the following survey as accurately as possible.” Next, and in alignment with established methods for eliciting response behaviors (e.g., Zickar et al., 2004), participants in the careless and random conditions were instructed to respond as carelessly and randomly as possible, respectively (see supplemental materials for the experimental stimuli used within each condition).

For the primary survey, 60 conscientiousness items from the 300-item IPIP-NEO (International Personality Item Pool; Goldberg et al., 2006) were used. This particular scale contains six facets and consists of a mixture of reverse-coded and normally worded items. A five-point, Likert-type format ranging from 1 (Very Inaccurate) to 5 (Very Accurate) was used for all items. Two directed-response items (i.e., “Please select ‘Very Accurate’ for this item” and “Please select ‘Moderately Inaccurate’ for this item”) were embedded into the survey in the same location for all four conditions to provide a way to assess the effectiveness of the manipulations. After completing the main survey, all participants saw a separate screen where they were instructed to respond to all subsequent questions as accurately as possible. Participants then completed two other manipulation-check items. For the first item, participants were asked to indicate how accurately they completed the previous survey (i.e., “To what extent were your responses accurate descriptions of you?”). For the second item, participants saw a brief definition of conscientiousness and were then asked to indicate their level of conscientiousness based on this definition. Responses to both of these items were made on a sliding bar ranging from 0 to 100, with higher scores indicating higher accuracy and conscientiousness. This alternative format (i.e., sliding bar) was used to differentiate these items from the main survey items (i.e., the manipulations) and help facilitate accurate responses. A summary of hypothesized results for the various manipulation checks is provided in Table 1.

Table 1 Summary of the expected results for the manipulation checks

Results

Manipulation checks

To determine whether the manipulations were successful, the mean response accuracy ratings, the correlation between the two measures of conscientiousness, and the percentage of directed-response items answered incorrectly were calculated for each condition. These results are summarized in Table 2. First, to compare accuracy ratings across all four conditions, a one-way analysis of variance (ANOVA) was conducted, with “condition” serving as the independent variable (IV) and “accuracy” (i.e., the responses to the “accuracy” sliding bar described above) serving as the dependent variable (DV). The result of this analysis was significant, F(3, 1341) = 367.12, p < .001, indicating that there were differences in accuracy ratings across the four groups. To follow up on this, a Tukey honestly significant difference (HSD) post hoc test was conducted. This analysis revealed that participants in the careful condition had higher accuracy ratings (M = 96.04) than those in either the careless (M = 39.66, p < .01) or random (M = 39.23, p < .01) conditions. The results also indicated that those in the control condition had higher accuracy ratings (M = 94.73) than those in the careless (M = 39.66, p < .01) or random (M = 39.23, p < .01) conditions. Next, the correlation between the IPIP-NEO conscientiousness measure and the single-item conscientiousness sliding bar was computed for all four conditions. Overall, the correlations were large for both the careful (r = .58, p < .01) and control (r = .57, p < .01) conditions. In contrast, the correlations were small for the careless (r = .11, p < .05) and random (r = .21, p < .01) conditions. These correlation coefficients were formally compared using a series of Fisher r-to-z tests. The results of this analysis indicated that the correlation for the careful condition was significantly different than the correlation for the careless condition (Z = 7.12, p < .001) and the random condition (Z = 5.77, p < .001). Finally, the number of directed-response items answered incorrectly was computed for each condition. The results indicated that those in the careful and control conditions answered the same number of directed-response items incorrectly (item 1 = 2%; item 2 = 5%). Those in the careless and random conditions answered far more of these directed-response items incorrectly (careless: item 1 = 41%, item 2 = 40%; random: item 1 = 40%, item 2 = 39%).
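For readers who wish to reproduce this type of comparison, the sketch below shows the standard Fisher r-to-z procedure for comparing two independent correlations. The correlations are the values reported above, but the group sizes are illustrative placeholders rather than the exact post-cleaning ns.

```python
import numpy as np
from scipy import stats

def fisher_r_to_z_test(r1, n1, r2, n2):
    """Compare two independent correlations via Fisher's r-to-z transformation."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)       # Fisher z transforms of each correlation
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))                 # two-tailed p value
    return z, p

# Careful (r = .58) vs. careless (r = .11) conditions; n = 330 per group is illustrative only
z, p = fisher_r_to_z_test(0.58, 330, 0.11, 330)
print(f"Z = {z:.2f}, p = {p:.4f}")
```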

Table 2 Summary of the results for the manipulation checks

While the results of these analyses provide evidence that the manipulations were largely successful, the results also indicated that not all participants responded in the intended manner. Thus, to increase confidence in these findings and ensure each condition only contained responses from participants who adhered to the instructions, participants who clearly did not adhere to the instructions were screened out. Specifically, participants with accuracy ratings ≤ 85 or who answered either directed-response item incorrectly were removed from the careful and control conditions. Also, participants with an accuracy rating of 100 were removed from the careless and random conditions. As seen in Table 2, the post-cleaning pattern of results was very similar to the pre-cleaning pattern of results. That is, all findings were significant before and after cleaning and the effect sizes were highly similar pre/post cleaning. The slight changes in results can be attributed to the fact that respondents who did not follow the instructions were removed, which helps increase confidence in the appropriateness of the subsequent analyses.

Item response theory analysis

To derive parameters to inform the primary simulation (study 2)—which relies on an IRT-based simulation method—a series of IRT analyses were first conducted on the cleaned responses (Embretson & Reise, 2000). Because the survey used a polytomous rating format and the item response options are assumed to be ordinal and monotonic, the graded response model (GRM; Samejima, 1969), which is well-suited for assessing Likert-type data, was utilized.

Formally, the GRM is defined as:

$${P}_{ix}^{\ast}\left({\theta}_j\right)=\frac{\exp\left[{\alpha}_i\left({\theta}_j-{b}_{ix}\right)\right]}{1+\exp\left[{\alpha}_i\left({\theta}_j-{b}_{ix}\right)\right]}$$
(1)

\({P}_{ix}^{\ast}\left({\theta}_j\right)\) represents the probability of a person with a given theta (θ) level responding in category x or above (i.e., at or above option boundary x), where x indexes the response options, αi denotes the discrimination parameter for item i, and bix denotes the threshold parameter for category boundary x of item i. Formally, the probability of responding in each response option category, which is often referred to as a category response curve (CRC), is defined as:

$${P}_{ix}\left({\theta}_j\right)={P}_{ix}^{\ast}\left({\theta}_j\right)-{P}_{i\left(x+1\right)}^{\ast}\left({\theta}_j\right)$$
(2)

For a five-point, Likert-type scale, this can be further expressed for each response option as:

$${P}_{i1}\left({\theta}_j\right)=1-{P}_{i1}^{\ast}\left({\theta}_j\right)$$
(3)
$${P}_{i2}\left({\theta}_j\right)={P}_{i1}^{\ast}\left({\theta}_j\right)-{P}_{i2}^{\ast}\left({\theta}_j\right)$$
(4)
$${P}_{i3}\left({\theta}_j\right)={P}_{i2}^{\ast}\left({\theta}_j\right)-{P}_{i3}^{\ast}\left({\theta}_j\right)$$
(5)
$${P}_{i4}\left({\theta}_j\right)={P}_{i3}^{\ast}\left({\theta}_j\right)-{P}_{i4}^{\ast}\left({\theta}_j\right)$$
(6)
$${P}_{i5}\left({\theta}_j\right)={P}_{i4}^{\ast}\left({\theta}_j\right)-0$$
(7)
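To make Eqs. 1–7 concrete, the sketch below computes the boundary and category response probabilities for a single five-point item. It is a minimal numpy illustration of the GRM response function (not the estimation routine used in the analyses), with illustrative parameter values (a discrimination near the careful-condition average and an arbitrary ordered set of thresholds).

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category response probabilities for one graded response model item.

    theta : latent trait value(s), shape (n,)
    a     : item discrimination (alpha_i)
    b     : ordered category thresholds (b_i1, ..., b_i4 for a five-point item)
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    # Boundary probabilities P*_ix(theta) from Eq. 1, one column per threshold
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(b)[None, :])))
    # Pad with P* = 1 (below the lowest boundary) and P* = 0 (above the highest boundary)
    ones = np.ones((len(theta), 1))
    zeros = np.zeros((len(theta), 1))
    p_star = np.hstack([ones, p_star, zeros])
    # Category probabilities from Eqs. 2-7: adjacent differences of the boundary curves
    return p_star[:, :-1] - p_star[:, 1:]

# Illustrative item: a = 1.69 with ordered thresholds; each row of probs sums to 1
probs = grm_category_probs(theta=[-1.0, 0.0, 1.0], a=1.69, b=[-3.5, -2.3, -1.3, 0.8])
print(probs.round(3))
```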

Prior to deriving the final set of parameters to inform the simulation study, the fit of the IRT models was estimated (see Nye et al., 2019) via the following fit indices: M2, RMSEA, and SRMSR. These findings were supplemented by estimating a series of CFA models via maximum likelihood, which treated the ordinal responses as continuous. The following CFA fit indices were examined: χ2, RMSEA, and SRMSR. Unidimensional models were estimated for both the IRT and CFA analyses. The initial results indicated that the models did not demonstrate acceptable fit. An inspection of the source(s) of model misfit revealed that the reverse-coded items resulted in model misspecification, which is a well-documented issue when estimating latent variable models (e.g., Spector et al., 1997). Because there are methodological reasons for the removal of these items (Kline, 2016) and given that the goal of these analyses is to derive useful parameter and θ estimates to inform the subsequent simulation study, all models were estimated without the reverse-coded items from the IPIP-NEO conscientiousness measure. Model fit for the careful group demonstrated marginally acceptable fit for both the IRT (M2 = 1275.41, RMSEA = .10, SRMSR = .09) and CFA (χ2 = 1662.00, RMSEA = .10, SRMSR = .09) analyses. The fit of the control condition was very similar to the fit of the careful condition (M2 = 1206.94, RMSEA = .10, SRMSR = .09 [IRT] and χ2 = 1628.53, RMSEA = .10, SRMSR = .08 [CFA]). While the fit of these models is not optimal, it is important to note that the goal of the IRT analysis was not to derive excellent model fit per se, but to instead ensure that the models displayed sufficiently acceptable fit so as to produce a set of useful and realistic parameters (i.e., parameters indicative of specific response sets [careful, careless, random responding]) to inform aspects of the subsequent simulation (see Huang et al., 2012 for a similar argument regarding CR and fit in a different context). Upon estimating a series of acceptably fitting unidimensional models, the a and bi parameters (Table 3) as well as the b-to-b distances (which were used in the study 2 simulation) were computed for each condition (see the supplemental materials for b-to-b distance values and all scale-level and item-level estimates for all conditions).
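As a point of reference, the RMSEA and SRMSR values reported throughout can be computed from standard formulas once an estimation routine supplies a chi-square (or M2) statistic and the observed and model-implied correlation matrices. The sketch below uses one common parameterization (programs differ slightly, e.g., N versus N − 1 in the RMSEA denominator), so it should be read as an approximation rather than the exact computations of any particular package.

```python
import numpy as np

def rmsea(chi2, df, n):
    """Root mean square error of approximation from a chi-square (or M2) test of fit."""
    return np.sqrt(max((chi2 - df) / (df * (n - 1)), 0.0))

def srmsr(observed_corr, implied_corr):
    """Standardized root mean square residual over the unique off-diagonal correlations."""
    idx = np.tril_indices_from(observed_corr, k=-1)   # lower triangle, excluding the diagonal
    resid = observed_corr[idx] - implied_corr[idx]
    return np.sqrt(np.mean(resid ** 2))
```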

Table 3 Summary of IRT a and bi parameters for all four conditions

Study 2

The goal of study 1 was to derive parameters that could be used to generate specific response behaviors (i.e., careful, careless, and random responding) and embed aspects of the primary simulation with realism. As noted above, the goal of study 2 was to comprehensively examine the effects of CR on model fit across 144 unique conditions (which varied sample size, number of items, CR prevalence, CR severity, and CR type), two latent variable models (IRT and CFA), and six commonly used model fit indices (χ2, RMSEA, and SRMSR [for CFA models] and M2, RMSEA, and SRMSR [for IRT models]).

Method

Summary of conditions

This study manipulated five different IVs. These IVs were selected to be realistic and representative of situations that researchers and practitioners commonly encounter (Harwell et al., 1996). Moreover, the various levels of the IVs were selected due to their realism and ability to ensure the generalizability of the simulation findings across a range of frequently encountered survey contexts. These manipulations are described in detail below.

The first IV was sample size. Specifically, sample sizes of 200 and 500 were examined. These levels were selected for multiple reasons. First, because IRT generally requires relatively large samples (e.g., 500 participants; Embretson & Reise, 2000), knowledge of how CR impacts model fit in both suboptimal (e.g., 200) and sufficient (e.g., 500) sample-size conditions is important. Also, sample sizes of 200 and 500 are often the values selected to represent small and moderate/large sample sizes in CR simulations (Woods, 2006). Furthermore, a sample size of 200 is also just slightly higher than the median sample size that is commonly observed in organizational studies (e.g., Shen et al., 2011). Examining sample sizes of 200 and 500 is, therefore, consistent with this study’s goal of examining realistic, frequently encountered scenarios.

The second IV was number of items. Specifically, surveys with 10 and 60 items were examined. Examining the relation between CR and model fit on surveys with varying numbers of items is important given the variability in the number of items contained within psychological measures. Furthermore, because CR has been found to sometimes increase toward the end of surveys (e.g., Bowling et al., 2021a; Gibson & Bowling, 2019), knowledge of how CR affects model fit for both short (no. of items = 10) and long (no. of items = 60) surveys is important for examining the generalizability of the effects of CR on model fit.

The next IV examined was “CR prevalence” (5%, 15%, 30%), which refers to the percentage of the sample engaging in CR. Typical estimates of the prevalence of CR range from 10% to 12% (Meade & Craig, 2012; but see also DeSimone et al., 2018). Hence, a value of “15%” roughly corresponds to the usual prevalence of CR in a sample. The values of “5%” and “30%,” therefore, correspond to low and high rates of CR, respectively. These values also roughly align with the values that have been employed in other CR simulations (e.g., Hong et al., 2020).

The fourth IV examined was “CR severity” (20%, 50%, 100%), which refers to the percentage of a respondent’s responses that are careless. While it is often assumed that respondents will engage in CR across an entire survey, it is more realistic that respondents engage in partial CR (Meade & Craig, 2012). Support for this assertion comes from research showing that CR sometimes increases toward the latter portions of surveys, especially for long surveys (Bowling et al., 2020; Gibson & Bowling, 2019). For this IV, a respondent with 100% CR severity is engaging in CR across their entire response vector, a respondent with 50% CR severity is only engaging in CR for the final 50% of their response vector, and a respondent with 20% CR severity is engaging in CR for just the final 20% of their response vector.

The final IV, “CR type” (respondent-derived careless responding [CRC], respondent-derived random responding [RRC], mathematical random responding [MRRC], invariant responding [IRC]), refers to the form of the CR. The first two levels of this variable (i.e., CRC, RRC) reflect the response patterns that participants were instructed to exhibit in the “careless” and “random” conditions in study 1. The next level of this variable, MRRC, refers to mathematically random responding and follows a uniform distribution. Given that people often struggle to recognize and engage in true, mathematically random behavior (e.g., Nickerson, 2002), this condition was included to compare participants’ interpretation and enactment of random responding (i.e., the RRC condition) to true random responding. The last level of this variable, invariant responding (IRC), refers to a form of CR where participants select the same response option consecutively (e.g., selecting “3” for all items).

When simulating CR, certain methodological tradeoffs are required as no simulation method is capable of perfectly representing true CR (Curran & Denison, 2019; Schroeders et al., 2022). For example, it is possible that participants’ naturally occurring CR behavior does not completely mirror induced CR behavior, such as was manipulated in study 1 and used as a basis for the RRC and CRC conditions. On the other hand, this way of simulating CR is likely more realistic than using mathematically random responses as a proxy for CR, as people are rarely capable of engaging in true random behavior (Nickerson, 2002; Schroeders et al., 2022). Thus, to capitalize on the strengths and offset the weaknesses of different CR simulation methods, the following study examines four unique ways of simulating CR, rather than relying solely on a single method. To summarize, this study employs a 2 (sample size: 200 and 500) × 2 (number of items: 10 and 60) × 3 (CR prevalence: 5%, 15%, 30%) × 3 (CR severity: 20%, 50%, 100%) × 4 (CR type: respondent-derived careless responding [CRC], respondent-derived random responding [RRC], mathematical random responding [MRRC], invariant responding [IRC]) design with 144 unique conditions.

Simulation procedure

Step 1: Generate careful responses

A summary of the simulation procedure is provided in Fig. 2. The first step of the simulation procedure was to generate a series of careful (i.e., uncontaminated) item responses, which would then be contaminated with careless responses. Careful responses were generated for the sample size (200 and 500) and number of items (10 and 60) conditions (2 × 2 = 4 total conditions). These four careful response conditions served as the baseline item responses for examining how the introduction of CR impacts model fit under different circumstances. CR was then introduced for each combination of the remaining IVs: CR prevalence (5%, 15%, 30%), CR severity (20%, 50%, 100%), and CR type (CRC, RRC, MRRC, IRC). Thus, for each careful baseline condition, 36 different CR combinations were examined (3 × 3 × 4 = 36 total conditions per baseline condition; note that 36 × 4 = 144, which is the total number of conditions). This process is described in detail below.

Fig. 2 High-level overview of the simulation procedure used in study 2

To generate simulated careful responses, person parameters (i.e., θs) were first simulated from an N(0, 1) normal distribution. Next, item parameters were generated. To generate item parameters, the a and bi values that were computed in the careful condition for study 1 were used. To generate plausible a parameters, values were randomly sampled from a normal distribution of a parameters (see Meade et al., 2007). An examination of the distribution of a parameters in study 1 indicated that these values had M of 1.69 and SD of 0.64 (see Table 3). Thus, a values were randomly sampled from a N(1.69, 0.64) normal distribution. To generate a set of bi parameters, b1 values were first randomly sampled from a normal distribution that corresponded to the distribution of b1 values in study 1 (Table 3). Because the b1 values for the careful condition had M of −3.51 and SD of 1.21, these values were randomly sampled from an N(−3.51, 1.21) normal distribution.

Following previous research (LaHuis et al., 2011; Meade et al., 2007), the randomly sampled b1 values were used as the baseline values for the computation of the remaining three bi values. Specifically, b-to-b distances were randomly sampled from a normal distribution of b-to-b distances, and then added to the initial b1 value in an iterative manner. That is, a b1-to-b2 distance value was randomly sampled from a normal distribution of b1-to-b2 distances and then added to the initial b1 value. This yielded a b2 value. Next, a b2-to-b3 distance value was randomly sampled and then added to the b2 value to yield a b3 and so forth. For the careful condition, this entailed randomly sampling from a b1-to-b2 N(1.22, 0.42), b2-to-b3 N(1.01, 0.34), and a b3-to-b4 N(2.10, 0.57) normal distribution. A conceptual representation of the IRT bi and b-to-b distance values used for this process can be seen in Fig. 3. Upon estimating the person and item parameters, item responses (i.e., datasets) were then generated via WinGen, an IRT-based simulation software package (Han, 2007). All item responses had five response categories to simulate commonly used five-point, Likert-type surveys. The θ, a, and bi values generated above were used to inform the item responses.
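WinGen was used for the actual data generation; the sketch below shows a rough numpy equivalent of the sampling logic just described, reusing the grm_category_probs helper from the GRM sketch above. The distribution values are the careful-condition means and SDs reported in this paragraph; forcing the discriminations to be positive is a simplification added for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_item_parameters(n_items, rng):
    """Sample a and ordered b1-b4 values from the careful-condition distributions."""
    a = np.abs(rng.normal(1.69, 0.64, n_items))     # discriminations; abs() keeps them positive (sketch-only choice)
    b1 = rng.normal(-3.51, 1.21, n_items)           # first thresholds
    d12 = rng.normal(1.22, 0.42, n_items)           # b1-to-b2 distances
    d23 = rng.normal(1.01, 0.34, n_items)           # b2-to-b3 distances
    d34 = rng.normal(2.10, 0.57, n_items)           # b3-to-b4 distances
    b = np.column_stack([b1, b1 + d12, b1 + d12 + d23, b1 + d12 + d23 + d34])
    return a, b

def generate_responses(n_persons, a, b, rng):
    """Generate 1-5 Likert responses under the GRM for one simulated dataset."""
    theta = rng.normal(0, 1, n_persons)             # person parameters ~ N(0, 1)
    responses = np.empty((n_persons, len(a)), dtype=int)
    for i in range(len(a)):
        probs = grm_category_probs(theta, a[i], b[i])        # n_persons x 5 category probabilities
        cum = probs.cumsum(axis=1)
        u = rng.random((n_persons, 1))
        responses[:, i] = 1 + (u > cum[:, :-1]).sum(axis=1)  # inverse-CDF draw of a category 1-5
    return responses

a, b = sample_item_parameters(n_items=10, rng=rng)
careful = generate_responses(n_persons=200, a=a, b=b, rng=rng)   # one 200 x 10 careful dataset
```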

Fig. 3 Conceptual representation of the IRT bi and b-to-b distance parameters and distributions used for the study 2 data generation process

Fig. 4 Conceptual representation of the sampling procedure used for the study 2 data generation process

For each careful condition, 100 sets of θ values and 20 sets of a and bi values were generated. Each set of randomly sampled θ values and a and bi parameters was used to generate a dataset (this process was repeated 100 times for each condition; see below for details). For each set of a and bi values, five sets of item responses (i.e., datasets) were generated (see Fig. 4 for a conceptual representation of the response generation process). Each condition, therefore, contained 100 replications, which aligns with both CR and IRT simulation studies (e.g., Hong et al., 2020; Meade et al., 2007). The goal of generating multiple parameter sets (20 per condition) and multiple datasets per parameter set (five) was to derive stable estimates and mitigate error that can occur within the parameter and response generation process. This general process occurred four times for the four baseline conditions (with each condition having 100 replications). Model fit values were then estimated for all replications for each of the four conditions. The fit indices for all 100 replications for each of the four baseline conditions consistently yielded good model fit.
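Continuing the sketch above, the replication structure just described (20 parameter sets with five datasets each, i.e., 100 replications per baseline condition) can be written as a short loop; this is an illustration of the bookkeeping, not the exact WinGen workflow.

```python
# Sketch of the replication structure for one baseline condition (200 persons, 10 items)
replications = []
for _ in range(20):                                    # 20 sets of a and b parameters
    a, b = sample_item_parameters(n_items=10, rng=rng)
    for _ in range(5):                                 # 5 datasets (each with its own theta draw) per set
        replications.append(generate_responses(n_persons=200, a=a, b=b, rng=rng))
assert len(replications) == 100                        # 100 replications per baseline condition
```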

Step 2: Generate careless responses

Next, careless responses were generated. Careless item responses were generated using the same IRT-based response generation procedure described above for the CRC and RRC conditions. The primary difference was that the parameters from the careless and random conditions in study 1, rather than the careful conditions, were utilized (see Table 3). As above, this yielded a set of 100 item responses (i.e., replications) for each condition. For the remaining CR type conditions (i.e., MRRC, IRC), a different procedure for generating item responses was used. For the MRRC condition, responses were computed so as to align with the assumptions of a uniform distribution. In this case, because responses are being simulated for a five-point scale, each response option within the MRRC condition had a 20% probability of being endorsed. For the IRC condition, the simulated invariant response option was evenly distributed within each condition. For example, in the condition with 200 participants, 10 items, 5% prevalence, and 100% severity, 10 participants’ complete response vectors were replaced with invariant responses. Of these 10 response vectors, the first two contained the first response option selected 10 consecutive times (e.g., “1s” on the Likert scale), the next two response vectors contained the second response option selected 10 consecutive times (e.g., “2s” on the Likert scale), and so forth. Because there is no reason to expect that participants would favor one response option over another when responding invariantly, this method holds the invariant response options constant and enables an easier interpretation of the impact of invariant responding (DeSimone et al., 2018). Thus, the process of generating careless responses occurred 144 total times. Because each condition has 100 replications, this yielded 14,400 simulated datasets.
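The MRRC and IRC generation rules just described are simple enough to sketch directly; CRC and RRC responses would reuse the GRM generation sketch above with the careless- and random-condition parameter distributions from study 1 (Table 3). The helper names are placeholders for this sketch.

```python
import numpy as np

def generate_mrrc(n_persons, n_items, rng):
    """Mathematically random responding: each of the five options has a 20% probability."""
    return rng.integers(1, 6, size=(n_persons, n_items))

def generate_irc(n_persons, n_items):
    """Invariant responding: one option repeated across all items, spread evenly across respondents."""
    reps = int(np.ceil(n_persons / 5))
    options = np.repeat(np.arange(1, 6), reps)[:n_persons]   # e.g., 1,1,2,2,... for 10 respondents
    return np.tile(options[:, None], (1, n_items))
```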

Step 3: Combine careful and careless responses

The next step in the simulation procedure was to combine the careful and careless responses in the appropriate manner and determine how the introduction of CR affects model fit. For each of the 144 conditions, the careless responses replaced a subset of the responses in the careful dataset, thereby yielding a dataset with a mixture of careful and careless responses. The subset of careful item responses that was replaced was always the first set of item responses in the careful dataset. For example, in the baseline condition with 10 items and 200 participants, when CR prevalence is 30%, the first 60 careful item responses were replaced with 60 careless item responses. Likewise, when CR prevalence is 5%, the first 10 careful responses were replaced with 10 careless responses, and so forth. This process of combining responses occurred 100 times within each of the 144 conditions. Holding the responses being replaced constant is advantageous and provides a way to explicitly compare the effects of different forms of CR on model fit.
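A minimal sketch of this combination step is shown below: the first `prevalence` proportion of respondents have the final `severity` proportion of their items overwritten with careless responses, mirroring the replacement rule described above (function and argument names are illustrative).

```python
import numpy as np

def contaminate(careful, careless, prevalence, severity):
    """Overwrite part of a careful dataset with careless responses of the same shape."""
    data = careful.copy()
    n_persons, n_items = data.shape
    n_cr_persons = int(round(prevalence * n_persons))   # e.g., 30% of 200 = 60 careless respondents
    n_cr_items = int(round(severity * n_items))         # e.g., 50% severity = the final half of the items
    start = n_items - n_cr_items
    data[:n_cr_persons, start:] = careless[:n_cr_persons, start:]
    return data

# Example: 200 x 10 careful dataset, 30% prevalence, 50% severity, MRRC-type carelessness
# mixed = contaminate(careful, generate_mrrc(200, 10, rng), prevalence=0.30, severity=0.50)
```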

Step 4: Compute validity checks and fit indices

The final step of the simulation procedure was to compute a series of validity checks and model fit indices. To accomplish this, and in accordance with recommendations from prior CR research (e.g., Meade & Craig, 2012), multiple CR indices were computed. Specifically, two “traditional” CR indices and two IRT indices (see Beck et al., 2019; Karabatsos, 2003) were used: Mahalanobis distance (D), maximum longstring (LSmax), standardized log-likelihood (lz), and Guttman errors (G; see Curran, 2016 and Drasgow et al., 1985 for details). It was expected that the subset of careless responses within each condition would be more likely to be flagged for CR (as indicated via the four CR indices) compared with the careful responses. Finally, six model fit indices for two latent variable models were computed (χ2, RMSEA, and SRMSR [for CFA models] and M2, RMSEA, and SRMSR [for IRT models]). Unidimensional IRT (via the GRM) and CFA (via maximum likelihood [ML]) models were estimated to align with the models and parameters that were estimated in study 1. All CR and model fit indices were computed 100 times for each condition.
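Two of the four CR indices (Mahalanobis distance and maximum longstring) are straightforward to compute directly, as sketched below; lz and Guttman errors require an estimated IRT model and are omitted here. This is a generic illustration rather than the exact implementation used in the study.

```python
import numpy as np

def mahalanobis_d(responses):
    """Mahalanobis distance of each response vector from the sample centroid."""
    x = responses - responses.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(responses, rowvar=False))   # pseudo-inverse guards against singularity
    return np.sqrt(np.einsum('ij,jk,ik->i', x, cov_inv, x))

def longstring_max(responses):
    """Maximum number of consecutive identical responses within each response vector."""
    out = np.empty(len(responses), dtype=int)
    for p, row in enumerate(responses):
        run = best = 1
        for prev, curr in zip(row[:-1], row[1:]):
            run = run + 1 if curr == prev else 1
            best = max(best, run)
        out[p] = best
    return out
```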

Results

Validity checks

The results of the validity checks showed that the subset of careless responses was more likely to be flagged as careless compared with the subset of careful responses in most cases (see supplemental materials for results across all 144 conditions). To summarize the findings from all 144 conditions succinctly, Table 4 contains the average CR index values averaged across each baseline condition. Higher G, D, and LSmax values, and lower lz values, indicate greater amounts of CR. As seen in Table 4, the subset of careless responses was flagged as having higher amounts of CR. To formally compare the results, a series of t-tests comparing the CR index values between the careless and careful responses were conducted. As Table 4 indicates, 14 out of the 16 possible comparisons were significant. While there were a few rare exceptions to this trend for some of the specific conditions (see supplemental materials), these findings provide strong evidence that the simulation procedure was functioning as intended.

Table 4 Summary of the validity checks averaged across the baseline conditions

Model fit

After affirming the validity of the simulation procedure, the effects of CR on model fit were examined. Formally, this was accomplished by conducting a series of ANOVAs whereby the five simulation conditions/their associated levels (i.e., sample size, number of items, CR prevalence, CR severity, CR type), along with all possible two-way interactions, were entered as IVs, with each of the six model fit indices entered as the DVs. For each condition, all 100 replications were included in the analyses. Consequently, each ANOVA was conducted on a dataset containing 14,400 total rows (which is analogous to “participants” within traditional ANOVA frameworks), with each condition having 100 replications (i.e., “participants”). Finally, since the large number of data points would likely make every result significant, η2 values were computed to supplement these findings and assess the magnitude of the various IVs’ effects. While interactions at a higher level (e.g., three-way, four-way, five-way interactions) could have been included, the interactions were limited to two-way interactions to ease the interpretation of the findings.
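One way to set up such an ANOVA with η2 values is sketched below using statsmodels. The data frame `results` and its column names are assumed placeholders (one row per replication, with the fit index of interest as the DV), and η2 is computed as each effect's sum of squares divided by the total sum of squares, which is only approximate with Type II sums of squares.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# `results`: assumed long-format data frame with 14,400 rows (144 conditions x 100 replications)
# and columns for the five IVs plus each fit index (names below are illustrative).
ivs = '(C(sample_size) + C(n_items) + C(cr_prevalence) + C(cr_severity) + C(cr_type))'
model = smf.ols(f'rmsea ~ {ivs}**2', data=results).fit()   # main effects plus all two-way interactions
table = anova_lm(model, typ=2)
table['eta_sq'] = table['sum_sq'] / table['sum_sq'].sum()  # effect SS / total SS
print(table[['sum_sq', 'F', 'PR(>F)', 'eta_sq']])
```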

As seen in Table 5, the overall models for the M2 (F = 1430.20, p < .001), RMSEA (F = 353.29, p < .001), and SRMSR (F = 786.60, p < .001) IRT model fit indices were significant. There are several noteworthy aspects to these findings. First, number of items has the largest impact on the M2 fit index (η2 = .44), which is not surprising since M2 is a function of the number of items. Second, the RMSEA fit index is most impacted by CR severity (η2 = .19) followed by CR prevalence (η2 = .11). Third, SRMSR is equally impacted by CR prevalence and CR severity (η2s = .20). Fourth, CR type has only small/moderate effects across the three IRT fit indices (i.e., η2s ≤ .05) and a much smaller impact compared to CR prevalence and CR severity. Fifth, the effects of CR across the three fit indices are quite variable. This suggests that the way CR affects fit is a partial function of which fit index is utilized. Lastly, many of the interactions had nontrivial effect sizes (i.e., η2s ≥ .01), thereby further illustrating the complex manner in which CR affects fit. To better understand these interactions, a visualization of each meaningful interaction where η2 ≥ .01 is provided in the supplemental materials. Notably, both the CR prevalence × CR severity and CR severity × CR type interactions had small/moderate effect sizes across the three fit indices. For the CR prevalence × CR severity interaction, model fit deteriorates rapidly as CR prevalence increases when CR severity is 20%/50% but remains mostly unchanged as prevalence increases when CR severity is 100%. For the CR severity × CR type interaction, model fit was worse when CR severity was 50% (compared to either 20% or 100%) across CR types. This interaction also shows that model fit is consistently worse for invariant responding across differing levels of CR severity compared to the other CR types.

Table 5 ANOVA results summarizing the IVs effects on IRT model fit indices

Next, as seen in Table 6, the overall models for the χ2 (F = 7257.90, p < .001), RMSEA (F = 543.62, p < .001), and SRMSR (F = 935.60, p < .001) CFA model fit indices were significant (CFAs were again conducted via ML estimation and treated the responses as continuous). There are several notable aspects of these findings as well. First, number of items has a large impact on the χ2 index (η2 = .77), which is not surprising since χ2 is a function of the number of items. Second, CR prevalence (η2 = .13), CR severity (η2 = .13), and CR type (η2 = .13) have an equal impact on RMSEA. Third, CR severity (η2 = .21) and CR prevalence (η2 = .18) have the largest impact on the SRMSR index, again illustrating that the IVs impact model fit indices in unique ways. Fourth, there were a handful of interactions that displayed nontrivial effect sizes (i.e., η2s ≥ .01). A visualization of every interaction where η2 ≥ .01 is again provided in the supplemental materials. Notably, the CR prevalence × CR type interaction consistently displayed small/moderate effects across the three CFA indices. These interactions show that while increases in CR prevalence are associated with decreases in model fit across all CR types, this deterioration is more pronounced for invariant responding compared to the other CR types.

Table 6 ANOVA results summarizing the IVs effects on CFA model fit indices

There are also some important ways in which these results compare with the results of the IRT fit indices. First, CR type has a much larger effect on RMSEA for CFA model fit (η2 = .13) than IRT model fit (η2 = .02). Second, number of items affects χ2 (η2 = .77) much more than it affects M2 (η2 = .44), though both effects are very large. Third, the SRMSR fit index is similarly impacted by CR prevalence (CFA η2 = .18; IRT η2 = .20) and CR severity (CFA η2 = .21; IRT η2 = .20) across the CFA and IRT models. Fourth, while there was considerable overlap in the main effects and interactions that emerged across the CFA and IRT models, there were still differences present. This confirms that these IVs differentially impact the CFA and IRT models. Put another way, the effects of CR also appear to be a function of the latent variable model employed.

To follow up on these findings, a series of Tukey post hoc tests were conducted for each main effect. The results for all comparisons are provided in the supplemental materials. Of note, the findings showed that the four CR type levels consistently yielded different results, with the effect sizes of these comparisons generally being moderate/large (mean d value across all IRT fit comparisons = 0.39; mean d value across all CFA fit comparisons = 0.87). A close inspection of these results indicated differences between the CRC-RRC and RRC-MRRC conditions. This tentatively suggests that CR is different from random responding and that true randomness is different from respondents’ interpretation and enactment of random responding. Another interesting finding from the Tukey post hoc tests was that when CR severity was 50%, this worsened fit much more compared to when CR severity was either 20% or 100%. This was a consistent finding for all fit indices for both the IRT and CFA latent variable models.

To precisely delineate how CR affects model fit, bias was also computed for each condition, with the bias values for all 144 conditions being reported in Tables 7, 8, 9 and 10. Here, bias is defined as the discrepancy between the fit of the careful/uncontaminated models and the fit indices of the careless/contaminated models. Bias provides a way to precisely quantify the magnitude of the effect that CR has on model fit for each condition. Take the RMSEA value of .01 in the first row of Table 7 as an example. This value indicates that the RMSEA value for both the IRT and CFA models increased, on average, by .01 when CR was introduced within this condition (note that the fit of the baseline/uncontaminated model is provided at the bottom of Tables 7, 8, 9 and 10 to ease interpretations and clearly illustrate how much fit changes). As evidenced by many positive bias values, CR on average causes model fit to worsen (note that higher values for all fit indices are indicative of poorer fit). A close inspection of Tables 7, 8, 9 and 10, however, indicates that these effects are highly variable and that the effects of CR on model fit are worse in some situations than in others (e.g., in the CRC, 30% prevalence, 50% severity, 200 participants, 10 items condition, RMSEA [CFA] increases by .12 on average when CR is introduced). Moreover, while CR worsens fit in most cases, these results also indicate that there are a few cases where the presence of CR improves fit (e.g., in the CRC, 5% prevalence, 100% severity, 200 participants, 60 items condition, RMSEA [IRT] decreases by .01 on average when CR is introduced). In sum, across all 144 conditions, M2 worsened in 79% of cases, IRT RMSEA worsened in 84% of cases, IRT SRMSR worsened in 95% of cases, χ2 worsened in 100% of cases, CFA RMSEA worsened in 97% of cases, and CFA SRMSR worsened in 93% of cases. Collectively, these results indicate that the effects of CR on model fit are largely detrimental.
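Given baseline and contaminated fit values across replications, the bias values reported in Tables 7, 8, 9 and 10 follow directly from the definition above; a trivial sketch is shown below (variable names are illustrative).

```python
import numpy as np

def fit_bias(contaminated_fit, baseline_fit):
    """Mean discrepancy between contaminated and baseline fit values across replications.

    Positive values indicate that fit worsened once CR was introduced
    (higher values on all six indices reflect poorer fit).
    """
    return float(np.mean(np.asarray(contaminated_fit) - np.asarray(baseline_fit)))

# e.g., fit_bias(rmsea_with_cr, rmsea_baseline) -> 0.01 would match the Table 7 example above
```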

Table 7 Summary of all model fit bias values (200 participants/10 items)
Table 8 Summary of all model fit bias values (200 participants/60 items)
Table 9 Summary of all model fit bias values (500 participants/10 items)
Table 10 Summary of all model fit bias values (500 participants/60 items)

Discussion

The overarching goal of this study was to examine the effects of CR on model fit across a range of commonly encountered survey situations via a comprehensive, realistic, and rigorous simulation study. Overall, the results of this study indicated that CR tends to deteriorate model fit under most circumstances, though these effects are highly nuanced, complex, and contingent on many factors, which may somewhat account for the inconsistent findings in the literature. For instance, prior research has shown that CR is related to poor model fit (e.g., Woods, 2006), good fit (e.g., Goldammer et al., 2020), and/or is unrelated to model fit (e.g., Liu et al., 2019). As this study shows, while the effects of CR on model fit are contingent on many factors (i.e., sample size, number of items, type of CR, severity of CR, prevalence of CR), CR deteriorates model fit under most circumstances. Moreover, while the effects of CR on fit are generally negative, the magnitude of these effects is quite variable (see Tables 7, 8, 9 and 10). This affirms that multiple model fit indices should be used when evaluating latent variable models since different fit indices are differentially affected by CR. This also demonstrates how minor changes to a survey context can cause model fit to deteriorate in notable ways. Given that the effects of CR on fit tend to be negative in most situations, and given the results of other studies showing how CR can distort research findings (e.g., DeSimone et al., 2018; Huang et al., 2015), it is important to proactively account for and address CR throughout all aspects of data collection and analysis (see also Aguinis & Vandenberg, 2014; DeSimone et al., 2015). Failing to account for CR all but guarantees that model fit index values are inaccurate to some extent.

Overall, this study indicated that the effects of CR on model fit are most detrimental (i.e., model fit bias is highest) for invariant (i.e., IRC) responding, when prevalence was high (i.e., 30%), and when severity was 50%. This was highly consistent for all six fit indices and the CFA and IRT models. For M2 and χ2, the bias values were highest when the number of items was 60 and there were 500 participants, though this was not the case for the other four fit indices (which, unlike these two indices, are not a direct function of the number of items in a survey). Regarding the results of the ANOVAs, CR prevalence and CR severity consistently had the largest impact on the RMSEA and SRMSR fit indices across both CFA and IRT models (with M2 and χ2 again being heavily influenced by the number of items). These findings overlap considerably with the model fit bias values reported in Tables 7, 8, 9 and 10. It was somewhat surprising that the effects of CR on model fit were consistently worse when severity was 50% rather than 100%. This could be because data were generated according to different models when severity was 50%, but the same model when severity was 100%. Based on these findings, it would be beneficial for future research to examine this in more detail, especially since partial CR is more realistic than full CR.

This study also provides initial evidence that different types of CR have unique effects on model fit, though these effects are more pronounced for certain indices than others. For example, the effect size of the CR type variable was largest for the RMSEA (CFA) fit index, with the effect sizes of the three IRT fit indices being somewhat smaller. Moreover, the Tukey post hoc tests indicated that there were notable differences in the impact of the four CR types on fit. As noted above, these findings showed that invariant responding has the most detrimental impact on fit. Fortunately, this type of responding is the easiest to identify and remove. Additionally, these findings showed that the effects of CRC differ from the effects of both RRC and MRRC and that RRC yields different results than MRRC. The former finding suggests that CR may be distinct from random responding. An inspection of the results of the validity checks indicated the CRC condition was more likely to contain a mixture of random and invariant responding (as evidenced by the larger Longstring values), whereas random responding was less likely to contain this type of CR (as evidenced by the smaller Longstring values). Moreover, the fact that RRC and MRRC did not have a similar impact on model fit suggests that respondents’ interpretation and enactment of random responding is not equivalent to true random responding. If these were the same response behaviors/patterns, the effects of CR on fit across these two types of CR would be expected to align much more closely, which they do not. This finding corresponds with a large body of research showing that people struggle to recognize and enact true random behaviors (see Nickerson, 2002), and suggests that this well-established finding may also extend to survey responding. This finding has important implications for CR simulations. For instance, this suggests that the common assumption within simulations that true random responding (e.g., randomly generating responses from a uniform distribution) can be used as a proxy for CR may not be an accurate representation of how CR manifests within surveys.

Theoretical and practical implications

The results of this study have important implications for theories of CR, latent variable theory (e.g., Borsboom, 2008), and theory development more generally. For example, this study shows that sufficiently good model fit can be attained when up to 30% of a sample is engaging in CR. It is debatable whether such a latent variable model, even with “good” fit, can still be considered valid since many responses are careless and lack validity. The failure to account for CR would, however, cause one to incorrectly conclude that the “good” fitting model is valid/representative of the underlying data (see also Arias et al., 2020). This situation shows that good model fit is a necessary, but insufficient, feature of a valid latent variable model. Latent variable models must also have valid (i.e., not careless) responses underlying the model. As another example, imagine a situation where CR causes the model fit indices for a latent variable model to exceed the threshold for what constitutes good fit (e.g., the RMSEA changes from .07 to .09 due to the presence of CR). In this situation, one might incorrectly conclude that the model is not appropriate when poor fit is being driven by careless responses underlying the model. In both of these situations, CR needs to be assessed to accurately evaluate fit. In the former example, model fit does not account for CR and causes the model to be viewed as appropriate when it is not (i.e., a false positive). In the latter example, CR decreases the fit of the model and causes the model to be rejected, despite CR being the reason for the poor fit (i.e., a false negative). Evaluating the appropriateness of models solely based on fit is, therefore, not appropriate, as model fit is not necessarily a reliable proxy for CR, despite the generally detrimental effect of CR on fit. That is, CR and (poor) model fit should be separately assessed and accounted for.

The results of this study also show there are limitations to the use of arbitrary model fit cutoffs (e.g., RMSEA < .08; see also Nye & Drasgow, 2011). As noted above, “good” fit can sometimes be obtained even with high levels of CR prevalence (30%) and severity (100%). Thus, many seemingly good fitting models that are published may be laden with CR. Likewise, many seemingly poor fitting models may go unpublished, even though the poor fit may be attributable to CR. To illustrate these considerations further, consider the scenarios in Fig. 5 (labeled “A” through “F”). In most instances (except Scenario E), model fit improves as CR is accounted for, albeit to varying degrees given the variable effects of CR on model fit.

Fig. 5

Matrix illustrating various scenarios that can be encountered for CR (low, high) and model fit (poor, good). The vertical line represents the threshold for model acceptance/rejection (e.g., good/bad fit) that is commonly used within latent variable frameworks (e.g., RMSEA < .08). The shaded circles represent the fit of a model when CR is not accounted for. The dotted circles represent the fit of the models when CR is accounted for. Except for Scenario E, these examples also illustrate that the effects of CR are mostly detrimental to model fit. The variability in the length of these shifts in model fit illustrates the variable effects that CR has on fit, which this study shows depend on specific IVs, combinations of IVs, fit indices, etc.

While proactively addressing CR and ensuring one has high-quality data should be standard practice for proper theory development and sound science, the extent to which this is done is unclear, as published research does not always specify how CR was identified and addressed. As this study and others demonstrate (e.g., DeSimone et al., 2018; Huang et al., 2015; Kam & Meyer, 2015; Woods, 2006), CR distorts research findings, and this may result in incorrect theoretical conclusions. Moreover, CR has also been speculated to be a potential contributor to the replication crisis in psychology (Curran, 2016; Open Science Collaboration, 2015). This seems plausible since, all else being equal, two studies containing differing amounts of CR are likely to yield different results. Thus, to develop more accurate theories and ensure a more robust science, accounting for and addressing CR should be standard practice, and details on how CR was addressed should be routinely included in research publications. This sort of rigor is already common for evaluating and reporting latent variable models (e.g., imagine the challenge of publishing a latent variable model without any information on model fit) and can easily be extended to CR by reporting the results of multiple CR indices (see Curran, 2016; DeSimone et al., 2015; and Hong et al., 2020 for guidance on how to employ such indices).
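As a concrete illustration of what such reporting can rest on, the short Python sketch below computes two widely used CR indices: Longstring (the longest run of identical consecutive responses) and intra-individual response variability (the within-person standard deviation of responses). The example response patterns and the flagging heuristic in the final comment are illustrative placeholders rather than thresholds recommended by the cited sources.

```python
import numpy as np

def longstring(row):
    """Longstring index: length of the longest run of identical consecutive responses."""
    longest = current = 1
    for prev, cur in zip(row, row[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest

def irv(row):
    """Intra-individual response variability: the within-person standard deviation."""
    return float(np.std(row))

responses = np.array([
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],   # fully invariant pattern
    [1, 5, 2, 4, 3, 1, 5, 2, 4, 3],   # highly variable pattern
    [4, 4, 5, 3, 4, 4, 5, 4, 3, 4],   # plausibly attentive pattern
])

for person in responses:
    print(longstring(person), round(irv(person), 2))

# A placeholder screening heuristic: flag respondents whose Longstring exceeds,
# say, half the survey length or whose IRV is near zero, and report how many
# respondents each index flags alongside the latent variable model results.
```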

The results of this study also have important practical implications for researchers and practitioners who employ surveys. For instance, this study highlights the situations where the effects of CR on model fit are most detrimental as well as those where the effects of CR are negligible. Such information can help practitioners identify when strategies for reducing CR may be needed, which is especially useful in contexts where survey results have the potential to meaningfully impact systems, operations, and policies. Knowing when CR needs to be addressed, and then subsequently addressing it in these situations, may also be important for legal reasons. For example, if key organizational and educational decisions (e.g., admissions decisions, pay increases, promotions, terminations) are based on data that are laden with CR, such decisions are precarious, unlikely to be justifiable, and could potentially be subject to legal action. Put succinctly, data-driven decisions will be much more defensible if they are based on high-quality data. The results of this study can also be used by practitioners in a broader sense to gain insight into how CR may impact their results prior to data collection. For example, a practitioner who wants to know how a long survey administered to a small sample with a high level of CR may affect their data could look to the conditions with 60 items, 200 participants, and 30% CR prevalence as a source of guidance about what to expect. This sort of benchmarking information could be useful for planning interventions and determining when additional steps will be needed to ensure survey data quality.

Limitations and future directions

Despite the important contributions of this study, some limitations and areas for future research should be noted. First, the generalizability of this simulation (study 2) is limited to the parameters that were derived from the conscientiousness measure in study 1. While many other parameters, and combinations thereof, could have been used, the parameters that were used yielded data that were negatively skewed. The results of this study are likely to generalize to other constructs that are negatively skewed, but future studies should examine how CR affects model fit when different parameters and response distributions are used. Second, while the careless behaviors in study 1 were shaped in a manner consistent with how other response behaviors are manipulated (e.g., faking; see Zickar et al., 2004), it could be the case that people’s naturally occurring CR does not align with experimentally shaped CR. This is an inherent limitation of CR studies, however, since the true nature of participants’ responses can never be known with certainty. Even so, the manipulation checks and additional data cleaning conducted in study 1 indicated that people generally followed the instructions, and this approach represents a more realistic way of conducting CR simulations than what is typical in the literature. Additionally, both mathematically random and invariant responding were examined. Because this study employed a larger number of operationalizations of CR than is typical for CR simulation studies, it helps address the limitations of relying solely on a single CR simulation method (e.g., only mathematically random responding). Third, while this study examined many IVs, other IVs could have been examined (e.g., situations where half the respondents respond invariantly and the other half respond randomly), as could other CR patterns (e.g., random responding around the scale midpoint and diagonal responding; see Curran & Denison, 2019; Ulitzsch et al., 2022). Likewise, while the fit indices in this study were selected because of their prominence within the literature, other model fit indices (e.g., CFI, AIC) could also have been examined. Future research should, therefore, continue to examine the effects of CR on model fit with other IVs and/or fit indices (or even multidimensional models). Lastly, while this study provides strong evidence that CR deteriorates model fit under most situations, it cannot account for why this occurs. Now that the CR–fit linkage has been more clearly established, a logical next step is for future research to investigate the potential causal mechanisms in greater detail.
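For readers who wish to extend the simulation in these directions, the brief Python sketch below generates two of the additional CR patterns mentioned above: random responding around the scale midpoint and diagonal responding. The scale length, the width of the midpoint band, and the cycling rule are hypothetical illustrations rather than the operationalizations used in the cited work.

```python
import numpy as np

rng = np.random.default_rng(7)
n_items, scale_points, midpoint = 20, 5, 3  # hypothetical 5-point scale

def midpoint_random(n_respondents):
    """Responses drawn at random from the options adjacent to the scale midpoint."""
    return rng.integers(midpoint - 1, midpoint + 2, size=(n_respondents, n_items))

def diagonal(n_respondents):
    """Responses that cycle through the scale in a repeating 1-2-3-4-5 pattern."""
    starts = rng.integers(0, scale_points, size=(n_respondents, 1))
    offsets = np.arange(n_items)
    return (starts + offsets) % scale_points + 1

print(midpoint_random(3))
print(diagonal(3))
```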

Conclusion

While surveys remain a very useful method for gathering data across various research settings—and model fit indices remain a crucial way to evaluate latent variable models—a nontrivial number of people who complete surveys engage in CR. Despite the importance of model fit and the ubiquity of CR, few studies have explored the CR–model fit linkage in detail. The results of this comprehensive, rigorous, and realistic simulation study show that the effects of CR on model fit are mostly negative. It is, therefore, crucial to account for and address CR wherever surveys and latent variable models are being employed. Failing to do so all but guarantees that one’s results are inaccurate, and this can lead to misleading research conclusions.