Full Length Article
Differentiating conscientious from indiscriminate responders in existing NEO-Five Factor Inventory-3 data

https://doi.org/10.1016/j.jrp.2019.05.009

Highlights

  • The mean Inter-Item Standard Deviation (M-ISD) identifies indiscriminate responders in questionnaire data.

  • We compared the M-ISD against four post-hoc validity indexes and an embeddable validity scale in three data sets.

  • The M-ISD outperformed all other validity indexes, supporting its use for identifying indiscriminate responders in existing data sets.

Abstract

The mean Inter-Item Standard Deviation (M-ISD; i.e., the mean of several single-scale ISDs) is a post-hoc validity index that statistically differentiates conscientious responders (CRs) from indiscriminate responders (IRs) in psychological questionnaire data. We compared the M-ISD’s effectiveness against four other post-hoc indexes and an embeddable validity scale in three sets of NEO-Five-Factor Inventory-3 data. The M-ISD showed superior classification ability to all other post-hoc indexes and even outperformed the embeddable validity scale. The average classification accuracies of the M-ISD and the embeddable scale were 97% and 93%, respectively, whereas only one of the four remaining post-hoc indexes exceeded our classification accuracy criterion of 80%. These findings suggest researchers can use the M-ISD to differentiate valid from invalid data.

Introduction

With the explosion in popularity of Internet-mediated research and online data collection, particularly in MTurkified areas such as social and personality psychology (Anderson, Allen, Plante, Quigley-McBride, Lovett, & Rokkum, 2018), the hazards of indiscriminate responding (IR) in psychological questionnaires have gained considerable attention from researchers. IR is a pseudo-random response style in which responders generate item responses indiscriminately, carelessly, unsystematically, or without regard for their semantic content. We prefer the label indiscriminate over the others the style is known by (careless, inattentive, invalid responding, insufficient effort) because it describes the unsystematic response pattern itself without speculating about the motivations and traits of those who respond that way.

Once thought only to attenuate effect sizes and thereby increase Type II error, IR is now understood to also magnify statistical effects and cause Type I error (Credé, 2010, Holden et al., 2018, Holtzman and Donnellan, 2017, Huang et al., 2015, Marjanovic et al., 2015). Especially in the wake of the replication crisis in psychological science, which revealed paltry replication rates of between 36% and 62%, all researchers can benefit from removing the contaminating influence of IR from their data (Camerer et al., 2018, Open Science Collaboration, 2015). Cleaner data sets will improve the replicability of research results. This is particularly true for academic researchers because they tend not to use commercially available measures that have embedded validity scales capable of detecting IR (e.g., the MMPI series; Berry, Wetter, Baer, Larsen, Clark, & Monroe, 1992). They leave those proven measures to applied testers, such as clinicians and forensic counsellors, who can afford to use them. Most basic-science researchers do little to identify IR, whether because they are unaware of the problem or because they lack the means to address it. For example, in a review of 94 articles published in the Journal of Research in Personality in 2018, only seven studies (7.45%) used established validity scales or other cues to identify indiscriminate responders.1

The impetus of Marjanovic, Struthers, Cribbie, and Greenglass’s (2014) work was to develop simple and accessible validity tools that provide all researchers with an efficient means to differentiate CR and IR data (i.e., data generated by conscientious and indiscriminate responders, respectively). One tool we developed was proactive: it required researchers to embed it stealthily in a questionnaire before administering it to responders. The Conscientious Responders Scale (CRS; Marjanovic et al., 2014) is a 5-item, embeddable validity scale that instructs responders exactly how to answer each of its items (e.g., Please answer this item by choosing option two—Disagree). A researcher would randomly embed its items throughout a questionnaire’s length to measure rates of IR in all of its sections: beginning, middle, and end. Correct item responses are scored as 1s, whereas incorrect and missing responses are scored as 0s. Item scores are then summed to make a Total Score, which ranges between 0 (all items answered incorrectly) and 5 (all items answered correctly). The advantage of the CRS’ instructional items is that testers can use binomial probability theory to calculate the likelihood of an IR answering items correctly by chance alone. They can use theory to derive a priori cut-off scores, which sets the CRS apart from most other validity scales, which require normative testing to accomplish the same. On a 5-point response scale, binomial probability informs us that the chance of an IR achieving a CRS total score of 3, 4, or 5 is 5.79%, just above psychology’s trusted p-value criterion of 0.05. Testers can therefore be confident that responders with high CRS scores (3 to 5) are unlikely to be IRs and are by default labelled CRs. They can similarly be confident in labelling low CRS scorers (0 to 2) as IRs because their data are indistinguishable from random data. Recent research shows that embedding the CRS in a questionnaire consisting of multiple measures has no negative consequences for the psychometric properties of those measures (Breitsohl and Steidelmuller, 2018, Marjanovic et al., 2018). Questionnaires with or without an embedded CRS yield similar means, standard deviations, Cronbach alpha coefficients, and factor structures. Furthermore, the unusual nature of the item content does not irritate or frustrate responders to the point of ironically provoking more IR, as some researchers have speculated (Meade and Craig, 2012, Niessen et al., 2016).
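The a priori cut-off follows directly from the binomial distribution. As a minimal sketch of that calculation, using only the facts stated above (five items, a one-in-five chance of guessing any single item correctly), the probability of an IR reaching a total score of 3 or more by chance can be computed as follows:

```python
from math import comb

# Chance of guessing one CRS item correctly on a 5-point response scale.
p_correct = 1 / 5
n_items = 5

# P(total score >= 3): an indiscriminate responder guessing 3, 4, or 5 items correctly.
p_pass_by_chance = sum(
    comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
    for k in range(3, n_items + 1)
)

print(f"P(CRS total >= 3 by chance) = {p_pass_by_chance:.4f}")  # ~0.0579, i.e., 5.79%
```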

When employed in actual questionnaires, the CRS performs quite well (e.g., Guilfoyle et al., 2019, Ortner et al., 2018, Francavilla et al., 2018). The CRS consistently and accurately distinguishes CRs from IRs at between 90% and 95% classification accuracy (the average of its sensitivity [the proportion of CRs correctly classified as CRs] and specificity [the proportion of IRs correctly classified as IRs]; Marjanovic et al., 2014, Marjanovic et al., 2015). Unlike most validity scales, which aim to identify IRs, the CRS was developed to identify CRs. Because there are a variety of reasons why IR occurs (e.g., carelessness, fatigue, linguistic incompetence, psychopathology; Johnson, 2005, Nichols et al., 1989), it is difficult to tease them apart. IR response patterns are often confounded by multiple possible explanations (e.g., answering true to a bogus item like “I talk to dead people” may indicate either IR or mental illness). For the sake of simplicity, Marjanovic et al. (2014) concentrated their efforts on differentiating CRs from all other response patterns. CR data are what researchers hope to retain in their data sets. Non-CR data, for whatever reason they occur, are what researchers hope to cull.
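Because classification accuracy is defined above as the average of sensitivity and specificity, it can be computed directly from a two-by-two classification table. The sketch below is illustrative only; the counts are hypothetical and not drawn from any of the studies cited.

```python
def classification_accuracy(cr_correct, cr_total, ir_correct, ir_total):
    """Mean of sensitivity (proportion of CRs correctly classified as CRs)
    and specificity (proportion of IRs correctly classified as IRs),
    as defined in the text."""
    sensitivity = cr_correct / cr_total
    specificity = ir_correct / ir_total
    return (sensitivity + specificity) / 2

# Hypothetical counts for illustration only.
print(classification_accuracy(cr_correct=95, cr_total=100,
                              ir_correct=90, ir_total=100))  # 0.925
```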

Alternative scales made up of instructional items have performed impressively at correctly identifying CR data (true positives), although in many cases it is less clear how well they also correctly identify IR data (true negatives; Kam & Chan, 2018). Too often, validity scales are good at identifying CRs and awful at identifying IRs, or vice versa (Huang et al., 2012, Maniaci and Rogge, 2014, Meade and Craig, 2012, Niessen et al., 2016). Researchers contribute to this problem by employing criteria for identifying IRs so liberal (e.g., completing questionnaires too quickly, failing a single instructional item) that they cast too wide a net and misidentify CRs as IRs. For a validity scale to have any real value, it should be good at both tasks, preferably with classification accuracies between 80% and 100% (Clark, Gironda, & Young, 2003).

Another problem with alternative instructional-item validity scales is that they often seem haphazardly constructed or are presented with little to no information about their validity or the cut scores used to differentiate responders (Kam and Chan, 2018, Kim et al., 2018). The hazard of accepting vague details from developers about their validity scales is that too much of the decision-making for identifying responders is entrusted to researchers’ individual judgment. Without guidance, researchers may intentionally or unintentionally “cherry pick” features of a validity scale’s results that enhance their ability to reject null hypotheses (Curran, 2016). Ironically, researchers may use validity scales to p-hack their way to more favorable data sets (Head, Holman, Lanfear, Kahn, & Jennions, 2015). There is reason to be worried. Researchers have a habit of generating results that systematically suit their hypotheses or their benefactors (Bekelman et al., 2003, Makel et al., 2012). To redress these tendencies, researchers may consider submitting their a priori criteria for screening data to third-party online repositories, as they are now encouraged to do with study designs and hypotheses, in the spirit of open science (Klein et al., 2018).

The CRS’ main limitation is that it only works prophylactically—by embedding it in questionnaires before responders complete them. Fortunately, Marjanovic and colleagues developed a second validity tool for cases where no validity scale was embedded in advance, one that researchers apply retroactively to existing data sets. The Inter-Item Standard Deviation (ISD; Marjanovic et al., 2015) is a within-subjects statistical index of inter-item response consistency: it captures the dispersion of a person’s individual item responses around that person’s mean score across all of those items. A small ISD indicates a narrow dispersion of responses across a set of items measuring a single construct, whereas a large ISD indicates a wide dispersion of item responses. Because we expect CRs to answer items honestly and accurately, we expect them to answer consistently across a set of items, producing small ISDs. Conversely, given that IRs answer items without regard for their semantic content, we expect their response dispersions to be much greater, producing large ISDs.
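Written out, a single-scale ISD for one responder on a k-item scale is simply the standard deviation of that person’s item responses, where x_i is the response to item i and x̄ is that person’s mean across the k items. The sample (k − 1) denominator shown below is our assumption about the exact computational form rather than a formula quoted from the original articles:

```latex
\mathrm{ISD} \;=\; \sqrt{\frac{\sum_{i=1}^{k}\left(x_i - \bar{x}\right)^{2}}{k-1}},
\qquad
\bar{x} \;=\; \frac{1}{k}\sum_{i=1}^{k} x_i
```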

In the first tests of the ISD, it correctly differentiated CRs and IRs with classification accuracies between 57% and 89% (Marjanovic, 2010, Marjanovic et al., 2015). The concept behind the ISD was sound, but it did not work well enough: the error rate was still too high to justify using it to expunge suspected IRs from data sets. Researchers then applied the principle of aggregation to the ISD to improve its discriminatory power. “The principle of aggregation states that the sum of a set of multiple measurements is a more stable and representative estimator than any single measurement” (Rushton, Brainerd, & Pressley, 1983, p. 18). Therefore, instead of using single-scale ISDs alone to differentiate responders (that is, ISDs calculated from all of the items that measure a single psychological construct), all of the single-scale ISDs were averaged into a mean ISD score. When aggregated, the ISD’s discriminatory power jumped to between 95% and 98% classification accuracy, outperforming even the CRS (Marjanovic et al., 2015). Francavilla et al. (2018) recently confirmed this high accuracy rate in a study where they generated an ISD from a 300-item Big-5 personality inventory. Their ISD correctly classified responders with a classification accuracy exceeding 95%.
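A minimal sketch of the aggregation step is given below. It assumes only what is stated above: each single-scale ISD is the standard deviation of one responder’s item responses within a scale (after reverse-keyed items are recoded), and the M-ISD is the mean of those single-scale ISDs. The function and variable names, the ddof = 1 choice, and the example data are ours, not the authors’.

```python
import numpy as np

def mean_isd(responses, scales, ddof=1):
    """Mean Inter-Item Standard Deviation (M-ISD) for one responder.

    responses : dict of item name -> numeric response, with reverse-keyed
                items already recoded.
    scales    : dict of scale name -> list of item names measuring that
                single construct.
    ddof      : 1 gives the sample standard deviation form (an assumption).
    """
    single_scale_isds = [
        np.std([responses[item] for item in items], ddof=ddof)
        for items in scales.values()
    ]
    return float(np.mean(single_scale_isds))

# Example with two hypothetical 3-item scales.
resp = {"n1": 2, "n2": 3, "n3": 2, "e1": 4, "e2": 5, "e3": 4}
print(mean_isd(resp, {"N": ["n1", "n2", "n3"], "E": ["e1", "e2", "e3"]}))  # ~0.58
```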

Since publication, researchers have used the ISD in a number of studies to ferret out IRs in questionnaire data (e.g., Francavilla et al., 2018, Kurtmollaiev et al., 2018, Marjanovic et al., 2018, Zeigler-Hill and Trombly, 2018). When compared against other validity indexes, the ISD performs more like other response-consistency indexes, such as inconsistency scales and even-odd indexes, than like item-based indexes, such as infrequency/bogus items and instructional-item checks (Francavilla et al., 2018). As Curran (2016) noted, one limitation of the ISD is that long strings of identical responses produce ISDs near zero. For example, if a responder selected the same response to every item of a 10-item scale with no negatively keyed (reverse-scored) items, that responder’s ISD would be 0.00 and the responder would be labelled a CR. We point out that most personality measures include negatively keyed items, which would produce ISDs greater than zero. In addition, because IR is by definition unsystematic, roughly as many long-string responders should answer items above the midpoint of a scale as below it. Provided the measure has an expected mean score near its scale’s midpoint, straight-line responding would reduce a sample’s response variance and exacerbate only Type II error (Credé, 2010, Marjanovic et al., 2015). Given that most psychological researchers set their alpha levels (<0.05 = probability of a false-positive finding) lower than their beta levels (<0.20 = probability of a false-negative finding), we can infer that to them, Type II error is the lesser of the two evils (Banerjee, Chitnis, Jadhav, Bhawalkar, & Chaudhury, 2009).3
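To make the long-string caveat concrete, the sketch below uses a hypothetical 10-item scale with a 1-to-5 response format whose last five items are negatively keyed; the specific numbers are ours for illustration. An identical string of responses yields an ISD of zero before reverse-scoring but not after:

```python
import numpy as np

# Straight-line responder selecting "4" (Agree) on every item of a
# hypothetical 10-item scale with a 1-5 response format.
raw = np.full(10, 4)
print(np.std(raw, ddof=1))  # 0.00 -> indistinguishable from a CR by the ISD alone

# If items 6-10 are negatively keyed, reverse-scoring (6 - response)
# breaks up the identical string and pushes the ISD above zero.
recoded = raw.copy()
recoded[5:] = 6 - recoded[5:]
print(np.std(recoded, ddof=1))  # ~1.05
```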

Section snippets

The present investigation

The purpose of this investigation is to demonstrate to researchers the value and robustness of the mean ISD as a post-hoc validity index for existing data. To do so, we compared its classification accuracies against several of the best performing validity indexes on a widely known personality inventory: the NEO-Five-Factor Inventory-3 (NEO-FFI-3; McCrae & Costa, 2010). We specifically chose the NEO-FFI-3 for our comparisons because it consists of multiple well-validated subscales, which are a

Participants

Samples 1–3. We recruited participants for all three samples from a medium-sized Western Canadian university in exchange for course credit in introductory psychology classes. Sample sizes were 130, 109, and 119, respectively. The mean ages for these samples were 20.48 (SD = 4.39), 20.91 (SD = 5.26), and 19.85 (SD = 3.70) years, respectively. Samples included mostly women: Sample 1 = 70.00% with 2 responders undisclosed, Sample 2 = 68.81% with 1 responder undisclosed, and Sample 3 = 64.71% with

Results

Table 2 presents descriptive statistics for all NEO-FFI-3 scales and 11 validity indexes across the three samples. This includes means, standard deviations, mean inter-item correlations, and Cronbach’s alphas. Table 3 presents correlations between responder group (CR = 0, IR = 1), CRS, and all post-hoc validity indexes. We expected responder group to positively correlate with the inter-item standard deviation, inconsistency scale, and long string index (for which higher scores indicate IR), and

Discussion

The purpose of this investigation was to examine the efficacy of the mean ISD as a means to differentiate conscientious and indiscriminate responders in existing data. We compared its performance against that of four widely used post-hoc validity indexes (a psychometric inconsistency scale, Jackson’s (1977) Even-Odd Index, Mahalanobis Distances, and the Long-String Index) in a personality inventory noted for not having validity indexes of its own: the NEO-Five-Factor Inventory-3. We embedded a

Conclusion

With the explosion in popularity of Internet-mediated research and online data collection, the need for effective means to identify indiscriminate responders has never been greater. The simplest and most efficient way of detecting IR is with a validity scale embedded in the questionnaire itself, but this does not help identify IR in existing data sets or in data that do not already contain such validity scales. In these types of unprotected data, researchers have a range of post-hoc validity

Authors’ Note

Zdravko Marjanovic and Ronald R. Holden shared responsibilities for all aspects of this paper from conceptualization to report writing. We thank Lisa Bajkov, Jennifer MacDonald, Ian Fung, and Noor Shubear for their assistance with this research. This research did not receive any funding from the public, commercial, or not-for-profit sectors.

We pre-registered all our study hypotheses at the Open Science Framework repository (https://osf.io/2v4t5/). We also housed there our three data sets and

References (56)

  • Anderson, C.A., et al. (2018). The MTurkification of social and personality psychology. Personality and Social Psychology Bulletin.

  • Baer, R.A., et al. (1997). Detection of random responding on the MMPI-A. Journal of Personality Assessment.

  • Banerjee, A., et al. (2009). Hypothesis testing, type I and type II errors. Industrial Psychiatry Journal.

  • Baumeister, R.F. (1991). On the stability of variability: Retest reliability of metatraits. Personality and Social Psychology Bulletin.

  • Bekelman, J.E., et al. (2003). Scope and impact of financial conflicts of interest in biomedical research: A systematic review. Journal of the American Medical Association.

  • Berry, D.T.R., et al. (1992). MMPI-2 random responding indices: Validation using self-report methodology. Psychological Assessment.

  • Breitsohl, H., et al. (2018). The impact of insufficient effort responding detection methods on substantive responses: Results from an experiment testing parameter invariance. Applied Psychology.

  • Camerer, C.F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour.

  • Christiansen, N.D., et al. (2017). Using item-level covariance to detect response distortion on personality measures. Human Performance.

  • Clark, M.E., et al. (2003). Detection of back responding: Effectiveness of MMPI-2 and Personality Assessment Inventory validity indices. Psychological Assessment.

  • Cohen, J. (1992). A power primer. Psychological Bulletin.

  • Costa, P.T., et al. (1997). Stability and change in personality assessment: The revised NEO Personality Inventory in the year 2000. Journal of Personality Assessment.

  • Credé, M. (2010). Random responding as a threat to the validity of effect size estimates in correlation research. Educational and Psychological Measurement.

  • Desimone, J.A., et al. (2015). Best practice recommendations for data screening. Journal of Organizational Behavior.

  • Domingos, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery.

  • Dupuis, M., et al. (2018). Detecting computer-generated random responding in questionnaire-based data: A comparison of seven indices. Behavior Research Methods.

  • Francavilla, N.M., et al. (2018). Social interaction and internet-based surveys: Examining the effects of virtual and in-person proctors on careless response. Applied Psychology.

  • Guilfoyle, J.R., et al. (2019). Sorry is the hardest word to say: The role of self-control in apologizing. Basic and Applied Social Psychology.