Being able to distinguish well-justified arguments from weakly justified arguments is crucial for avoiding misinformation, for academic success, and for informed citizenship. As Wilkinson et al. (2017, pp. 65–66) put it, argument literacy is a “fundamental life skill for productive participation in contemporary society”. Nevertheless, the determinants of individual differences in this skill are not well known. Within the research field of argument literacy, much of the literature centers on people’s ability to produce arguments, while the literature on how people receive and evaluate argumentation is scarcer (Ennis and Weir 1985; Graff 2003; Reznitskaya and Anderson 2006; Voss and Van Dyke 2011). When the focus is on argument evaluation, the aim is often to identify what types of arguments people at the group level find convincing (e.g., Hornikx et al. 2022). In contrast, the present psychological study takes an individual differences perspective on the ability to distinguish strong from weak arguments.

In classical truth-functional propositional logic, statements are either true or false and validity is determined based on truth-values and form (Klement, n.d.). In a radical effort to harness argumentation analysis for real-life argumentation, North American informal logic and European argumentation theory have since the 1960s moved away from a binary truth-concept and a formal understanding of validity, accepting much more dynamic argument forms as well as arguments that are only probable (van Eemeren et al. 2014, pp. 31–39). In these theories, arguments are evaluated based on criteria related to relevance, acceptability, and sufficiency (Johnson and Blair 1977) or how well they fulfil the requirements of the argument scheme in which they occur (van Eemeren and Grootendorst 2004). In this view, argumentation can theoretically be defined as an “activity aimed at convincing a reasonable critic of the acceptability of a standpoint by putting forward a constellation of propositions justifying or refuting the proposition expressed in the standpoint” (van Eemeren and Grootendorst 2004, p. 1) or “the practice of justifying claims under conditions of uncertainty” (Zarefsky 2019, p. 3). Evaluating how well arguments support claims can also be called distinguishing argument strength.

Among psychologists, no clear models have been proposed to describe the cognitive processing stages involved in evaluating arguments (Britt et al. 2014). Nevertheless, the findings are largely in line with the contention that understanding how arguments relate to claims is an important processing stage. For instance, when laypeople think aloud while evaluating arguments, the majority of the issues they raise concern the relevance and appropriateness of the arguments for supporting the claims (Schellens et al. 2017). Moreover, evaluating the relationship between claims and arguments is one of the key skills that sets more skilled arguers apart from less skilled arguers, and which educators hope to develop among students (Kuhn and Modrek 2017; Larson et al. 2009; Münchov et al. 2019; Shaw 1996).

1 Evaluating Different Types of Arguments

To ensure that research findings are generalizable, it is important to cover different types of arguments comprehensively. Existing research on argument evaluation has often focused in depth on one type of argument at a time, in particular on arguments from consequences (Bonnefon 2012; Hahn 2020). However, everyday discourse contains a multitude of argument types. Arguments can be categorized in many different ways (Garssen 2001). For example, in the theory of argument schemes, arguments are categorized by the types of relationships that exist between claims and evidence (van Eemeren et al. 2004; Walton et al. 2008). Argument schemes have been called “the most useful and widely used tool so far developed in argumentation theory” (Walton et al. 2008, p. 1). The importance of argument schemes (also “argumentation schemes”) is widely acknowledged, but there is no consensus on their typology. Suggestions vary from dozens of schemes to only a few (van Eemeren et al. 2014, pp. 19–21; Walton et al. 2008). More schemes add specificity, which is useful when specific types of argumentation are analyzed or evaluated.

In the present study, we use the categorization suggested by Pragma-Dialectic argumentation theory, which is that all arguments fall under three argument schemes: causal arguments, analogous arguments, and symptomatic arguments (van Eemeren et al. 2004). This typology has gained popularity due to the recognition of Pragma-Dialectics but also due to the practical aspect of dealing with a small number of categories. However, due to the importance of arguments from authority and their role in societal debate, psychologists studying argumentation sometimes separate arguments from authority as a fourth type (e.g., Hoeken et al. 2012), although in principle these arguments are also symptomatic. Arguments 1–4 are some of the arguments used in the present study. They present examples of strong arguments from these four argument schemes.

Argument 1. Sixteen-year-olds should be allowed to vote in local elections, because it would increase young people’s participation in political debate. (Causal)

Argument 2. Sixteen-year-olds should be allowed to vote in local elections, because a similar reform was successful in Estonia. (Analogous)

Argument 3. Sixteen-year-olds should be allowed to vote in local elections, because young people want to make a difference. (Symptomatic)

Argument 4. Sixteen-year-olds should be allowed to vote in local elections, because a large study on voting turnout commissioned by Parliament recommends that they should. (Authority)

Reasoners can evaluate arguments using the “critical questions” formulated for each argument scheme (van Eemeren and Grootendorst 2004). For causal arguments, reasoners should ask themselves how likely the presented consequences are: “Does the established cause, in fact, lead to the mentioned result?” (Hitchcock and Wagemans 2011, p. 188; van Eemeren and Grootendorst 1992, pp. 98–106). In the example above, one should evaluate the likelihood that allowing 16-year-olds to vote would indeed increase their participation in political debate. For arguments from analogy, reasoners should evaluate the relevant similarities and differences between the points of comparison (the Finnish context in which the study was conducted, and Estonian society). For symptomatic arguments, reasoners should evaluate whether the property that is mentioned is characteristic, a sign, or a symptom of what is claimed. In the example above, this amounts to evaluating whether wanting to make a difference is a sufficient sign that a reform of the voting age is warranted. Lastly, for arguments from authority, the task is to assess the credibility of the purported authority (a study commissioned by Parliament) for deciding on the issue of voting age. Credibility may be compromised, for example, if the authority has vested interests or lacks relevant expertise.

Previous studies of argument evaluation have found that people largely evaluate arguments in line with the critical questions (Demir and Hornikx 2022; Hoeken et al. 2014; Hornikx and Hoeken 2007). The two aims of the present study are to determine whether the ability to do this is unitary across these different types of arguments, and which cognitive factors predict this ability. The study was preregistered at the Open Science Framework, https://osf.io/ayc5p/?view_only=65d97111808a49bab1fbf4f81726fa58. We followed the preregistered study design, sampling, variables, and assessment measures. We deviated from some of the preregistered hypotheses for reasons described below.

The first aim of the present study was to investigate whether the ability to recognize argument strength in one type of argument generalizes to recognizing strength in the others. Because different types of arguments require thinking about different aspects of issues, it is possible that they to some extent draw on different cognitive abilities or skills, and on different educational and cultural factors. For example, recognizing relevant authorities regarding different issues might be more dependent on cultural literacy or general knowledge than evaluating other types of arguments is. Thus, it is possible that some people are particularly good at one type of argument and poor at others, so that evaluation would not correlate highly across types, but evaluation of arguments within each type would correlate with each other.

Because the critical questions associated with each argument scheme have been clearly described in the literature, we have clear guidelines for designing test materials with arguments that are strong or weak. Selection of test materials based on empirical pretests without explicating a theoretical rationale has recently been criticized by Hoeken et al. (2020). To ease comparison of different types of arguments, the present study presents arguments in a standardized format as short one-sentence items that contain a claim followed by an argument. This format simplifies each argument to its bare bones. Further, to minimize the influence of topic knowledge, we use different types of arguments for the same claims. Based on the theoretical criteria described for each argument scheme (van Eemeren et al. 2004), we have thus constructed pairs of arguments, where one argument is in line with the criteria described by the critical questions, and the other violates them. The participants are instructed to evaluate numerically how well the argument supports the claim. This design allows us to ask about the generality of the trait of argument evaluation, and to determine whether the ability to discriminate the strength of arguments is a general trait across the four types. Thus, Research Question 1 was whether it is possible to dissociate the abilities to discriminate the strength of arguments depending on what type of argument is involved (Footnote 1).

2 Cognitive Factors Predicting Argument Strength Discrimination

The second aim of the present study was to investigate a set of theoretically motivated cognitive factors related to argument evaluation ability, and in case evaluation of different argument types forms separate abilities, whether the predictors are the same for all types of arguments.

First, we investigated whether argument evaluation ability is related to analytic thinking styles. “Analytic thinking styles” is an umbrella term that covers various dispositions. Some of these are more concerned with the quality of thinking, while others are more concerned with its quantity.

Among dispositions related to the quality of thinking, we find beliefs about what makes good thinking. Baron (1995) found that one reason for weak argumentation skills among students lay in the belief that one-sided argumentation is better than argumentation that considers both sides of an issue. That is, weak argumentation skills were related to weak endorsement of standards of thinking known as Actively Open-Minded Thinking (AOT; Baron 1993, 2019; Baron et al. 2023). Subsequent studies, using self-reports of AOT, support this idea (Baron 2019; Baron et al. 2023; Stanovich and West 1997; Stanovich and West 1998; Svedholm and Lindeman 2013; Wolfe 2012; Wolfe and Britt 2008; also see Stupple et al. 2017; Weinstock and Cronin 2003). Kuhn (1991) has also reported that some people maintain, sometimes strongly, their favored explanation for various social phenomena and refuse to search for counterarguments (e.g., Kuhn 1991, ch. 5, especially p. 138), and resist the implications of counterevidence when asked about it (ch. 8, pp. 229–233). In line with these findings, we expect variation in AOT to be positively related to the ability to distinguish the strength of all four studied types of arguments. Because people high in actively open-minded thinking are more concerned about possible alternatives and objections to a favored conclusion, we expect them to be more likely to notice the presence and absence of such alternatives and objections, e.g., to the relevance of an analogy, or to the likelihood and seriousness of consequences.

Another related thinking disposition that focuses on the quality of thinking is intellectual humility, which is the willingness to admit that one may be wrong (Krumrei-Mancuso and Rouse 2016). Because accurately recognizing the strength of arguments may require being open to conclusions other than personally favored ones, we also expect intellectual humility to support argument evaluation ability. As a helpful reviewer pointed out, the dispositions related to the quality of thinking that we study are very similar to the “epistemological virtues” (Zagzebski 1996) and “argumentational virtues” (Aberdein 2010; Cohen 2005) described by philosophers. Thus, our search for cognitive dispositions related to argument literacy resembles the focus of virtue theories of argument on distinguishing qualities of good arguers, rather than good arguments.

Among analytic thinking dispositions more concerned with the quantity of thinking, the most studied are Need for Cognition (NFC) and the dispositions and skills measured by the Cognitive Reflection Test (CRT). Need for Cognition describes the amount of thinking that individuals enjoy engaging in (Cacioppo et al. 1996), and the CRT is designed to measure the ability to refrain from making impulsive judgments, and the amount of thinking that individuals tend to engage in before giving an answer (Baron et al. 2023; Frederick 2005). Studies on the relationships of these dispositions to argument evaluation are few, but they indicate that CRT performance is positively related to argument evaluation (Stupple et al. 2017), and that people high in NFC are more likely to approach argumentative situations (Nussbaum and Bendixen 2003). Based on this and on their well-established roles in predicting academic success and reasoning (Cacioppo et al. 1996; Stanovich 1999, 2009), they likely play a role in argument evaluation as well. If the CRT measures more time spent on thinking and a greater concern with accuracy, then this should likewise lead to more accurate evaluation of argument strength.

Because all these measures might predict the extent to which people try to imagine possible rebuttals to an argument, they all might help evaluate the strength of the argument. In line with these ideas, we investigated the possibility that these two separate aspects of analytic thinking, focusing on quality and quantity, both contribute positively to argument evaluation. Preregistered Hypotheses 3 and 4 were that analytic thinking styles would positively predict argument literacy (Hypothesis 3), but because Actively Open-Minded Thinking and Intellectual Humility are conceptually close to argument evaluation, we expected the relationships to be stronger to these styles than to the Cognitive Reflection Test or to Need for Cognition (Hypothesis 4).

Drawing on the theoretical assumption that argument evaluation relies on analytic and reflective thinking, we investigated a set of other individual differences variables that might predict argument evaluation. Previous studies indicate that argument evaluation ability is negatively related to reliance on intuitive judgments, which are made with little effort (Svedholm and Lindeman 2013). Along the same lines, being too confident of one’s position seems likely to affect argument evaluation negatively. Thus, we preregistered the hypotheses that argument strength discrimination ability is positively related to the amount of mental effort that one puts into the task (Hypothesis 5) and negatively to overconfidence and an intuitive thinking style (Hypothesis 6).

3 The Present Study

In sum, we investigated whether it is possible to dissociate the abilities to discriminate the strength of arguments depending on what type of argument is involved. Further, we examined how people’s ability to distinguish the strength of arguments relates to demographic background factors, analytic and intuitive thinking styles, mental effort, and overconfidence, and whether the predictors are the same for different types of arguments.

In addition to the study sample, we collected data from an expert sample. The expert sample consisted of ten university professors and lecturers in philosophy, literature, communication, or other related fields, who have published scholarly work on or taught argumentation. This population can be assumed to have some of the most sophisticated understanding of argumentation and the greatest competence to evaluate argument strength objectively. The original, preregistered analysis plan was to use the mean ratings of the expert panel as a standard, so that the study participants’ ratings could be scored by correlating them with these expert means.

4 Method

4.1 Participants and Procedure

The study was found ethically acceptable by the University of Helsinki Ethical Review Board in Humanities and Social and Behavioural Sciences in August 2021. The study was implemented as an online survey in Finnish. Advertisements to participate were widely distributed through a Facebook ad targeted at people in Finland, through student mailing lists, and through other Finnish-language social media and general-interest discussion forums. Participants were given the opportunity to review a privacy notice and a participant information sheet. They indicated informed consent by clicking on a link on a starting page that took them to the actual survey page. Because participant recruitment was slower than anticipated, the survey was open for a total of 10 weeks (as opposed to the four weeks planned in the preregistration). The survey presented all measures on one page in the following order: background information (age, gender, self-rated familiarity with argument analysis, education, occupation, field of work or study), Faith in Intuition and Need for Cognition (items intermixed), arguments, Comprehensive Intellectual Humility Scale, Cognitive Reflection Test, Actively Open-Minded Thinking. At the end of the survey, participants were given the opportunity to continue to a follow-up study using think-aloud methodology. Because participation was low (N = 17), the follow-up data were not analyzed for this study.

In total, 343 people completed the study. In line with the preregistered exclusion criteria, we excluded 50 participants who answered one or more of the attention check items (although this did not change the results substantially compared to using the entire sample), and an additional 15 participants because they had more than 20% missing responses on the argument task. The final number of participants included in analyses was 278. Thus, the study had near perfect power (power > .99) to detect large correlations (r > .37), power of .98 to detect medium correlations (r > .24), and power of .38 to detect small correlations (r > .10) at an α level of .05. For detecting within-participant differences, power was .91 to detect small differences (d > 0.20) and > .99 for detecting medium (d > 0.50) or large differences at α < .05.
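
As an illustration, the reported power figures can be approximated in R with the pwr package (an assumption for this sketch; the text does not state which tool was used for the power analysis):

```r
# Minimal sketch of the power calculations (assuming the pwr package;
# the study does not specify the software used).
library(pwr)

n <- 278
# Power to detect small, medium, and large correlations at alpha = .05
pwr.r.test(n = n, r = .10, sig.level = .05)  # power ~ .38
pwr.r.test(n = n, r = .24, sig.level = .05)  # power ~ .98
pwr.r.test(n = n, r = .37, sig.level = .05)  # power > .99

# Power to detect within-participant differences (paired comparisons)
pwr.t.test(n = n, d = 0.20, sig.level = .05, type = "paired")  # power ~ .91
pwr.t.test(n = n, d = 0.50, sig.level = .05, type = "paired")  # power > .99
```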

In the final sample, 60% were female, 31% were male, 3.6% reported another gender, and the rest preferred not to say. The sample represented a wide age range, 17–78. The mean age was 40 years, SD = 15.5 years.

The participants included both people currently working (44%) and students (37%); 18% were otherwise occupied (e.g., retired). The majority had completed tertiary education (58%) or upper secondary education (28%). 10% had graduate degrees while 4% had only secondary education. All major fields of occupation or study were represented, including humanities and arts (34%), social sciences (25%), technology (17%), natural sciences (12%), medicine (9%), and agriculture, forestry, and veterinary science (2%). Information was missing for 2%.

4.2 Measures

4.2.1 Cognitive Predictor Variables

To estimate the reliabilities of the predictor scales, we used McDonald’s omega total, which is an appropriate estimator of reliability as it makes few assumptions about the dimensionality of the constructs measured (McDonald 1999; Revelle and Condon 2019). For estimation we used the omega function of the psych package for R (Revelle 2021).
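
A minimal sketch of this estimation, assuming a data frame of item responses (the data frame and column names below are hypothetical, not taken from the study's code):

```r
# Sketch: estimating McDonald's omega total with the psych package.
# `survey_data` and the item column names are hypothetical placeholders.
library(psych)

fii_items <- survey_data[, paste0("fii_", 1:10)]  # e.g., the ten Faith in Intuition items
om <- omega(fii_items, plot = FALSE)
om$omega.tot  # omega total, the reliability estimate reported in the text
```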

Faith in Intuition was assessed using the 10-item (ω = .82) Intuition subscale of the Multifaceted Rational-Experiential Inventory (REIm; Norris and Epstein 2011). Items were rated on a 5-point scale (1 = completely disagree, 5 = completely agree). An example item is “I enjoy learning by doing something, instead of figuring it out first.”

Need for Cognition was assessed using the 12-item (ω = .89) Rationality subscale of the REIm, which is conceptually equivalent to NFC. Items on this scale refer to the preferred amount of thinking and to its quality, e.g., being “logical” or “analytic.” An example item is “I have a logical mind.”

Intellectual humility was assessed using the Comprehensive Intellectual Humility Scale (CIHS; Krumrei-Mancuso and Rouse 2016). It consists of 22 items (ω = .85), rated on a 5-point scale (1 = strongly disagree, 5 = strongly agree). An example item is “I can have great respect for someone, even when we don’t see eye-to-eye on important topics.”

To assess cognitive reflection, we used three Cognitive Reflection Test (CRT) type items from Thomson and Oppenheimer (2016; items “Emily”, “race”, and “farmer”) and three items from Toplak et al. (2014; items 4: “barrels of water”, 5: “15th lowest/highest”, and 6: “buying and selling a pig”). An example item is “Jerry received both the 15th highest and the 15th lowest mark in the class. How many students are in the class?” The task was scored as the ratio of correct responses (ω = .71).

Actively Open-Minded Thinking (AOT) was assessed using the 7-item (ω = .80) scale of Haran et al. (2013). Participants rated their agreement with each statement. To match the format of the other scales in the study, we used a 5-point scale (1 = completely disagree, 5 = completely agree), instead of the original 7-point rating scale. An example item is “People should revise their beliefs in response to new information or evidence.”

To assess overconfidence, we asked participants, after the CRT, to estimate how many problems they thought they had answered correctly. This estimate minus the actual number of correct responses formed the overconfidence score. A positive difference indicates overconfidence while a negative difference indicates underconfidence.
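
A minimal sketch of this scoring (all variable names are hypothetical):

```r
# Sketch of the CRT and overconfidence scoring; `crt_scored` (0/1 per item)
# and `crt_estimate` (self-estimated number correct) are hypothetical names.
crt_correct    <- rowSums(crt_scored)          # actual number of correct CRT responses (0-6)
crt_score      <- crt_correct / 6              # CRT scored as the proportion of correct responses
overconfidence <- crt_estimate - crt_correct   # positive = overconfidence, negative = underconfidence
```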

To assess how much effort participants put into the tasks, we used the Mental Effort Scale of Paas (1992), which consists of a single self-report item asking participants to indicate how much effort they invested in the preceding task. We presented the item twice: after the arguments, and after the Cognitive Reflection Test. The item was scored using seven response options from “very, very low mental effort” to “very, very high mental effort.” Responses to the two presentations of the item (r = .40) were averaged to form a general mental effort score.

4.2.2 Familiarity with Argumentation Analysis

To explore whether familiarity with argumentation analysis affected the ability to evaluate argument strength, the study included the following self-report item: “Have you studied argumentation, critical thinking, or in some other way acquainted yourself with argument analysis, for example, by taking courses in logic, argumentation theory, critical thinking, or the like, or by reading literature such as the book Argumentti ja kritiikki [Argumentation and critique] by Kakkuri-Knuuttila?” Responses were given on a five-point scale (1 = not at all, 2 = a little, 3 = to some extent, 4 = quite much, 5 = very much). Most participants reported no (37%) or a little (35%) familiarity with argument analysis. 18% reported “some” familiarity, while only 7% reported “much” and 3% “very much.” However, this variable was not related to argument evaluation, and not analyzed further.

4.2.3 Attention Check

The questionnaire included three attention check items (e.g., “I am sometimes deeply moved by poetic language.”). The instructions were as follows: “Below you will find some statements with the same response scale as the other questions. Do not respond to these statements at all. They are included only for the purpose of checking whether you are reading the instructions and responding to questions carefully.”

4.2.4 Arguments

We created 80 sentences on topics related to society, law, food, health, and lifestyle. Twenty sentences (six with slightly modified wording) were taken from Svedholm-Häkkinen & Hietanen (submitted), the rest were formulated for the present study. Each sentence consisted of a claim followed by an argument. For each of 10 claims, there was one weak and one strong argument from each of the four argument schemes. The strong argument was in line with the critical questions for the argument scheme in question, while the weak one violated the critical questions. In strong causal arguments, the consequences were likely, and in weak causal arguments, they were unlikely. In strong analogous arguments, the point of comparison was similar, and in weak analogous arguments, the point of comparison was different from the target in relevant respects. In strong authority arguments, the authority was credible, while in weak authority arguments, the credibility of the authority was marred by partiality or lack of expertise. Finally, in strong symptomatic arguments, the property that was mentioned was relevant, while in weak symptomatic arguments the property was irrelevant. Thus, each scheme was represented by 10 argument pairs.

All participants rated all arguments, thus holding relevant topic knowledge approximately constant across the various arguments. The presentation order of the arguments was counterbalanced as follows. First, the eight arguments within each of the 10 topics were ordered in one of eight different orders using a balanced Latin square design. Two of the orders were used twice. Next, we selected the first argument from each topic, then the second argument from each topic, then the third, and so on. This resulted in an order in which the presentation of each argument type was counterbalanced across the list, and the distance between arguments for the same topic was constant (10 items). To further balance out presentation order, four different permutations of this list were created by switching the first and second halves and presenting both orders forwards and backwards. The arguments were presented to each participant in one of these four orders. The starting page of the study had a link that randomly directed participants to one of four online questionnaires, using JavaScript embedded in the page. Apart from the order of the arguments, the four questionnaires were identical. The instructions at the beginning of the task were:

Next, your task is to rate the strength of arguments, that is, how well the presented justifications support the presented claim.

You may personally disagree with both claims and justifications. However, you should not base your responses on your opinions of the claims or associated issues, but only on how well the presented justifications support the presented claim. In other words, even if you do not agree with the claim, you should still assess how well the justifications support the claim.

That is, the justifications may be good even if you find the claim poor – or vice versa.

The claims have been created for this study and they do not necessarily represent the official opinions of the parties mentioned in these materials. Nevertheless, you should think of the justifications as if they were real. That is, do not assess whether someone has actually stated something, only how well it supports the claim. The sentences vary in linguistic form, as they do in real life. Thus, your task is not to evaluate their linguistic form, only the content.

If you find that the justifications support the claim very weakly, choose −2. If you find that the justifications support the claim very strongly, choose + 2. If you neither strongly disagree nor agree, choose an appropriate option between these extremes.

Halfway through the list of arguments, the attention check was presented. After the attention check, the rest of the arguments were presented and the instructions repeated.

As a measure of the ability to distinguish strong from weak arguments, we computed, for each participant, the difference between the evaluations of the strong and the weak argument within each of the 40 argument pairs. We used these difference scores as the main unit of analysis. To form measures of the ability to evaluate each type of argument, we averaged the difference scores on each of the four 10-pair “subscales” corresponding to the argument schemes. Missing data in the argument pairs were imputed by filling in the mean response of that participant for that argument scheme (Footnote 2).
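
The following sketch illustrates this scoring procedure under the assumption of a long-format data layout (all object and column names are hypothetical):

```r
# Sketch of the argument-task scoring: strong-minus-weak difference scores,
# scheme-mean imputation of missing pairs, and subscale/overall means.
# `ratings_long` (one row per participant x pair x strength) is hypothetical.
library(dplyr)
library(tidyr)

pair_scores <- ratings_long %>%
  pivot_wider(names_from = strength, values_from = rating) %>%
  mutate(diff = strong - weak) %>%                 # difference score per argument pair
  group_by(participant, scheme) %>%
  mutate(diff = ifelse(is.na(diff),
                       mean(diff, na.rm = TRUE),   # impute with participant's scheme mean
                       diff)) %>%
  ungroup()

scheme_scores <- pair_scores %>%                   # four 10-pair "subscale" means
  group_by(participant, scheme) %>%
  summarise(score = mean(diff), .groups = "drop")

overall_score <- pair_scores %>%                   # full 40-pair score
  group_by(participant) %>%
  summarise(score = mean(diff), .groups = "drop")
```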

4.2.5 Expert Ratings of Arguments

In a separate pilot study, an expert panel consisting of ten university professors and lecturers in philosophy, literature, communication, or other related fields, who have published scholarly work on or taught argumentation, rated the same arguments on the same scale that was used by the participants.

4.3 Analysis

The data were analyzed using R (R Core Team 2022) and the psych (Revelle 2021) and effectsize (Ben-Shachar et al. 2020) packages for R. The data and R code are available at https://osf.io/ayc5p/?view_only=96d26965b5f743a58f7b5b9d47c8fdf1.

5 Results

Table 1 shows descriptive statistics for the studied variables.

Table 1 Descriptive statistics for the studied variables

5.1 Argument Evaluations by Study Participants and Experts

Exploring individual arguments, we found that the participants rated the strength of most arguments in line with our intentions, with 90% of the purportedly weak arguments given average ratings below the scale midpoint, and 63% of purportedly strong arguments above the scale midpoint. The average rating for the strong arguments was +0.19, while the average rating for the weak arguments was −1.11, on the scale ranging from −2 to +2. Within all argument pairs, the relative pattern was in line with our intentions: both experts and participants rated the purportedly strong argument as stronger than the purportedly weak argument. These differences were statistically significant in the majority of pairs (see Appendix for details). On average, the participants gave the strong argument within a pair a rating 1.30 points higher than the weak one (SD = 0.35). For experts, the mean difference between strong and weak was 1.61. The Appendix shows means and standard deviations of all arguments (Footnote 3).

As an exploratory analysis, we also inspected the ratings averaged for each type of weak and strong argument. Table 2 shows descriptive statistics for each argument type for experts and study participants. As the table shows, the expert ratings support the validity of the arguments, as the mean expert rating for each type of purportedly strong argument was positive and the mean for each type of purportedly weak argument was negative. On weak arguments, the participants largely seemed to agree with the experts. In contrast, the participants seemed to rate all types of strong arguments, as well as weak authority arguments, more negatively than the experts did, even rating the purportedly strong analogous and authority arguments below the scale midpoint. The average correlation between participants’ ratings and the mean ratings of the expert panel across all 80 arguments was r = .62.

Table 2 Mean ratings of strong and weak items of each type by experts and study participants

5.2 Unitary Ability or Separate Dimensions

To investigate RQ1 (whether argument evaluation ability is dissociable across types of arguments), we analyzed the dimensionality of the difference scores (strong minus weak) for the forty argument pairs. These variables correlated with each other, although not highly. The reliability (McDonald’s omega) of the 40-pair scale was .71.

Several analyses indicate that, to a first approximation, one factor is sufficient to explain all inter-pair correlations. The scheme-specific scales seem to provide very little information. First, the average difference scores computed for each of the four argument schemes (the four “subscales”) correlated with each other moderately to strongly (r’s = [.18, .45], all p’s < .001, corrected for multiple analyses). These correlations were limited because the subscales themselves contained error of measurement, with reliabilities (ω): causal .46, analogous .35, authority .65, symptomatic .52.

Second, although the reliability coefficient for the 40-pair scale was only .71 (as noted above), Revelle’s measure of unidimensionality (unidim) was .96. The unidimensionality measure is the ratio of the correlation matrix implied by a unidimensional model, and the observed correlation matrix minus the uniquenesses. The higher this ratio is, the more evidence there is for unidimensionality (Revelle 2021). In sum, although the fit of a one-dimensional model is poor, the problem is due to unique determinants of individual pairs rather than an incorrect model.
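
As a sketch, both statistics can be obtained from the psych package, assuming the 40 pair difference scores are held in a participants-by-pairs data frame (the object name below is hypothetical):

```r
# Sketch: reliability and unidimensionality of the 40 pair difference scores.
# `pair_diffs` (278 participants x 40 pairs) is a hypothetical object name.
library(psych)

omega(pair_diffs, plot = FALSE)  # McDonald's omega total for the 40-pair scale
unidim(pair_diffs)               # Revelle's unidimensionality index (reported as .96)
```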

Third, as a further test of whether the four subscales capture meaningful dimensions in the data, we fitted a one-factor model to the data (i.e., we standardized the ratings of the arguments to give each item equal weight, and used the first component of a principal components analysis of the standardized ratings) and examined the residuals and their correlations. Specifically, we tested the alpha reliability of these residuals (as alpha is a measure of correlations between items). If the residuals of items from a subscale showed high alpha reliability, it would indicate that the residuals have a common underlying cause other than the extracted factor. Moreover, we tested whether the residual of each item correlated with the mean score of the remaining items in the scale that this item presumably belongs to (the item-total correlations). Table 3 shows the results for standardized alpha and item-total correlations. It appears that the individual subscales do capture a small amount of variance not accounted for by the one-dimensional model, especially the Authority and Causal subscales. Two of the α values were significantly different from zero at p < .05: Authority and Causal (p = .011 and p = .031, respectively, by bootstrap).
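
A sketch of this residual analysis, under the assumption that the pair difference scores are analyzed with the psych package (object and column names are hypothetical):

```r
# Sketch of the residual analysis: extract the first principal component,
# compute residuals, and check whether residuals within a subscale still cohere.
library(psych)

z  <- scale(pair_diffs)                        # standardize the 40 pair scores
pc <- principal(z, nfactors = 1, scores = TRUE)
fitted_part <- pc$scores %*% t(pc$loadings)    # part explained by the first component
residuals_z <- z - fitted_part

# Hypothetical naming scheme: columns of one subscale share a prefix
authority_cols <- grep("^authority", colnames(pair_diffs))
alpha(as.data.frame(residuals_z[, authority_cols]))  # standardized alpha of the residuals
```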

To examine the Authority scale more closely, we looked at the 45 correlations of each of the 10 pairs with every other pair in that scale. Seven of these were clear outliers on the high side, separated by a large gap (U1-U5, U1-U10, U2-U3, U2-U4, U3-U4, U3-U6, U5-U8; see pair numbers in the Appendix). Examination of these pairs yielded no single explanation of all of them. A couple of these high correlations seemed to involve trust in government, since one of the repeated authorities for “strong” arguments was a government body. Similarly, we examined the residual correlations of the pairs in the Causal scale. However, there was no discernible pattern in these correlations despite them being significantly different from zero.

Table 3 Analysis of residuals from a one-dimensional model

In sum, although much of the variance is specific to individual argument pairs (hence “noise” in our analyses), it seems that measures of sensitivity to argument strength do not form dimensions corresponding to different kinds of arguments. Thus, we used scores on the overall measure (the full 40-pair score) in the remaining analyses.

5.3 Factors Related to Argument Strength Discrimination Ability

To analyze Hypotheses 3–6, we investigated the bivariate correlations of argument evaluation ability and the other studied variables. These are shown in Table 4. We used disattenuated correlations, which take into account measurement error and show what the correlations would be if the constructs were measured with perfect reliability (Revelle 2022, Ch. 7).
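
For clarity, the classical correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities; a minimal sketch with illustrative values (not the study's actual correlations):

```r
# Sketch: Spearman's classical correction for attenuation.
# rel_x and rel_y are the reliability estimates (omega) of the two measures.
disattenuate <- function(r_xy, rel_x, rel_y) r_xy / sqrt(rel_x * rel_y)

disattenuate(r_xy = .30, rel_x = .71, rel_y = .80)  # illustrative values only, ~ .40
```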

Of interest in Table 4 is that the quality and quantity aspects of thinking seemed to contribute equally to argument strength discrimination ability, in line with Hypothesis 3 (analytic thinking in general predicting argument strength discrimination ability) and against Hypothesis 4 (the quality aspect being a stronger predictor than the quantity aspect): the correlations of the argument strength discrimination measure with Need for Cognition and Cognitive Reflection Test performance on the one hand were similar to its correlations with Actively Open-Minded Thinking and the Comprehensive Intellectual Humility Scale on the other. However, while both the CRT and the AOT overlapped moderately with NFC, they were only weakly related to each other. The CRT was also unrelated to the CIHS, while the other two analytic thinking dispositions showed stronger positive relationships to the CIHS.

Unexpectedly, argument strength discrimination was unrelated to an intuitive thinking style and to the self-reported mental effort put into the evaluations (against Hypotheses 5 and 6, according to which mental effort would predict argument strength discrimination ability positively, and overconfidence and an intuitive thinking style would predict it negatively). There was a slight negative relationship with overconfidence, but note that the overconfidence score is a composite measure computed from CRT accuracy and is therefore largely determined by accuracy.

Argument strength discrimination was unrelated to age but increased a little with completed educational degree. No differences were found across fields of study or work, or across occupational groups (one-way ANOVAs: p’s = .15 and .08). Male participants had slightly weaker argument strength discrimination than female participants did.

Lastly, to examine how well the argument strength discrimination measure can be predicted from the four analytic thinking style constructs measured by the Actively Open-Minded Thinking scale, the Need for Cognition scale, the Cognitive Reflection Test, and the Comprehensive Intellectual Humility Scale, we conducted a multiple regression analysis calculated from the disattenuated correlations. This analysis led to conclusions highly similar to those from inspection of the bivariate correlations. The β coefficients were as follows. AOT: β = .18; NFC: β = .17; CRT: β = .18; CIHS: β = .15. The regression overall had an R2 value of .19, which corresponds to a multiple correlation of .44 (the correlation between predicted and obtained scores), suggesting some overall relationship with analytic cognitive style.
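
Because this regression was computed from a correlation matrix rather than raw scores, the standardized coefficients can be obtained by solving the normal equations directly; a sketch with placeholder values (not the study's actual correlations):

```r
# Sketch: standardized regression from a (disattenuated) correlation matrix.
# R_xx = predictor intercorrelations, r_xy = predictor-criterion correlations.
# All numeric values below are placeholders, not the study's results.
preds <- c("AOT", "NFC", "CRT", "CIHS")
R_xx <- matrix(c(1.00, 0.35, 0.15, 0.30,
                 0.35, 1.00, 0.30, 0.20,
                 0.15, 0.30, 1.00, 0.05,
                 0.30, 0.20, 0.05, 1.00),
               nrow = 4, dimnames = list(preds, preds))
r_xy <- c(AOT = .30, NFC = .30, CRT = .25, CIHS = .25)

beta <- solve(R_xx, r_xy)   # standardized regression coefficients
R2   <- sum(beta * r_xy)    # variance explained
sqrt(R2)                    # multiple correlation (predicted vs. obtained scores)
```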

Table 4 Correlations among the studied variables, scale reliabilities (omega; on the diagonal), and disattenuated correlations above the diagonal

6 Discussion

The ability to tell poorly justified from well-justified arguments is one of the core components of argument literacy. Argument strength can be determined in different ways. Regarding the specific relationship between single arguments and standpoints, Pragma-Dialectic argumentation theory focuses on critical questions, which depend on the argument scheme in use (Hitchcock and Wagemans 2011; van Eemeren and Grootendorst 1992). The present study investigated whether the ability to distinguish the strength of one type of argument generalizes to other types of arguments. Further, we investigated a set of cognitive factors that were theoretically expected to predict these abilities.

6.1 Is Argument Strength Discrimination Dimensional?

Framing our study using the concept of argument schemes enabled us to systematically examine different types of arguments that appeal to different types of justifications. We were able to ask whether the cognitive determinants of argument-strength discrimination might differ between types of arguments. However, the present data indicated that, to a first approximation, individual differences in argument-strength discrimination were general across four different argument schemes.

Aside from the single general factor that affected all arguments, some pairs of authority arguments seemed to be affected by other determinants. For this type of argument, strength depends on trust in the various authorities. Trust can vary as a function of several determinants, such as attitudes toward particular sources (pro- or anti-government, for example), concern with conflict of interest (when authorities advocate whatever is best for their group), true expertise vs. formal responsibility, signs of actively open-minded thinking by the authority in question (Baron 2019), as well as relevance of the authority to the topic. Variation in these factors likely explained the correlations between the authority arguments. Future efforts might try to hold constant all factors except relevance, which is the one most like the determinants of strength in other types of arguments.

6.2 Complementary Effects of Thinking Things Through, and Thinking of Both Sides of Issues

Regarding the studied predictors, the present study indicated that argument strength discrimination ability was not related to demographic factors such as age or field of work, but it had meaningful associations with cognitive factors. Overall, argument evaluation had a positive relationship with analytic thinking dispositions. The present findings indicate that two partly dissociable aspects of analytic thinking dispositions have complementary positive relationships to the ability to evaluate argument strength. On the one hand we have dispositions such as Need for Cognition and Cognitive Reflection, which are concerned with the quantity of thinking: enjoyment of thinking, seeking thinking activities, concern with putting effort into thinking before making judgments, and a greater concern with accuracy and less concern with speed (Frederick 2005; Petty et al. 2009).

On the other hand, we have dispositions that are concerned with the quality of thinking. This aspect is captured by constructs such as Actively Open-Minded Thinking and Intellectual Humility, and by measures of overconfidence. Of note, these differ from the former in that the standards of AOT do not imply that more thinking is always better; rather these standards are consistent with the belief that it is sometimes better to turn to trustworthy sources than to try to decide on one’s own (Baron et al. 2023). The AOT scale that we used is almost entirely about willingness to look for reasons why a favored conclusion might be wrong, a process that can begin immediately and does not require extensive thought (although it may be found more often when thought is extensive). Similarly, intellectual humility involves accepting that one’s knowledge and cognitive faculties are limited and imperfect (Krumrei-Mancuso and Rouse 2016). Such imperfection may result from not wanting to bother to work hard enough to reach a clear conclusion. In the present data, as in previous studies (Baron et al. 2015; Janssen et al. 2019; Swami et al. 2014), NFC overlapped with both AOT and CRT, while CRT and AOT were largely independent of each other, supporting the notion of two related, but partly dissociable aspects of analytic thinking dispositions.

6.3 Some Considerations Regarding the Possible Intuitiveness of Argument Strength Discrimination

By the theoretical account indicated by these findings, distinguishing argument strength is a task that requires that one thinks extensively, that one is cognizant that even arguments that are against one’s personal opinions may have merit, or both. Therefore, it was surprising that argument strength discrimination ability was not negatively related to the tendency to trust one’s first impressions and also unrelated to the amount of effort that participants reported that they had put into the task.

However, the finding that putting more effort into the task did not improve performance may be understood as being in line with recent research on formal reasoning, which suggests that better reasoning may be due to more accurate intuitions rather than to expended effort (Thompson and Johnson 2014; Thompson et al. 2018). Similarly, Mercier and Sperber (2011) assert that argumentation is based on intuitive rather than analytic thinking: “Intuitions about arguments have an evaluative component: Some arguments are seen as strong, others as weak” (p. 59). However, this assertion does not account for the moderately strong positive relationships between argument strength discrimination and analytic thinking dispositions found in the present study. Moreover, a recent study found that argument evaluations given under time pressure (a condition thought to bring out intuitive judgments) and without time pressure were dissociable (Hornikx et al. 2022). Thus, more research is needed to clarify the role of intuitive judgments in argument evaluation.

6.4 What to Teach if We Want to Improve Argument Strength Discrimination

Turning now to the question of how to teach argument evaluation, this ability seems to be affected by at least three types of factors. First, previous research indicates that one of the strongest determinants of argumentation ability is general cognitive ability (Voss and Means 1991). This likely consists partly of general capacities that cannot be modified by education, such as mental speed and attentional capacity (Baron 1987) (Footnote 4). Second, our results suggest that argument evaluation is affected by general cognitive styles that presumably affect all thinking. Considerable evidence indicates that general cognitive styles such as AOT can be influenced by education (e.g., Nickerson et al. 1985/2014; Perkins 2019), and perhaps even more so by education broadly defined to include what comes from families and acquaintances as well as schools (Baron et al. 2023). As Baron (1985, p. 15) puts it, “Although you cannot improve working memory by instruction, you can tell someone to spend more time on problems before she gives up, and if she is so inclined, she can do what you say.” However, the correlations we found between argument strength discrimination and these styles were not high even after corrections for imperfect reliability of measures. Thus, it makes sense that other determinants were present.

Yet there is a third element to thinking, consisting of more specific “mindware” (Perkins 1995; Perkins et al. 1993; Stanovich 2018), that is, useful cognitive tools that are somewhat domain-specific, such as applicable knowledge of logic and basic statistical inference. These tools might affect evaluation of different types of arguments to different degrees. Knowledge of logic and statistical reasoning might be particularly useful in the evaluation of causal arguments. For arguments from authority, evaluation is dependent on specific cultural knowledge about the authorities in question. Thus, teaching argument evaluation will to some extent involve teaching this specific mindware. This is similar, but not identical, to teaching students to attend to factors that are relevant for answering the critical questions associated with each argument scheme.

The present data invite a few other observations. The experts rated most arguments as we had intended, supporting the validity of the study materials as representing relatively strong and relatively weak arguments. Examining the results closely revealed that while the participants largely recognized the weakness of weak arguments as well as the experts did, most differences between study participants and experts arose in evaluations of the strong arguments. That is, while the experts acknowledged that the strong arguments were strong, the participants tended to rate even the strong arguments as being weak. Thus, it is possible that developing expertise in analyzing arguments is not so much about being increasingly critical towards poor arguments. Instead, it may be equally or more important to learn to recognize the signs of well-justified arguments.

6.5 Caveats, Limitations, and Future Directions

By structuring our approach on argument schemes, we strove to cover a broad variety of arguments. We focused on the three top-level schemes described by Pragma-Dialectic argumentation theory: causal, analogous, and symptomatic arguments (van Eemeren et al. 2004). In addition, we examined authority arguments, which are a subtype of symptomatic arguments, because these have received previous attention in psychology. However, our set of argument types was not exhaustive. For example, as a reviewer pointed out, all of the causal arguments that we used were of the subtype of pragmatic arguments (“something should be done”). In further research, it may be fruitful to replicate the study using different typologies of arguments, as it is possible that they would show signs of dimensionality, which the presently studied categorization did not show.

Moreover, one aspect of causal arguments that we did not control for was the desirability of the consequences. Studies have shown that desirability affects evaluations even more than likelihood does (O’Keefe 2012). Even though desirability is not part of the critical questions formulated in the argumentation literature, future studies would do well to take it into account due to its large practical effect. Similarly, future studies should ask for the participant’s own views or prior knowledge about the issues that the arguments concern, because these are known to affect evaluations strongly (belief bias or myside bias; McCrudden and Barnes 2016; Taber and Lodge 2006).

Finally, practical argumentation is a highly contextual activity. In real life, the soundness of arguments is dependent on context (Hahn 2020; Hitchcock and Wagemans 2011; van Eemeren and Grootendorst 1992). Although the claims and their arguments in the present study were decontextualized from their immediate context (e.g., a political debate or a letter to the editor), some of them were still highly dependent on their wider cultural context. Most clearly, the value of different authorities is impossible to determine unless one has knowledge about them. Other types of arguments may be more universal and function equally well in other countries and languages. Thus, researchers working in other cultural or linguistic settings wishing to use our approach to assessing argument literacy may need to adjust the individual arguments.

7 Conclusion

The present study integrated argumentation research with the psychology of thinking and reasoning by investigating the cognitive determinants of individual differences in the ability to distinguish argument strength. The results indicated that argument strength discrimination ability did not differ across four major argument types, and that participants with more analytic thinking dispositions, emphasizing both quality and quantity of thinking, were better at discriminating weak from strong arguments across all major argument types.