The Likert scale (Likert, 1932) is widely used to measure respondents’ psychological characteristics. However, it is also known that the data obtained from a Likert scale tend to be affected by response biases (for more details and examples, see e.g., Chen, Lee, & Stevenson, 1995; Ferrando & Lorenza-Seva, 2010; He, Bartram, Inceoglu, & Vijver, 2014; Kam & Meyer, 2015; Liu, Harbaugh, Harring, & Hancock, 2017; Phelps, Schmitz, & Boatright, 1986; Rammstedt & Farmer, 2013; Salgado, 2016; Thunholm 2001; Usami, Sakamoto, Naito, & Abe, 2016; Viswesvaran & Ones 1999). These response biases are considered as one of the main sources of measurement error; therefore, they may harm the validity of the measurement (Podsakff, MacKenzie, Lee, & Podsakoff, 2003).

One promising solution for dealing with response biases is to move away from the Likert measurement and instead present items in a comparative manner. In the two-alternative multidimensional forced-choice (2MFC) format, two paired statements are simultaneously presented as a questionnaire item. Respondents are asked to choose one of the two statements that best describes themselves. An example of 2MFC items is the pair of statements “Get stressed out easily” and “Don’t talk a lot.” Each statement in an item corresponds to different dimensions of psychological traits to be measured, such as emotional stability and extraversion in the preceding example. The 2MFC format should reduce the impact of uniform response biases because it makes it impossible to process uniformly elevated or decreased judgment across all items (Brown & Maydeu-Olivares, 2018). In addition, there is evidence that the use of the forced-choice format can also reduce the influence of socially desirable responding and improve predictive validity (e.g., Christiansen, Burns, & Montgomery, 2005; Jackson, Wroblewski, & Ashton, 2000; Salgado & Táuriz 2014).

However, when we code the 2MFC responses as binary scores, i.e., 1 for chosen and 0 for unchosen statements, the resultant variables become ipsative. Ipsative data suffer from statistical problems such as distorted validity (for more details, see e.g., Brown 2010; Brown & Maydeu-Olivares, 2011; Hicks, 1970; Johnson, Wood, & Blinkhorn, 1988).

A promising solution to the problem of ipsative data is to apply the framework of item response theory (IRT; Rasch, 1960; Lord, 1980) instead of summing the binary-coded scores. For this purpose, Brown and Maydeu-Olivares (2011) proposed a novel and general IRT model of forced-choice format observations called the Thurstonian IRT model by extending Thurstone’s Law of Comparative Judgment (Thurstone, 1927) to an IRT formulation. By incorporating the response process that is appropriate for forced-choice data, the Thurstonian IRT model succeeded in simultaneously estimating both item and respondent parameters and is free from the abovementioned problems of binary-coded ipsative data. To date, Thurstonian IRT is considered to be the only model that can be readily applied to data collected with existing multidimensional forced-choice format questionnaires, with the additional objectives of estimating item parameters and correlations between the latent trait dimensions (Brown & Maydeu-Olivares, 2018).

For ordinary Likert-scale measured personality data, another extension of the IRT model has recently attracted researchers’ attention: the joint modeling of the item response and response time. The incorporation of the response time is particularly important because, nowadays, more and more questionnaires are conducted with computers instead of paper and pencils. In computer-based measurement, the response time information can be obtained with little additional cost along with the item responses. Various studies have discussed the utility of incorporating response time information in, e.g., classifying response behaviors, removing aberrant responses, and improving the estimation accuracy (e.g., Bertling and Weeks, 2018; Kong, Wise, & Bhola, 2007; Wang, Xu, & Shang, 2018; Wise & Kong, 2005). The general validity of online measurement has been demonstrated for both personality characteristics and response times (e.g., Condon, 2018; Raz, Bar-Haim, Sadeh, & Dan, 2014).

The D-diffusion IRT model (Tuerlinckx & De Boeck, 2005) is a promising cognitive psychometric model (Batchelder, 1998; Vandekerckhove, 2014; Ranger et al., 2017) for the joint modeling of item responses and response times for personality questionnaire items. It combines two key modeling ideas: IRT models that have the virtue of disentangling respondent and item characteristics and cognitive process models that are grounded in cognitive theory to implement validated mechanisms for performing response tasks. Because IRT models were originally developed for ability measurement, many response-time-incorporated IRT models assume that the respondent is more skilled when they answer an item faster. However, the response time in personality measurement should have different characteristics. Namely, the response time tends to become longer when the relative position of a latent respondent parameter is close to the corresponding item parameters, regardless of the direction; this phenomenon has been called the distance–difficulty hypothesis (Kuncel, 1973; Ferrando & Lorenzo-Seva, 2010). This relationship is analogous to a well-known psychophysical result that the uncertainty of responding to an stimulus is maximum when the stimulus is near the respondent’s psychological threshold (Ferrando, 2006). To date, the distance–difficulty hypothesis has been a leading theory for explaining the response time of personality measurement items (Molenaar & Bolsinova, 2017).

The D-diffusion IRT model, which will be mathematically introduced in the next section, implements this distance–difficulty relationship of personality measurement in a natural and quantitative manner. The model has a good parameter recovery property (Ranger, Kuhn, & Szardenings, 2016) and has begun to be applied to empirical psychological data such as the measurement of extraversion (Molenaar, Tuerlinckx, & Maas, 2015). Furthermore, by explicitly taking into account the cognitive process underlying the observed responses, the D-diffusion IRT model allows us to obtain psychologically meaningful parameter estimates such as the speed of information uptake and the amount of information used to make a decision. This helps us to quantify and empirically test psychological theories in an applied context (van Ravenzwaaij & Oberauer, 2009).

However, so far, Thurstonian IRT and D-diffusion IRT models have taken two different pathways of model development. In fact, the authors are not aware of any cognitive psychometric models that can jointly model the item response and response time information of personality measurements collected in the 2MFC format. Consequently, the primary objective of the current study is to propose and examine such a model by integrating the Thurstonian IRT and D-diffusion IRT models. A Bayesian modeling approach (Lee & Wagenmakers, 2013) is adopted in the proposed method.

The remainder of this paper consists of five sections. In the “Existing models” section, we review the abovementioned existing models with more mathematical depth. Then, we introduce the proposed model, which is named the Thurstonian D-diffusion IRT model, in the following section. The “Simulation study” section describes a simulation study that was conducted to check parameter recovery. The “Real data application: Big-Five data” section provides the application of the proposed Thurstonian D-diffusion IRT model to real psychological data to demonstrate that it successfully combines the item response and response time information. Finally, a general discussion and possible directions for further research are provided in the “Discussion” section.

Existing models

Diffusion model

The following descriptions of the diffusion and D-diffusion IRT models are adapted and modified from Bunji and Okada (2019). The diffusion model (Ratcliff, 1978) is a model of the cognitive process that underlies a respondent’s response in decision making tasks (for reviews, see Ratcliff & McKoon, 2008; Voss, Nagler, & Lerche, 2013). This model decomposes the observed item response and response time into several different psychologically meaningful parameters. One of the major advantages of the diffusion model is that it addresses the tradeoff between the response accuracy and the response speed thanks to this decomposition. This facilitates the quantification of individual differences in the cognitive processes. Figure 1 shows a schematic of the diffusion model. When an item is presented to the respondent, the cognitive information required to respond to the item accumulates over time from the starting point (z). Once it reaches the upper (a) or lower (0) boundary, the respondent answers the item. The diffusion model was originally aimed at two-alternative forced-choice tasks, and the upper and lower boundaries correspond to each of these options. The increase in the amount of information in a unit time is assumed to follow a normal distribution with mean v and variance s2.

Fig. 1
figure 1

Schematic of the diffusion model (figure adapted from Bunji & Okada, 2019)

The diffusion model has four main parameters. a represents the distance between the upper and lower boundaries. A larger value of a means that a longer time is required to answer the item, which suggests that the respondent’s choice is more deliberate and cautious in responding. z is the starting point. When there is no initial bias towards either of the two boundaries, z equals a/2. v represents the average slope of the information accumulation process. The process approaches the upper (resp. lower) bound when v is positive (resp. negative). For identification of scale, the variance s2 is commonly fixed as s2 = 1. τ represents the nonresponse time, i.e., the duration of nondecisional processes that may comprise basic encoding processes.

D-diffusion IRT model

Tuerlinckx and De Boeck (2005) extended the diffusion model to the IRT framework, followed by its generalization by van der Maas, Molenaar, Maris, Kievit, and Borsboom (2011). Broadly speaking, there exist two typical occasions for which IRT models are used for psychological measurement: personality measurement, e.g., for the Big-Five traits, and ability measurement, e.g., for college entrance examinations. Corresponding to these, two types of diffusion IRT models with different functional forms have been developed: the D- and Q-diffusion IRT models for the former and latter, respectively (Tuerlinckx & De Boeck, 2005; van der Maas et al., 2011). Because our interest is in personality measurement, we consider the D-diffusion IRT model in this paper.

Let xij be the binary observed item response that takes values of 1 when respondent i chooses the first option for item j (or the accumulation reaches the upper bound) and 0 when the second option is chosen (the accumulation reaches the lower bound). Moreover, let tij be the corresponding response time. When xij and tij are generated by the diffusion model, the joint distribution of the item response (xij) and response time (tij) can be expressed as

$$ \begin{array}{@{}rcl@{}} f(x_{ij},t_{ij})&=&\frac{\pi s^{2}}{a_{ij}^{2}}\exp\left( \frac{(a_{ij} x_{ij} - z_{ij})v_{ij}}{s^{2}}-\frac{v_{ij}^{2}}{2s^{2}}(t_{ij}-\tau_{j})\right)\\ &\times& \sum\limits_{m=1}^{\infty}m\sin\left( \frac{\pi m(a_{ij} x_{ij}-2z_{ij}x_{ij}+z_{ij})}{a_{ij}}\right)\\ &\times&\exp\left( -\frac{1}{2}\frac{\pi^{2}s^{2}m^{2}}{a_{ij}^{2}}(t_{ij}-\tau_{j})\right). \end{array} $$
(1)

In typical personality measurement, it would be reasonable to assume that there is no response bias at the starting point. Under the assumption that zij is set to aij/2, which indicates no initial response bias, the probability that respondent i chooses the first option for the j-th item is given as

$$ P(x_{ij}=1)=\frac{\exp(-2z_{ij}v_{ij})-1}{\exp(-2a_{ij} v_{ij})-1}=\frac{\exp(a_{ij} v_{ij})}{1+\exp(a_{ij} v_{ij})}, $$
(2)

which is a form of the logistic curve.

In the D-diffusion IRT model, the diffusion model parameters are decomposed into the respondent and item IRT parameters. Specifically, the diffusion model parameters are expressed as

$$ \left\{\begin{array}{ll} a_{ij}=\displaystyle\frac{\gamma_{i}}{\xi_{j}} &\text{with} \gamma_{i}, \xi_{j} \in \mathbb{R}_{>0} \\ v_{ij}=\theta_{i}-b_{j} &\text{with} \theta_{i}, b_{j} \in \mathbb{R}. \end{array}\right. $$
(3)

Notably, vij is expressed as the difference between the item difficulty and the respondent’s latent trait, hence the name D-diffusion IRT. Further, the distance between the upper and lower boundaries, aij, is given as the ratio of the respondent properties (e.g., attentiveness and deliberateness) and the item properties (e.g., complexity and length). Such a separation between item and respondent parameters is a key characteristic of IRT models. This is in contrast to the original diffusion model, in which the separation between the two was not explicitly taken care of.

As a result, Eq. 2 is rewritten as

$$ P(x_{ij}=1|\theta_{i},\gamma_{i})=\frac{\exp\left( \displaystyle\frac{\gamma_{i}}{\xi_{j}} (\theta_{i}-b_{j})\right)}{1+\exp\left( \displaystyle\frac{\gamma_{i}}{\xi_{j}} (\theta_{i}-b_{j})\right)}, $$
(4)

which is the form of the two-parameter logistic IRT model except that the discrimination parameter γi/ξj depends on both the respondent and item. τj retains the same meaning as in the original diffusion model, and we considered τj as an item parameter in this study. Owing to a property of the diffusion model, the expected response time is the longest when vij = 𝜃ibj = 0. This means that, when the respondent’s latent trait is closer to the item difficulty, the expected response time becomes longer, which is the expectation of the distance–difficulty hypothesis. In this way, the D-diffusion IRT model generates important predictions for personality measurement.

Thurstonian IRT model

As previously stated, the Thurstonian IRT model (Brown & Maydeu-Olivares, 2011) extends Thurstone’s Law of Comparative Judgment (Thurstone, 1927) to the IRT framework. Let j1 and j2 be paired statements presented as the j-th item, and let xij be the observed binary response of respondent i that takes values of 1 when j1 is chosen and 0 otherwise. In addition, let \(u_{ij_{1}}\) and \(u_{ij_{2}}\) be the latent utilities of respondent i for j1 and j2, respectively. In the Thurstonian IRT model, the observed response is modeled to be determined by the difference between \(u_{ij_{1}}\) and \(u_{ij_{2}}\) of that respondent; that is,

$$ x_{ij} = \left\{\begin{array}{ll} 1 & \text{if} u_{ij_{1}} \geq u_{ij_{2}} \\ 0 & \text{if} u_{ij_{1}} < u_{ij_{2}}. \end{array}\right. $$
(5)

This means that the observed item response depends on the sign of the difference between the latent utilities: \(x_{ij}^{*} = u_{ij_{1}}-u_{ij_{2}}\). In the 2MFC measurement of personality, each statement is designed to measure different trait dimensions; for instance, the first statement measures emotional stability, and the second statement measures extraversion. Let \(m_{j_{1}}\) and \(m_{j_{2}}\) denote the latent trait dimensions that are measured by j1 and j2, respectively, where M is the total number of latent traits to be measured in the questionnaire, \(1 \le m_{j_{1}} \le M, 1 \le m_{j_{2}} \le M\), and \(m_{j_{1}} \neq m_{j_{2}}\).

Here, \(u_{ij_{1}}\) and \(u_{ij_{2}}\) can be written as the Thurstonian factor model:

$$ \begin{array}{@{}rcl@{}} u_{ij_{1}} &=& \mu_{j_{1}} + \lambda_{j_{1}} \eta_{im_{j_{1}}} + \epsilon_{ij_{1}}, \\ u_{ij_{2}} &=& \mu_{j_{2}} + \lambda_{j_{2}} \eta_{im_{j_{2}}} + \epsilon_{ij_{2}}, \end{array} $$
(6)

where \(\mu _{j_{k}}\) represents the mean of the latent utility \(u_{ij_{k}}\), \(\lambda _{j_{k}}\) represents a factor loading, \(\eta _{im_{j_{k}}}\) represents the latent trait of respondent i, and \(\epsilon _{ij_{k}}\) represents the residual or unique factor. Both ηi and 𝜖i are assumed to be (multivariate) normally distributed.

In order to facilitate understanding of the notation, let us consider the example in Table 3, in which M = 5 latent traits are measured using J = 25 items (i.e., pairs of statements). In item 2, the first and second statements measure agreeableness and intellect/imagination, respectively. They correspond to the third and fifth traits (as indicated in the footnote of the table). Therefore, in this case, \(m_{2_{1}}=3\), and \(m_{2_{2}}=5\). As a result, Eq. 6 for item 2 in Table 3 is given as

$$ \begin{array}{@{}rcl@{}} u_{i2_{1}} &=& \mu_{2_{1}} + \lambda_{2_{1}} \eta_{i3} + \epsilon_{i2_{1}}, \\ u_{i2_{2}} &=& \mu_{2_{2}} + \lambda_{2_{2}} \eta_{i5} + \epsilon_{i2_{2}}. \end{array} $$
(7)

Brown and Maydeu-Olivares (2011) proposed their Thurstonian IRT model by reparameterizing the above second-order Thurstonian factor model. Specifically, its item response function is given by

$$ P(x_{ij}=1|\boldsymbol{\eta}_{i}) = \boldsymbol{\Phi}\left( \frac{\lambda_{j_{1}} \eta_{im_{j_{1}}}-\lambda_{j_{2}} \eta_{im_{j_{2}}} -\delta_{j} }{\sqrt{\psi^{2}_{j_{1}}+\psi^{2}_{j_{2}}}}\right), $$
(8)

where Φ(⋅) denotes the cumulative distribution function of the standard normal distribution, δj represents the difference between the means of the latent utility \(\mu _{j_{k}}\), and \(\psi ^{2}_{j_{k}}\) represents the variance of the unique factor \(\epsilon _{ij_{k}}\). By using the reparameterization

$$ \alpha_{j}=\frac{\delta_{j}}{\sqrt{\psi^{2}_{j_{1}}+\psi^{2}_{j_{2}}}}, {\upbeta}_{j_{1}} = \frac{\lambda_{j_{1}}}{\sqrt{\psi^{2}_{j_{1}}+\psi^{2}_{j_{2}}}}, {\upbeta}_{j_{2}} = \frac{\lambda_{j_{2}}}{\sqrt{\psi^{2}_{j_{1}}+\psi^{2}_{j_{2}}}}, $$
(9)

Eq. 8 can be rewritten as

$$ \begin{array}{@{}rcl@{}} P(x_{ij}=1|\boldsymbol{\eta}_{i}) = \boldsymbol{\Phi} \left( {\upbeta}_{j_{1}}\eta_{im_{j_{1}}}-{\upbeta}_{j_{2}}\eta_{im_{j_{2}}}-\alpha_{j}\right), \end{array} $$
(10)

which corresponds to the form of the two-dimensional normal ogive IRT model. When an item consists of three or more statements, i.e., when we need to include comparisons between more than two statements simultaneously, αj and \({\upbeta }_{j_{k}}\) become mathematically dependent parameters. However, when an item consists of only two statements, which is the case considered in this paper, we can simply estimate them as free parameters (Brown & Maydeu-Olivares, 2011).

Thurstonian D-diffusion IRT model

Obviously, Eqs. 4 and 10 have similar functional forms, as a logistic function can be considered as an approximation to the normal ogive function. In addition, both respondent parameters, 𝜃i in the diffusion IRT model and ηi in the Thurstonian IRT model, have a similar empirical meaning and the same distributional form. Therefore, when combining these two models, we can simply replace (𝜃ibj) in Eq. 4 with (\({\upbeta }_{j_{1}}\eta _{im_{j_{1}}}-{\upbeta }_{j_{2}}\eta _{im_{j_{2}}}-\alpha _{j}\)) in Eq. 10. However, these two models have different forms for the discrimination parameter. In the diffusion IRT model, the discrimination parameter is the quotient of the respondent parameter γi divided by the item parameter ξj. Therefore, in the diffusion IRT model, the discriminability depends on both item factors such as the statement length and time limit and respondent factors such as the response caution and temper (van der Maas et al., 2011). On the other hand, the slope and intercept parameters in the original Thurstonian IRT model are statement and item parameters, respectively. Therefore, the discriminability in the Thurstonian IRT model is not affected by respondent factors.

On the basis of the above fact, in introducing the proposed model, we start from the Thurstonian IRT model and extend it so that the slope and intercept parameters in Eq. 10 depend on both the item and respondent factors. The Thurstonian IRT model maintains its identifiability even if we simply multiply \(({\upbeta }_{j_{1}}\eta _{im_{j_{1}}}-{\upbeta }_{j_{2}}\eta _{im_{j_{2}}}-\alpha _{j})\) in Eq. 10 by (ξj/γi) because \({\upbeta }_{j_{k}}\) is a statement parameter; that is, each statement independently has its own \({\upbeta }_{j_{k}}\). This parameter may be interpreted as the discriminability of the specified statement. On the other hand, ξj is an item parameter that is specific to the pair of statements. This means that ξj represents an effect that is inherent to the paired statements that are simultaneously presented to respondents.

The above arguments lead us to introduce the proposed Thurstonian D-diffusion IRT model, which we develop as an extension of the D-diffusion IRT model. Specifically, in the proposed model, the joint distribution of the item response and response time is given as Eq. 1. However, in its marginalized IRT-type form of Eq. 2, we reparameterize the two parameters aij and vij in the same manner as in the case of Thurstonian IRT as

$$ \left\{\begin{array}{ll} a_{ij}=\displaystyle\frac{\gamma_{i}}{\xi_{j}} &\text{with} \gamma_{i}, \xi_{j} \in \mathbb{R}_{>0} \\ v_{ij}={\upbeta}_{j_{1}}\eta_{im_{j_{1}}}-{\upbeta}_{j_{2}}\eta_{im_{j_{2}}}-\alpha_{j} &\text{with} \eta_{im_{j_{k}}}, {\upbeta}_{j_{k}}, \alpha_{j} \in \mathbb{R}, \end{array}\right. $$
(11)

instead of Eq. 4.

Then, the joint likelihood of the item response and response time in the proposed model is given as

$$ \begin{array}{@{}rcl@{}} f(x_{ij},t_{ij})\!&=&\!\frac{\pi s^{2}{\xi_{j}^{2}}}{{\gamma_{i}^{2}}}\exp\left( \frac{\gamma_{i}({\upbeta}_{j_{1}}\eta_{im_{j_{1}}} - {\upbeta}_{j_{2}}\eta_{im_{j_{2}}}\!-\alpha_{j})(2x_{ij} - 1)}{2s^{2}\xi_{j}}\right.\\&&\left.-\frac{({\upbeta}_{j_{1}}\eta_{im_{j_{1}}}-{\upbeta}_{j_{2}}\eta_{im_{j_{2}}}-\alpha_{j})^{2}(t_{ij}-\tau_{j})}{2s^{2}}\right)\\ &&\times \sum\limits_{m=1}^{\infty}m\sin\!\left( \frac{\pi m}{2}\right)\exp\left( \!-\frac{1}{2}\frac{\pi^{2}s^{2}m^{2}{\xi_{j}^{2}}}{{\gamma_{i}^{2}}}(t_{ij}-\tau_{j})\!\right).\\ \end{array} $$
(12)

In practice, Eq. 12 cannot be directly calculated because it involves the sum of an infinite series. Therefore, Navarro and Fuss’s (2009) approximation is commonly used for its evaluation. This approach is also employed in stan, which is the software we used in this study.

From Eq. 12, the probability of choosing the first statement, which is a counterpart of both Eqs. 4 and 10, is then given as

$$ P(x_{ij} = 1|\boldsymbol{\eta}_{i}) = \frac{\exp\left( \displaystyle\frac{\gamma_{i}}{\xi_{j}}\left( {\upbeta}_{j_{1}}\eta_{im_{j_{1}}}\!-{\upbeta}_{j_{2}}\eta_{im_{j_{2}}}-\alpha_{j}\right)\right)}{1+\exp\left( \displaystyle\frac{\gamma_{i}}{\xi_{j}}\left( {\upbeta}_{j_{1}}\eta_{im_{j_{1}}}\!-{\upbeta}_{j_{2}}\eta_{im_{j_{2}}}-\alpha_{j}\right) \right)}. $$
(13)

In this study, we assume that the researcher already knows the keyed direction of each statement because this is natural in personality measurement using existing scales. Therefore, the sign of \({\upbeta }_{j_{k}}\) is known. We first estimate \({\upbeta }_{j_{k}}\) under the restriction of positivity. Then, we manually invert the sign of the obtained negatively keyed item parameter estimates in order to facilitate their interpretations.

Brown and Maydeu-Olivares (2011) recommends designing a forced-choice questionnaire by preparing both types of items that are keyed in the same direction (e.g., both statements are positively keyed) and those in the opposite direction (one statement is positively keyed and the other negatively keyed) in order to ensure accurate measurement of traits. We recommend the same in the proposed approach.

When the number of latent traits measured in a 2MFC questionnaire is two (M = 2), the proposed model, as well as the Thurstonian IRT model, suffers from the identification problem. This can be readily understood because in this case, all statements have nonzero factor loadings to all traits, which is essentially the case of an exploratory factor analysis. Brown and Maydeu-Olivares (2011, 2012) dealt with this problem by fixing the factor loadings of the first two statements to their true values in the simulation study. Obviously, this approach is not realistic in practice because a researcher never knows the true factor loading parameter values. Therefore, in this paper, we only consider the case when M > 2 and recommend applying the proposed model for measuring more than two traits.

The Bayesian model is completed by placing prior distributions over the parameters. We use the following priors throughout this paper:

$$ \begin{array}{ll} \xi_{j} \sim t_{[0,\infty)}(4,0,2.5), & \gamma_{i} \sim LN(0,1), \\ {\upbeta}_{j_{k}} \sim t_{[0,\infty)}(4,0,2.5), & \boldsymbol{\eta}_{i} \sim MVN(\mathbf{0},\boldsymbol{\Sigma}), \\ \alpha_{j} \sim N(0,2.5), & \tau_{j} \sim U(0, \min (\text{RT})_{j}), \\ & \boldsymbol{\Sigma}\sim LKJCorr(1), \end{array} $$
(14)

where \( {\min \limits } (\text {RT})_{j}\) represents the minimum observed response time for each item j. Our basic principle in specifying the priors here is to place standardized distributions for the respondent parameters and weakly informative priors that give very low probabilities to very implausible parameter values for item parameters. The priors for the respondent parameters are standardized for identification reasons. Hence, the variance of each latent trait is fixed to 1. This means that the covariance matrix Σ reduces to a correlation matrix. Hence, we call Σ as the correlation matrix hereafter and denote its elements by \(\rho _{mm^{\prime }}\). Moreover, γi follows the standard lognormal distribution, and the variance of the drift rate (s2) is set to be 1. The LKJ correlation distribution (Lewandowski, Kurowicka, & Joe, 2009) for the correlation matrix here corresponds to a uniform prior over all possible M × M correlation matrices (Stan Development Team, 2017). For the priors of the item parameters, we adopted the half−t distribution for ξj and \({\upbeta }_{j_{k}}\) as weakly informative priors, which are recommended by Gelman (2006) and Stan Development Team (2017).

The parameters of the proposed Thurstonian D-diffusion IRT model can be estimated using the Markov Chain Monte Carlo (MCMC) algorithm. For all of the estimation results presented below, we used R 3.5.0 and stan 2.17.3 on a Windows 10 PC. Three MCMC chains were run for each dataset. The number of MCMC iterations per chain was 10,000, half of which were discarded as warm-up. The data and computer codes (Stan and R) that we used for real data application are available from the Open Science Framework website (https://osf.io/jswqg/). We used the posterior means of the parameters as their point estimates.

Simulation study

We conducted a simulation study to check the parameter recovery properties of the proposed model. We generated simulation data from the Thurstonian D-diffusion IRT model with some narrower distributions described below and estimated the parameter values from the generated data with the conventional priors described in the previous section.

We considered the following five scenarios in this simulation:

  • The first scenario manipulated the two crossed factors—the number of trait dimensions M = (3,4) and the number of items that consist of the same pairs of dimensions Jpair = (3,5,7)—to examine their effects on parameter recovery. Note that the total number of items J is related to these factors by

    $$ J=J_{pair} \times \frac{M(M-1)}{2}. $$
    (15)

    The trait dimensions were independent of one another in this scenario and the second scenario.

  • The second scenario examined the influence of the number of dimensions M when the questionnaire length was fixed. For this purpose, we manipulated the number of dimensions M = (3,4) under conditions where the total number of items J = (12,24,36).

  • The third and fourth scenarios examined if the proposed model can properly recover the parameter values even when the dimensions are correlated. The factor examined in the third scenario was the correlations between the dimensions, which correspond to the off-diagonal elements of Σ. All of the off-diagonal elements were set to have the same values, which we denote by ρpair. Its three manipulated conditions were ρpair = (− 0.3,0.3,0.5). In this scenario, we set M = 3 and Jpair = 5.

  • The fourth scenario considered more realistic conditions and attempted to emulate the Big-Five factor structure. Specifically, the correlation matrix Σ was specified on the basis of Brown and Maydeu-Olivares (2011) as

    $$ \boldsymbol{\Sigma} = \left( \begin{array}{ccccc} 1 & & & & \\ -.21 & 1 & & & \\ 0 & .40 & 1 & & \\ -.25 & 0 & 0 & 1 & \\ -.53 & .27 & 0 & .24 & 1 \end{array} \right). $$

    We denote this as ρpair = BF for simplicity. In this scenario, M = 5, and Jpair was manipulated for three conditions, which were Jpair = (3,5,7).

  • The fifth scenario examined the influence of the complexity (cognitive workload) of the items. In this scenario, we manipulated the true distribution of ξj to be either U(0.2,0.3),U(0.4,0.5), or U(0.6,0.7). Here, we fixed M = 3,Jpair = 5 and assumed no correlations among traits.

This results in a total of 21 conditions. The number of respondents was kept constant at I = 300 for all conditions. In each condition, the cycle of random data generation and parameter estimation was replicated 30 times.

Data generation

The distributions used to generate random data from the proposed model were as follows: \(\gamma _{i} \sim LN(0,1)\), \({\upbeta }_{j_{k}} \sim U(0.5,1.5)\), \(\boldsymbol {\eta }_{i} \sim N(\mathbf {0},\boldsymbol {\Sigma })\), \(\alpha _{j} \sim U(-1.5,1.5)\), and \(\tau _{j} \sim U(0.2,1)\). With regard to ξj, we drew its samples from U(0.3,0.7) in the first to fourth scenarios. These distributions are chosen on the basis of our preliminary analysis, Brown and Maydeu-Olivares (2011) and Bunji and Okada (2019). Note that the off-diagonal elements of the correlation matrix Σ were treated as separate parameters and estimated as such even when the true correlations had the same values. Moreover, for a practical reason,Footnote 1 observations with a response time greater than 120 s were listwise deleted. We considered this manipulation to be acceptable because an observation with such a large response time may be considered as an irregular response.

Results

Tables 1 and 2 summarize the means of the root mean square errors (RMSEs) and the means of the biases for each condition, respectively. The RMSE for the parameter β, for example, is calculated by

$$ \text{RMSE}_{\upbeta}= \sqrt{\frac{1}{J}\frac{1}{2}\sum\limits_{j=1}^{J}\sum\limits_{k=1}^{2}(\hat{\upbeta}_{j_{k}}-{\upbeta}_{j_{k}})^{2}}, $$
(16)

where \({\upbeta }_{j_{k}}\) is the realized value that is generated from U(0.5,1.5) in each iteration and \(\hat {\upbeta }_{j_{k}}\) is its expected a posteriori estimate. Similarly, the bias for the parameter η, for example, is calculated by

$$ \text{bias}_{\eta}= \frac{1}{I}\frac{1}{M}\sum\limits_{i=1}^{I}\sum\limits_{m=1}^{M}(\hat{\eta}_{im}-\eta_{im}). $$
(17)

Since γi is lognormally distributed, the RMSEs and biases of the log-transformed estimates for γ are presented. The RMSEs for both β and ρ are acceptably small when M = 3 and Jpair = 3, even when J is as small as 9. These results suggest the proposed model is capable of appropriately estimating the trait loadings \({\upbeta }_{j_{k}}\) and the correlations between traits \(\rho _{mm^{\prime }}\).

Table 1 Mean RMSEs of the parameter estimates in the simulation study
Table 2 Mean biases of the parameter estimates in the simulation study

In the second scenario, the RMSEs for η become larger when the number of traits increases. Therefore, this suggests that a smaller number of traits would provide greater accuracy in the estimation of ηim. Further, since the RMSE decreases as J increases, it would be desirable to have many items in the questionnaire if possible.

The RMSEs for η might look rather large for all conditions. However, these values are actually comparable in scale with the values reported by Stone (1992) when the number of items for each factor is larger than 10 (for example, the number of items for each factor is 10 when M = 3 and Jpair = 5).

A similar estimation accuracy is obtained when nonzero correlations exist between traits, which corresponds to the third and fourth scenarios. In addition, the trait correlations are successfully recovered for all conditions.

From the results of the fifth scenario, we can see that the estimation accuracy of ηim increases as the true distribution of ξj shrinks. In other words, given the same number of items, the estimation accuracy tends to be better as the items require more cognitive workload. This relationship is consistent with the specifications of the diffusion IRT model, in which the item discrimination is equal to the boundary parameter, because a smaller ξj leads to higher discrimination.

The means of the biases for ξ and \(\log (\gamma )\) are positive for almost all conditions. They become larger according to the increase in the total number of items J. In the same manner, the mean RMSEs for these parameters also become larger in accordance with the increase in J. The reason for these results can be understood from the form of the parameter constraints, as described below. ξj and γi originally have a mutual sign indeterminacy because they are given in the form of a quotient. To avoid this indeterminacy, a parameter constraint needs to be introduced. In this study, we chose the prior distribution for γi to be the standard lognormal distribution. The effect of the priors tends to be reduced, corresponding to the increase in the data size, and this may explain the result that larger estimates for both ξj and γi tend to be obtained for larger J. However, we note that these biases and RMSEs are still practically sufficiently small to be acceptable. Moreover, they do not affect the relative order of the respondents or items effects, which should be more practically important. Furthermore, the biases for the other parameters are nearly zero and neither systematically positive nor negative.

Sensitivity analysis

In order to check that our prior specification does not actually have a strong influence on parameter estimation, we conducted a sensitivity analysis. In this sensitivity analysis, we considered the second scenario and changed the priors for the item parameters to be more diffuse ones. Specifically, the priors for the parameters ξj, \({\upbeta }_{j_{k}}\), and αj were set as a diffuse N(0,100) prior, and parameter recovery was comparatively checked. The other conditions of the simulation were the same as before. The results suggest that the change in the priors has only a minor influence on the parameter estimates (see Table A1 in the Online Supplementary Material). Considering the estimation efficiency, we think that it is practically adequate to employ the prior specification of Eq. 14, provided that the sample size is not very small.

To summarize the simulation results, the proposed Thurstonian D-diffusion IRT model is found to have sufficient parameter recovery properties.

Real data application: Big-Five data

In this section, we demonstrate the application of the proposed model to real 2MFC format personality measurement data, which we collected along with the response time information.

Data

We collected data from a sample of 500 Japanese respondents through CrowdWorks, a major online crowdsourcing service in Japan. The sample consists of 232 males and 262 females (the remaining six did not answer the question), and their ages range from their 20s to their 70s. The survey was conducted in an online survey environment that we developed with the jsPsych library (de Leeuw, 2015). Online informed consent was obtained from all participants. After data collection, the observations from one respondent were eliminated owing to system trouble. As a result, we applied the proposed model and Thurstonian IRT model to the data with a sample size of 499. In addition, approximately 0.1% of responses for which the response times were shorter than 300 ms were listwise deleted. Woodworth and Schlosberg (1954, Chapter 2) indicate that the minimum response time (they use the term latency) for simple visual tasks is approximately 180 ms. However, because the cognitive comparison task of the current study requires a higher cognitive load, more time would be required to answer the task used in this study. Therefore, we set the lower cutoff to 300 ms.

The items we used originate from the Japanese version of the Big-Five factor marker questionnaire (Apple and Neff, 2012). This scale is intended to measure the Big-Five traits, which are emotional stability, extraversion, agreeableness, conscientiousness, and intellect/imagination. Each trait is measured by 10 statements, which add up to a total of 50 statements. In order to rearrange them into the 2MFC format, we constructed 25 pairs of items from the 50 statements without duplication. The pairs were carefully designed in order to maintain balance among pairs of traits and to include pairs of statements that are keyed both in the same and opposite directions. Table 3 summarizes the resultant (English-translated) items used in this study. Respondents were required to select the statement of the pair that better represents themselves for all 25 items. The order of the items and the order of statements in each item were randomized across respondents.

Table 3 List of items (pairs of statements) used in this study

Parameter estimation

We comparatively estimated the parameters for two models: the Thurstonian IRT model, which only uses the item responses as observed variables, and the proposed model, which uses both the item responses and their response times as observed variables. For both models, the number of MCMC iterations per chain was 10,000, half of which were discarded as warm-up samples. The priors used in the Thurstonian IRT model were a subset of the priors for the proposed model as follows:

$$ \begin{array}{@{}rcl@{}} {\upbeta}_{j_{k}} \sim t_{[0,\infty)}(4,0,2.5), ~~\qquad& \boldsymbol{\eta}_{i} \sim MVN(\mathbf{0},{\Sigma}),\\ \alpha_{j} \sim N(0,2.5), ~~~~~~~~~~~~~\qquad & {\Sigma} \sim LKJCorr(1). \end{array} $$
(18)

Results

Table 4 summarizes the posterior means of the item parameters estimated by both models. Apparently, the point estimates from the Thurstonian IRT model are several times larger than those of the proposed model. A major reason for this is that there exist three types of parameters relates to the discriminability (\({\upbeta }_{j_{k}}\), ξj, and γi) in the proposed model. In fact, the item parameters obtained by the proposed model become similar to those obtained by the Thurstonian IRT model when they are divided by ξj. This is because all values of γi follow LN(0,1) a priori, and the expected value of a random variable that follows LN(0,1) is 1.

Table 4 Posterior means of the item parameter values in the Thurstonian IRT and proposed models

The correlations between the estimates from the two models are .995 for αj and .902 for \({\upbeta }_{j_{k}}\). Therefore, it would be appropriate to consider that the item parameters of the proposed model have similar interpretations as those of the Thurstonian IRT model.

Figure 2 shows a scatter plot between the Thurstonian IRT and proposed models in terms of the first trait estimates ηi1, which correspond to the extraversion trait. The first trait was chosen for illustration purposes, and similar tendencies were observed for all five traits. Table 5 summarizes the correlations between the respondent parameter estimates, both within and between the models. All of the between-model correlations for the same trait are larger than .95, which can be interpreted that very similar respondent parameter estimates are obtained between the two models. Moreover, as for the correlations between traits within a model, similar correlation patterns are obtained for both models, except that they are slightly smaller for the proposed model.

Fig. 2
figure 2

Scatter plot between the ηi estimates from the Thurstonian IRT and proposed models with regard to trait 1 (extraversion). Red line indicates the line of y = x

Table 5 Estimated correlation matrix in the real data application

Figure 3 shows the posterior distribution of ξj for each item. A clear between-item difference can be seen in this figure. In order to check their validity, we calculated the correlation between the posterior means of ξj and the item mean of the readability scores of each statement. The readability score, which quantifies the difficulty in reading the sentence, was computed by jReadability (Hasebe & Lee, 2015, https://jreadability.net/sys/);. The resulting correlation was .539 (95% CI [.162, .751]). This positive correlation can be evidence that ξj reflects empirical item characteristics such as the reading difficulty.

Fig. 3
figure 3

Posterior distribution of ξj for each item. The gray area indicates the 95% CI. The indices of the items correspond to those in Tables 3 and 4

Figure 4 shows two scatter plots between the boundary-related parameter estimates (γi and ξj) and the mean response time. It can be seen that there exist strong associations between the two quantities; the obtained correlation coefficients were .929 and − .716, respectively. These strong relationships between the observed response time and the parameter estimates are consistent with the results of Ratcliff, Thompson, and Mckoon (2015), Tuerlinckx, Molenaar, and Maas (2016). As shown in Eq. 11, the boundary of the diffusion process becomes larger according to the increase in γi. This means that a respondent with a higher value of γi needs more information accumulation to respond to an item. As a result, such a respondent tends to have a longer response time. This relationship is properly reflected in the left panel of Fig. 4. The opposite is true for ξj, which is the denominator of the boundary.

Fig. 4
figure 4

Left panel = scatter plot between the estimate γi and the mean response time for each respondent. Right panel = scatter plot between the estimate of ξj and the mean response time for each item

Next, we examined how the estimates from the proposed model reflect the distance–difficulty relationship. For this purpose, Fig. 5 shows scatter plots between the respondent’s proximity to the comparison threshold (x axis), which is given by \({\upbeta }_{j_{1}}\eta _{im_{j_{1}}}-{\upbeta }_{j_{2}}\eta _{im_{j_{2}}}-\alpha _{j}\), and the observed response time (y axis). In the original diffusion IRT model, the expected response time is given by (Tuerlinckx et al., 2016)

$$ E(t_{ij}) \approx \tau_{j}+\frac{1}{2|\theta_{i} - b_{j}|} \frac{\gamma_{i}}{\xi_{j}}. $$
(19)

From this formula, we can easily obtain the expected response time for the proposed model, i.e.,

$$ E(t_{ij}) \approx \tau_{j}+\frac{1}{2|{\upbeta}_{j_{1}}\eta_{im_{j_{1}}}-{\upbeta}_{j_{2}}\eta_{im_{j_{2}}}-\alpha_{j}|} \frac{\gamma_{i}}{\xi_{j}}. $$
(20)

It is seen in the left panel of Fig. 5, which shows the estimates from the Thurstonian IRT model, that the response time tends to be large when the latent positions of the respondent and item parameters are close to each other. This is no surprise given that the distance–difficulty relation is well-known in personality measurement. However, this tendency is more clearly evident when the parameters are estimated by the proposed model (right panel). This is because the proposed model explicitly accounts for the distance–difficulty relationship by separating the respondent and item factors that affect the response time, such as the psychological traits to be measured and the item complexity.

Fig. 5
figure 5

Scatter plots between the latent respondent position relative to the item parameters (x axis) and the observed response time (y axis). Left panel = results for the Thurstonian IRT model. Right panel = results for the proposed model (y-axis values are divided by γi/ξj in order to adjust the scale to the left panel)

Lastly, we examined how the response time information is utilized for estimating ηi in the proposed model. In the proposed model, the parameter ηi is estimated on the basis of both the item response and response time, whereas in the existing Thurstonian IRT model, the response time information is not used. In order to examine the unique effect of the response time in the proposed model, we first calculated the mean signed standardized response time (MSSRT) for each respondent and each trait by the following procedure:

  1. 1.

    Prepare all 10 item responses and response times (RTs) for each statement of a trait.

  2. 2.

    Standardize the response time using the whole-sample mean and standard deviation of the item. This yields the standardized RT.

  3. 3.

    When a statement is positively keyed and not chosen or when it is negatively keyed and chosen, make the sign of its standardized RT negative. This yields the signed standardized RT.

  4. 4.

    Finally, compute the mean of the signed standardized response time for all 10 items. This is the MSSRT.

If the statement that represents a trait is positively keyed and chosen, the signed standardized RT becomes smaller when it is quickly chosen. If it is positively keyed but not chosen, the signed standardized RT becomes smaller when it takes long time to decide not to choose it. The opposite is true when the statement is negatively keyed. That is, the signed standardized RT becomes smaller if a negatively keyed statement is slowly chosen or quickly avoided.

Here, a possible relationship to be found would be the following. For a smaller MSSRT, the estimate obtained by the proposed model tends to be larger compared to that of the Thurstonian IRT model. In other words, a negative correlation would be expected between the MSSRT and the difference of ηi estimates between the proposed model and the Thurstonian IRT model. Figure 6 shows the scatter plot of the MSSRT and the difference between the two model estimates of ηi. A negative correlation is evident between the two quantities, and little difference among traits was found. The overall correlation was − 0.483 (95% CI [−.513, − .452]). This result suggests that the proposed model successfully incorporates the response time information to supplement the estimation of latent traits.

Fig. 6
figure 6

Scatter plot of the mean signed standardized response time (x axis) and the difference of trait estimates ηi between the proposed model and the Thurstonian IRT model (y axis)

Discussion

The main objective of this study is to propose and examine the Thurstonian D-diffusion IRT model, which is an extension of the Thurstonian IRT model for 2MFC measurement to incorporate the response time information. To achieve this goal, we base our model on the diffusion IRT model, which is a representative cognitive psychometric model, and reparameterized its parameters to match the idea of the Thurstonian IRT model. A simulation study has shown that the parameters of the proposed model can be sufficiently recovered in typical application settings. In the application to Big-Five measurement data, several interesting findings have emerged. The obtained item parameter estimates from the proposed model were similar to those of the Thurstonian IRT model, except for the theoretically expected scale differences. As for the respondent parameters (η), the estimates of the proposed model were also similar to those of the Thurstonian IRT, but there is a difference, which was explained by the distance–difficulty relationship. Because this relationship is believed to be generally applicable in personality measurement, we believe that personality estimation based on the proposed Thurstonian D-diffusion IRT model should be meaningful.

When one trait to be measured is clearly dominant over others for a respondent, this respondent is expected to rapidly choose the statements of this trait. In this way, the proposed model should be able to reflect the degree of dominance of one trait over others. Such usage of response time information is not possible for existing Thurstonian IRT models that only use the item response as observations.

According to Bertling and Weeks (2018, p. 328) the motivation to utilize the response time information in conjunction with the item response can be classified into two types. The first is “to obtain more accurate proficiency level estimates,” and the second is “to estimate examinee performance on a separate latent trait.” Although our situation may be different in that the current study considers personality measurement while Bertling and Weeks (2018) considered ability measurement, we believe that the proposed model is closer to the latter case. That is, the proposed Thurstonian D-diffusion IRT model estimates the item and respondent parameters on the basis of psychological theory models of comparative judgment, the diffusion process, and the distance–difficulty relationship. As a result, the obtained latent traits would be well-isolated from the effects of factors such as the respondents’ deliberateness and item length. Moreover, according to Kahneman (2011), comparative judgment, which involves a slower and more deliberative process, better reflects the cognitive component than single evaluations, which often reflect the intensity of an unstable emotional response. Meanwhile, in the diffusion model, the discrimination parameter correspond to the degree of response caution (boundary); a response that has a longer response time better reflects one’s latent trait. Such a correspondence between these two theories suggests the relevance of this study’s contribution to extend the diffusion model for Thurstonian judgment.

By introducing the approach of cognitive psychometrics, the parameters of the proposed model can quantify different cognitive subskills separately, such as the speed of information uptake and the degree of response caution. This is in contrast to the classical latent variable in psychometrics, which may be an unknown composite of cognitive processes (Riefer, Knapp, Batchelder, Bamber, & Manifold, 2002). Although we have not fully investigated the empirical relations between these parameter estimates and other exogenous variables, we consider such a study based on experiments based on an experiment (e.g., Ratcliff & Rouder, 1998; Voss, Rothermund, & Voss, 2004) would be of great interest.

Another advantage of the application of cognitive psychometric models is that they can incorporate the properties of a rather complex measurement environment such as the time pressure. For example, Milosavljevic, Malmaud, Huth, Koch, and Rangel (2010) revealed that when a respondent is under a high time pressure condition, the mean response time becomes shorter than under the low time pressure condition; correspondingly, the boundary parameter estimates in the diffusion model become smaller. For this type of high time pressure, the responses become faster and, at the same time, more inaccurate. The diffusion-based models can appropriately account for such effects of the time pressure (e.g., Voss et al., 2004). Such an explanatory power would be one of the advantages of the application of cognitive psychometric models.

Finally, we note a few limitations of the current study. First, a forced-choice item typically provides lower information than a Likert-scale item. As noted by Brown and Maydeu-Olivares (2013), although a five-point Likert-scale item provides four pieces of information, a two-alternative forced-choice item provides only one piece of information. Therefore, in practice, care would be needed when designing the forced-choice test, in terms of the number of items and the number of traits, for example, in order to ensure its reliability. Meanwhile, as we noted in the introduction, one of the major motivations to use the forced-choice format is to reduce the effect of possible biases that originate from the Likert measurement. Recent studies that compared forced-choice Thurstonian IRT measurement and Likert measurement have found evidence that forced-choice measurement has higher validity, even though its obtained information is smaller (e.g., Guenole, Brown, & Cooper, 2018; Lee, Joo, & Lee, 2019). That being said, we have not examined the external validity of the latent traits estimated by the proposed model. Although the obtained parameter estimates are based on psychologically meaningful models, an empirical comparison of the external validity of the trait parameter estimates with existing methods remains an important subject for future study.

In addition, we selected the diffusion model as the base model of the current study. A limitation of the diffusion model is that it is only applicable to items that consist of two statements. When considering more than two statements for an item, we have to use other models as the base model. The linear ballistic accumulation (LBA) model (Brown & Heathcote, 2008) would be a strong candidate in such a case because “For multiple choices between N > 2 alternatives, only the LBA has simple-to-use analytic solutions, making it the preferred choice” (Brown & Heathcote, 2008, p. 173). In fact, the LBA model can be thought to be as popular as the diffusion model in cognitive psychology studies. Recently, Bunji and Okada (2019) have developed an IRT extension of the LBA model. Although this is not a statistically straightforward extension, it would be an important direction for further study to extend the LBA IRT model to incorporate the response time information.