Introduction

A cognitive theory is a general postulation of mechanisms and processes that are globally applicable to families of tasks and types of activities rather than being dependent on a particular task. Cognitive models are specific representations of some or all aspects of a cognitive theory that apply to a particular task or activity (Gonzalez, 2017). Specifically, normative and descriptive theories of choice often rely on utility theory (Savage, 1954; Morgenstern & Von Neumann, 1953) or aim at describing the psychological impact of perceptions of probability and value on choice (Kahneman & Tversky, 1979, 1992). In contrast, models of decisions from experience (DfE) are often dynamic computational representations of sequential choices that are distributed over time and space and that are made under uncertainty (Gonzalez et al., 2017).

Cognitive models of DfE can be used to simulate the interaction of theoretical cognitive processes with the environment, representing a particular task. These models can make predictions regarding how human choices are made in such tasks. These predictions are often compared to data collected from human participants performing the same tasks with interactive tools. The explicit comparison of cognitive models’ predictions to actual human behavior is a common research approach in the cognitive sciences and, in particular, in the study of decision-making (Gonzalez, 2017).

Cognitive models are dynamic and adaptable computational representations of the cognitive structures and mechanisms involved in decision-making tasks, such as DfE tasks under conditions of partial knowledge and uncertainty. Moreover, cognitive models are generative, in the sense that they actually make decisions in ways similar to how humans do, based on their own experience, rather than being data-driven and requiring large training sets. In this regard, cognitive models differ from purely statistical approaches, such as machine learning models, which are often capable of evaluating stable, long-term sequential dependencies from existing data but fail to account for the dynamics of human cognition and human adaptation to novel situations.

There are many models of DfE, as evidenced by past modeling competitions (Erev et al., 2010; Erev et al., 2017). These models often make broadly disparate assumptions regarding the cognitive processes by which humans make decisions (Erev et al., 2010). For example, the models submitted to these competitions are often applicable to a particular task or choice paradigm rather than presenting an integrated view of how the dynamic choice process from experience is performed by humans. Associative learning models are a class of models of DfE that conceptualize choice as a learning process that stores behavior–outcome relationships that are contingent on the environment (Hertwig, 2015). A common example of this type of model is reinforcement learning (RL) (Sutton & Barto, 2018), and the association between DfE and RL is becoming more explicit in the literature (Konstantinidis et al., 2020; Speekenbrink & Konstantinidis, 2015). Generally speaking, these kinds of models rely on learning from reinforcement and the contingencies of the environment, as in the Skinnerian tradition (Skinner, 2014; Sutton & Staw, 1995). These models have been shown to be successful at representing human learning over time based on feedback.

In contrast to many of the associative learning models, instance-based learning (IBL) models rely on a single dynamic decision theory: instance-based learning theory (IBLT) (Gonzalez et al., 2003). IBLT emerged from the need to explain the process of dynamic decision-making, in which a sequence of interdependent decisions is made over time. IBLT provides a single general algorithm and mathematical formulations of memory retrieval that rely on the well-known ACT-R cognitive architecture (Anderson & Lebiere, 2014). The theory proposes a representation of decisions in the form of instances, which are triplets involving a state, an action, and a utility. In general, states are a representation of the features of the situation of the environment in a task, actions are decisions an agent makes in such states, and utilities are the expectations the agent generates or the outcomes the agent receives from performing such actions. The theory also provides a process for the retrieval of past instances based on their similarity to the current decision situation, and for the generation of accumulated value (expectation from experience) based on a mechanism called blending, which is a function of the payoffs experienced and the probability of retrieving those instances from memory (Lebiere, 1999; Lejarraga et al., 2012; Gonzalez & Dutt, 2011).

Initially, IBLT was demonstrated in a highly complex, dynamic decision-making task representing the dynamic allocation of limited resources over time and under time constraints in a “water purification plant” (Gonzalez et al., 2003). Since its inception, many models have been developed based on IBLT, demonstrating human DfE in a large diversity of contexts and domains, from simple and abstract binary choice dynamics (Gonzalez & Dutt, 2011; Lejarraga et al., 2012) to highly specialized tasks such as cyber defense (Aggarwal et al., 2020; Cranford et al., 2020) and anti-phishing detection (Cranford et al., 2019). IBL models have also been created to account for dyadic and group effects, where each individual in a group is represented by an IBL agent (Gonzalez et al., 2015). More recently, the IBL algorithm has been applied to multi-state gridworld tasks (Nguyen & Gonzalez, 2020a, 2020b, 2021b) in which the agents execute a sequence of actions with delayed feedback. The recent applications of IBLT have led to significantly more complex and realistic tasks, where multi-dimensional state-action-utility representations are required, where extended training is common, and where real-time interactivity between models and humans is needed to solve such tasks (Nguyen & Gonzalez, 2021b).

With the increased use of IBLT to generate models of tasks of greater complexity and in multiple domains, it has become clear that the initial, two-decade-old conceptualization of IBLT needs to be updated. As IBLT has evolved, the initial description of the theory has become less precise, in part because no formal implementation of the IBLT process was provided. Thus, a comprehensive description of the current state of the theory, along with a concrete implementation of the whole IBL process, is essential. Moreover, it is important to demonstrate the full capability and generality of IBLT in a single manuscript, which explains and illustrates how models of multiple and diverse decision tasks can be constructed based on the same theory to generate predictions regarding DfE and learning across a wide range of decision-making tasks. Accordingly, the major goal of this paper is to provide an updated view of the theoretical components of IBLT in a comprehensive and precise form. We also provide an open-source, efficient implementation of the full set of mechanisms of IBLT and demonstrate how such an implementation can handle a diverse taxonomy of individual and multi-agent decision-making tasks.

In the process of generating IBL models for more complex tasks that require real-time interactivity between models and humans, we have confronted a practical computational problem: the curse of exponential growth (Bellman, 1957; Kuo & Sloan, 2005). The curse of exponential growth is a common problem in models that rely on the accumulation of data over time and on the computation of approximate value functions represented as arrays and tables, such as RL models (Sutton & Barto, 2018). As summarized in a recent overview of the challenges in multi-agent RL models, even advanced deep reinforcement learning techniques with many successes in Atari, Go, and StarCraft games (Mnih et al., 2013; Silver et al., 2016; Vinyals et al., 2019) suffer severely from the increase in the dimensions of the state-action space, particularly as the number of agents increases (Wong et al., 2021). The problem becomes even more complex in nonstationary environments and under uncertainty, where information is incomplete. Dynamic conditions significantly increase the diversity and number of states that must be represented, as is the case in every dynamic decision-making task (Gonzalez et al., 2017). Thus, this paper also addresses the critical question of how IBL models can tackle the curse of exponential growth of memory.

In summary, we present three main contributions. First, we provide an updated, comprehensive, and precise view of the current theoretical components of IBLT, offering a concrete generic algorithm with a formal implementation of the general IBLT process. Second, we demonstrate the applicability of IBLT across a taxonomy of decision-making tasks varying in the number of agents, the number of actions, the number of decision options and states, and the type of delayed feedback. Third, we provide a new, open-source library, SpeedyIBL, that can handle the curse of exponential growth. SpeedyIBL allows users to create multiple IBL agents relying on IBLT with fast processing and response times while maintaining the decision characteristics of IBL models. Through simulation experiments, we demonstrate how IBL models provide predictions across a taxonomy of decision-making tasks of escalating complexity, and how SpeedyIBL becomes increasingly more efficient than an existing implementation, PyIBL (Morrison & Gonzalez, 2015), as the dimensions of the representation, the number of agents, and their interactions increase.

Instance-based learning theory

An updated view of the general decision process proposed in IBLT is illustrated in Fig. 1, and the current mechanisms of IBLT are made mathematically concrete in Algorithm 1 (Gonzalez et al., 2003).

Fig. 1 IBLT process from Gonzalez et al. (2003)

The process starts with the observation of the environmental state, and the determination of whether there are past experiences in memory (i.e., instances) that are similar to the current environmental state (i.e., recognition). Whether there are similar past instances will determine the process used to generate the expected utility of a decision alternative (i.e., judgment). If there are past experiences that are similar to the current environmental state, the expected utility of such an alternative is calculated via a process of blending past instances from memory; but if there are no similar past instances, then the theory suggests that a heuristic is used to generate the expected utility, instead. After judgment, the option with the highest expected utility is maintained in memory and a decision is made as to whether to stop the exploration of additional alternatives and execute the current best decision (i.e., choice) or to continue exploring new alternatives (i.e., exploration loop). When the exploration process ends, the choice that has the highest expected utility is executed, which changes the environment (i.e., execution loop). The loop from recognition to execution continues over time, and the result from a decision may be observed from the environment (i.e., feedback) immediately or with delay from the execution of a choice. Such a decision result (e.g., a reward) is used to update the utility of past instances in memory through a credit assignment mechanism.

In IBLT, an “instance” is a memory unit that results from the potential alternatives evaluated. These memory representations consist of three elements which are constructed over time: a situation state s, which is composed of a set of features f; a decision or action a taken corresponding to an alternative in state s; and an expected utility or experienced outcome x of the action taken in a state.

Each instance in memory has an activation value, which represents how readily available that information is in memory, and it is determined by the similarity to past situations, recency, frequency, and noise, according to the activation equation in ACT-R (Anderson & Lebiere, 2014). The activation of an instance is used to determine the probability of retrieving that instance from memory, which is a function of its activation relative to the activation of all instances corresponding to the same state in memory. The expected utility of a choice option is calculated by blending past outcomes. This blending mechanism for choice has its origins in a more general blending formulation (Lebiere, 1999), but a simplification of this mechanism is often used in models with discrete choice options, defined as the sum of all past experienced outcomes weighted by their probability of retrieval (Gonzalez & Dutt, 2011; Lejarraga et al., 2012). This formulation of blending represents the general idea of an expected value in decision-making, where the probability is a cognitive probability, a function of the activation equation in ACT-R. Algorithm 1 provides a formal representation of the general IBL process.

Algorithm 1 The general decision process of IBLT
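To make the structure of Algorithm 1 concrete, the following is a minimal, non-vectorized Python sketch of the decision loop described above. It is illustrative only: the class name IBLAgent and its methods are hypothetical and do not correspond to the PyIBL or SpeedyIBL APIs, and the partial-matching and exploration steps are omitted for brevity.

```python
import math
import random

class IBLAgent:
    """Minimal, non-vectorized sketch of the decision process in Algorithm 1.
    Hypothetical names; partial matching and exploration are omitted."""

    def __init__(self, default_utility=0.1, decay=0.5, noise=0.25):
        self.default_utility = default_utility   # used when no similar instance exists
        self.decay = decay                        # d in Eq. 1
        self.noise = noise                        # sigma in Eq. 1
        self.temperature = noise * math.sqrt(2)   # tau = sigma * sqrt(2)
        self.memory = {}                          # option -> {outcome: [timestamps]}
        self.t = 0                                # current time step

    def _activation(self, timestamps):
        # Recency/frequency term of Eq. 1 plus logistic noise.
        base = math.log(sum((self.t - tp) ** (-self.decay) for tp in timestamps))
        xi = random.uniform(1e-10, 1 - 1e-10)
        return base + self.noise * math.log((1 - xi) / xi)

    def blended_value(self, option):
        # Judgment: Eqs. 2 and 3, or the default utility if no instance is stored.
        instances = self.memory.get(option)
        if not instances:
            return self.default_utility
        outcomes, activations = zip(
            *[(x, self._activation(ts)) for x, ts in instances.items()]
        )
        weights = [math.exp(a / self.temperature) for a in activations]
        total = sum(weights)
        return sum(x * w / total for x, w in zip(outcomes, weights))

    def choose(self, options):
        # Choice: select the option with the maximum blended value (Eq. 4).
        self.t += 1
        return max(options, key=self.blended_value)

    def respond(self, option, outcome):
        # Feedback: store the instance (option, outcome) with the current timestamp.
        self.memory.setdefault(option, {}).setdefault(outcome, []).append(self.t)
```

In this sketch, memory is a dictionary mapping each option to its experienced outcomes and the timestamps at which they were observed, anticipating the representation formalized below.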

Concretely, for an agent, an option k = (s,a) is defined by taking action a after observing state s.

At time t, assume that there are \(n_{kt}\) different instances \((k_{i},x_{ik_{i}t})\), for \(i = 1,...,n_{kt}\), associated with k. Each instance i in memory has an activation value, which represents how readily available that information is in memory, and is expressed as follows (Anderson & Lebiere, 2014):

$$ \begin{array}{@{}rcl@{}} {\varLambda}_{ik_{i}t} &=& \ln{\left( \sum\limits_{t^{\prime} \in T_{ik_{i}t} }(t-t^{\prime})^{-d}\right)} + \alpha\sum\limits_{j}Sim_{j}({f^{k}_{j}},f^{k_{i}}_{j}) \\&&+ \sigma\ln{\frac{1-\xi_{ik_{i}t}}{\xi_{ik_{i}t}}}, \end{array} $$
(1)

where d, α, and σ are the decay, mismatch penalty, and noise parameters, respectively, and \(T_{ik_{i}t}\subset \{0,...,t-1\}\) is the set of the previous timestamps in which the instance i was observed, \({f_{j}^{k}}\) is the j-th attribute of the state s, and \(Sim_{j}\) is the similarity function associated with the j-th attribute. The second term is a partial matching process reflecting the similarity between the current state s and the state of the option \(k_{i}\). The rightmost term represents noise that captures individual variation in activation, where \(\xi _{ik_{i}t}\) is a random number drawn from a uniform distribution U(0,1) at each timestep, for each instance and option.

The activation of an instance i is used to determine the probability of retrieving that instance from memory, which is defined by a soft-max function as follows:

$$ P_{ik_{i}t} = \frac{e^{{\varLambda}_{ik_{i}t}/\tau}}{{\sum}_{j = 1}^{n_{kt}}e^{{\varLambda}_{jk_{j}t}/\tau}}, $$
(2)

where τ is the temperature of the Boltzmann (soft-max) distribution. For simplicity, τ is often defined as a function of the same noise parameter σ used in the activation equation: \(\tau = \sigma \sqrt {2}\).

The expected utility of option k is calculated based on blending as specified in choice tasks (Lejarraga et al., 2012; Gonzalez & Dutt, 2011):

$$ V_{kt} = {\sum}_{i=1}^{n_{kt}}P_{ik_{i}t}x_{i k_{i} t}. $$
(3)

The choice rule is to select the option that corresponds to the maximum blended value. In particular, at the l-th step of an episode, the agent selects the option \((s_{l},a_{l})\) with

$$ a_{l} = \arg\max_{a\in A} V_{(s_l,a)t} $$
(4)

The flag delayed on line 14 of Algorithm 1 is true when the agent learns the real outcome only after making a sequence of decisions without feedback. In such a case, the agent updates the outcomes of past instances using one of the credit assignment mechanisms (Nguyen et al., 2021). It is worth noting that when the flag delayed is set to true depends on the specific task. For instance, delayed can be set to true when the agent reaches the terminal state, or when the agent receives a positive reward.

SpeedyIBL implementation

From Algorithm 1, we observe that the computational cost of an IBL model revolves around the computations on lines 6 (Eq. 1), 7 (Eq. 2), and 8 (Eq. 3), and around the storage of instances with their associated timestamps on line 13.

Clearly, when the number of state and action variables (dimensions) grows, or the number of IBL agent objects increases, the execution of steps 6 to 8 in Algorithm 1 will directly increase the execution time. The “speedy” version of IBL (i.e., SpeedyIBL) is a library focused on performing these computations more efficiently.

The SpeedyIBL algorithm is the same as Algorithm 1; the innovation is in how the computations are carried out. Equations 1, 2, and 3 are replaced with Eqs. 6, 7, and 8, respectively (as explained below). Our idea is to take advantage of vectorization, which typically refers to the process of applying a single instruction to a set of values (a vector) in parallel, instead of executing a single instruction on a single value at a time. In general, this idea can be implemented in any programming language. We implemented it in Python, since that is how PyIBL is implemented (Morrison & Gonzalez, 2015).
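As a toy illustration of this idea (not SpeedyIBL code), consider computing the recency term \(\sum _{t^{\prime}}(t-t^{\prime})^{-d}\) of Eq. 1 with an explicit Python loop versus a single vectorized NumPy expression:

```python
import numpy as np

t, d = 1000, 0.5
timestamps = list(range(1, t))          # previous observations of one instance

# Loop version: one scalar power operation per timestamp.
recency_loop = sum((t - tp) ** (-d) for tp in timestamps)

# Vectorized version: the same instruction applied to the whole array at once.
T = np.array(timestamps, dtype=float)
recency_vec = np.sum((t - T) ** (-d))

assert np.isclose(recency_loop, recency_vec)
```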

Technically, the memory of an IBL model is stored using a dictionary \(\mathcal M\) that, at time t, is represented as follows:

$$ \mathcal M = \biggl\{k_{i}: \{x_{ik_{i}t}: T_{ik_{i}t}, ...\}, ...\biggr\}, $$
(5)

where \((k_{i},x_{ik_{i}t},T_{ik_{i}t})\) is an instance i that corresponds to selecting option ki and achieving outcome \(x_{ik_{i}t}\) with \(T_{ik_{i}t}\) being the set of the previous timestamps in which the instance i is observed.
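For example, such a memory dictionary might look as follows after a few trials of a binary choice task (the options, outcomes, and timestamps are hypothetical):

```python
# Illustrative memory after a few trials of a binary choice task.
# Each option k_i maps each experienced outcome to the list of timestamps
# T_{ik_it} at which that instance was observed (hypothetical values).
memory = {
    ("state0", "A"): {3.0: [1, 4, 5]},           # safe option: always paid 3
    ("state0", "B"): {4.0: [2, 6], 0.0: [3]},    # risky option: paid 4 twice, 0 once
}
```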

To vectorize the code, we convert \(T_{ik_{i}t}\) to a NumPy array (Harris et al., 2020), on which standard mathematical operations can be performed with built-in NumPy functions that operate on entire arrays of data without the need to write loops.

After this conversion, \(T_{ik_{i}t}\) is treated as a NumPy array. In addition, since we may use a common similarity function for several attributes, we assume that f is partitioned into J non-overlapping groups f[1],...,f[J] with respect to the distinct similarity functions \(Sim_{1},...,Sim_{J}\); that is, f[j] contains the attributes that use the same similarity function \(Sim_{j}\). We denote by \(S(f^{k},f^{k_{i}})\) the second term of Eq. 1, computed by:

$$ S(f^{k},f^{k_{i}}) = \sum\limits_{j=1}^{J}\texttt{sum}\left( Sim_{j}(f^{k}[j],f^{k_{i}}[j])\right), $$

where \(\texttt{sum}\) adds up the element-wise similarities within each group.
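A minimal sketch of how such a grouped similarity term could be computed with NumPy is shown below; the function name, the attribute grouping, and the similarity functions are illustrative assumptions, not the SpeedyIBL implementation.

```python
import numpy as np

def partial_match(f_k, f_ki, groups, sims):
    """Compute S(f^k, f^{k_i}) with the attributes partitioned into groups
    that share the same similarity function (illustrative sketch)."""
    total = 0.0
    for idx, sim in zip(groups, sims):
        # One similarity function is applied to a whole group of attributes at once.
        total += np.sum(sim(f_k[idx], f_ki[idx]))
    return total

# Hypothetical example: attributes 0-1 use a linear similarity, attribute 2 exact match.
f_k = np.array([0.2, 0.8, 1.0])
f_ki = np.array([0.3, 0.8, 0.0])
groups = [np.array([0, 1]), np.array([2])]
sims = [lambda a, b: 1.0 - np.abs(a - b),
        lambda a, b: (a == b).astype(float)]
print(partial_match(f_k, f_ki, groups, sims))   # 0.9 + 1.0 + 0.0 = 1.9
```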

Hence, the activation value (see Eq. 1) can be computed quickly and efficiently as follows:

$$ \begin{array}{@{}rcl@{}} {\varLambda}_{ik_{i}t} &=& \texttt{math.log}(\texttt{sum}(\texttt{pow}(t-T_{ik_{i}t},-d))) \\&&+ \alpha*S(f^{k},f^{k_{i}}) \\&&+ \sigma*\texttt{math.log}((1-\xi_{ik_{i}t})/\xi_{ik_{i}t}). \end{array} $$
(6)

With vectorization, operations such as pow can be performed on multiple elements of the array at once, rather than by looping through the elements and executing them one at a time. Similarly, the retrieval probability (see Equation 2) is now computed by:

$$ P_{kt} := [P_{1k_{1}t},...,P_{n_{kt}k_{n_{kt}}t}] = v/\texttt{sum}(v), $$
(7)

where \(v = \texttt {math.exp}([{\varLambda }_{1k_{1}t},...,{\varLambda }_{n_{kt}k_{n_{kt}}t}]/\tau )\). The blended value (see Equation 3) is now computed by:

$$ V_{kt} = \texttt{sum}(x_{kt}*P_{kt}), $$
(8)

where \(x_{kt}: = [x_{1k_{1}t},...,x_{n_{kt}k_{n_{kt}}t}]\) is a NumPy array that contains all the outcomes associated with the option k.
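Putting Eqs. 6, 7, and 8 together, a minimal NumPy sketch of the vectorized computation for a single option might look as follows; this is illustrative code under the memory layout above, not the SpeedyIBL source, and the partial-matching term is omitted.

```python
import numpy as np

def blended_value(option_memory, t, d=0.5, sigma=0.25):
    """Vectorized sketch of Eqs. 6-8 for one option k.
    option_memory maps each outcome to a NumPy array of past timestamps;
    the partial-matching term is omitted for brevity."""
    tau = sigma * np.sqrt(2)
    outcomes = np.array(list(option_memory.keys()))
    # Eq. 6: activation of each instance, vectorized over its timestamps.
    xi = np.random.uniform(1e-10, 1.0, size=len(outcomes))
    activations = np.array(
        [np.log(np.sum((t - ts) ** (-d))) for ts in option_memory.values()]
    ) + sigma * np.log((1 - xi) / xi)
    # Eq. 7: retrieval probabilities via a soft-max over activations.
    v = np.exp(activations / tau)
    p = v / np.sum(v)
    # Eq. 8: blended value as the probability-weighted sum of outcomes.
    return np.sum(outcomes * p)

# Example with the hypothetical memory of a risky option from a binary choice task.
memory_B = {4.0: np.array([2.0, 6.0]), 0.0: np.array([3.0])}
print(blended_value(memory_B, t=10))
```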

Experiments: Demonstration of the general applicability of IBLT

To demonstrate the applicability of IBLT across a wide range of decision tasks, as well as to assess the efficiency of SpeedyIBL, we compare the performance of SpeedyIBL against a regular Python implementation of the IBL algorithm (Algorithm 1), PyIBL (Morrison & Gonzalez, 2015), in six different tasks that were selected to represent different dimensions of complexity in dynamic decision-making tasks (Gonzalez et al., 2005).

A taxonomy of individual and multi-agent decision-making tasks

Generally, computational cognitive science has taken advantage of the availability of large amounts of behavioral data to advance the “explanation” of cognitive processes involved in various types of tasks, notably, decision-making (Griffiths, 2015). These models often make excellent predictions of human choices in a particular task. However, for the advancement of cognitive science, it is generally important not to simply make accurate predictions in a specific task but to also provide general explanations and understanding of how and why people behave the way they do across tasks.

The development of computational cognitive models that are based on cognitive theories is expected to provide predictive power without a heavy reliance on data (Hofman et al., 2021). IBLT is a general postulation of mechanisms and processes that are globally applicable to families of dynamic decision tasks, rather than being dependent on the requirements of a particular task. In this section, we present a taxonomy of decision-making tasks that IBLT can address.

Table 1 provides an overview of six dimensions that vary across six different decision-making tasks: (1) the number of agents, (2) the number of actions, (3) the complexity of the states, (4) the number of choice options (i.e., alternatives), (5) similarity across states, and (6) feedback delays. The table also presents the six tasks that were selected to illustrate how IBLT can handle these dimensions. Although we selected these six specific tasks to illustrate the generality of IBLT, it is important to note that the theory is applicable to any diversity of tasks within these dimensions; for example, IBLT can handle any number of agents, actions, and other task complexities.

Table 1 Taxonomy of decision-making dimensions, and the illustration of six decision-making tasks

In terms of the number of agents, we selected four single-agent tasks, one task with two agents, and one task with three agents. The tasks selected for demonstration have between two and nine potential actions, and the number of states and choice options also varies from just a few to a very large number. We also include one task that requires similarity judgments across states (i.e., partial matching in Eqs. 1 and 6) and five tasks that do not use similarity judgments. Finally, we include one task with immediate feedback and five tasks that involve feedback delays.

We describe each of the tasks below, starting from the simplest task (repeated binary choice) and moving up in the level of task complexity. The binary choice task has only one state and two options; the Insider attack task is a two-stage game in which players observe the features of six targets, choose one, and then decide whether to continue or withdraw the attack. We then scale up to a larger number of states and actions in significantly more complex tasks: the Minimap task, representing a search and rescue mission, and the Ms. Pac-Man task have a larger number of discrete state-action variables. Next, we scale up to two multi-agent tasks: the Fireman task, which has two agents and four actions, and the Cooperative navigation task, in which three agents navigate and cooperate to accomplish a goal. The number of agents increases the memory computation, since each agent adds its own variables to the joint state-action space. Based on these dimensions of increasing complexity, we expect that SpeedyIBL’s benefits over PyIBL will be larger as the complexity of the task increases.

Binary choice

In each trial, the agent is required to choose one of two options: Option A or Option B (as illustrated in Fig. 2). A numerical outcome, drawn from a distribution after the selection, is the immediate feedback of the task. This is a well-studied problem in the risky choice literature (Hertwig et al., 2004), where individuals make decisions under uncertainty. Unknown to the agent, options A and B each draw their outcome from a predefined distribution. One option is safe and yields a fixed medium outcome (i.e., 3) every time it is chosen. The other option is risky, and it yields a high outcome (4) with probability 0.8 and a low outcome (0) with the complementary probability 0.2.

Fig. 2 Binary choice task

An IBL model of this task has been created and reported in various past studies (e.g., Gonzalez & Dutt, 2011; Lejarraga et al., 2012). Here, we conducted simulations of 1000 runs of 100 trials. We also ran the experiment with 5000 trials to more clearly highlight the difference between PyIBL and SpeedyIBL. The default utility x0 was set to 4.4. For each option k, where k is either A or B, we consider all the generated instances of the form (k,x), where x is an outcome. The performance is determined by the average proportion of choices of the option with the maximum expected reward (PMax).
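For illustration, a single run of this simulation could be set up as follows, reusing the hypothetical IBLAgent sketch from the Instance-based learning theory section; this is not PyIBL or SpeedyIBL code, and the assignment of the safe and risky payoffs to Options A and B is an arbitrary choice for the example.

```python
import random
# Reuses the hypothetical IBLAgent class sketched earlier in this paper.

agent = IBLAgent(default_utility=4.4)        # default utility as in this task
choices = []
for trial in range(100):
    choice = agent.choose(["A", "B"])
    if choice == "A":                        # safe option: fixed outcome of 3
        outcome = 3
    else:                                    # risky option: 4 w.p. 0.8, else 0
        outcome = 4 if random.random() < 0.8 else 0
    agent.respond(choice, outcome)
    choices.append(choice)

# PMax: proportion of choices of the option with the maximum expected reward
# (here the risky option, since 0.8 * 4 = 3.2 > 3).
print(sum(c == "B" for c in choices) / len(choices))
```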

Insider attack game

The insider attack game is an interactive task designed to study the effect of signaling algorithms in cyber deception experiments (e.g., Cranford et al., 2018). Figure 3 illustrates the interface of the task, including a representation of the agent (insider attacker) and information about six computers. Each of the six computers is “protected” with some probability (determined by a defense algorithm). Each computer displays its monitoring probability, its potential outcomes, and information about the signal. When the agent selects one of the six computers, a signal is presented to the agent (based on the defense signaling strategy), which informs the agent whether the computer is monitored or not. The agent then makes a second decision after the signal: whether to continue or withdraw the attack on the pre-selected computer. If the agent attacks a computer that is monitored, the agent loses points, but if the computer is not monitored, the agent wins points. The signals are, therefore, truthful or deceptive. If the agent withdraws the attack, it earns zero points.

Fig. 3 Insider attack game

In each trial, the agent must decide which of the six computers to attack, and whether to continue or withdraw the attack after receiving a signal. An IBL model of this task has been created and reported in past studies (e.g., Cranford et al., 2019). We performed simulations of 1000 runs of 100 episodes. For each option (s,a), where the state s comprises the features of the computers, including the reward, the penalty, and the probability that the computer is being monitored (see Cranford et al., 2019 for more details), and a ∈{1,...,6} is the index of a computer, we consider all the generated instances of the form \((s^{\prime },a,x)\), with \(s^{\prime }\) being a state and x being an outcome. The performance is determined by the average collected reward.

Search and rescue in Minimap

The Minimap task is inspired by a search and rescue scenario, which involves an agent being placed in a building with multiple rooms and tasked with rescuing victims (Nguyen & Gonzalez, 2021a). Victims are scattered across the building, and their injuries have different degrees of severity, with some needing more urgent care than others. In particular, there are 34 victims grouped into two categories (24 green victims and ten yellow victims). There are many obstacles (walls) placed in the paths, forcing the agent to look for alternative routes. The agent’s goal is to rescue as many victims as possible. The task is simulated as a 93 × 50 grid of cells, which represents one floor of the building (see Fig. 4). Each cell is either empty, an obstacle, or a victim. The agent can choose to move left, right, up, or down, and moves only one cell at a time.

Fig. 4 Search and rescue map. The empty cells are white and the walls are black. The yellow and green cells represent the locations of the yellow and green victims, respectively. The cell with the red color represents the start location of the agent

The agent receives a reward of 0.75 for rescuing a yellow victim and 0.25 for rescuing a green victim. Moving into an obstacle or an empty cell is penalized by 0.05 or 0.01, respectively. Since the agent might have to make a sequence of decisions to rescue a victim, we update the previous instances with the positive outcome once the agent receives it.

An IBL model of this task has been created and reported in past studies (Gulati et al., 2021). Here we created the SpeedyIBL implementation of this model to perform the simulation of 100 runs of 100 episodes. An episode terminates when a 2500-trial limit is reached or when the agent successfully rescues all the victims. After each episode, all rescued victims are placed back at the locations from which they were rescued, and the agent restarts from the pre-defined start position.

In this task, a state s is represented by a gray-scale image (array) of the same size as the map. We use the following pixel values to represent the entities in the map: s[x][y] = 240 if the agent is located at coordinate (x,y), 150 if a yellow victim is located at (x,y), 200 if a green victim is located at (x,y), 100 if an obstacle is located at (x,y), and 0 otherwise. For each option (s,a), where s is a state and a is an action, we consider all the generated instances of the form (s,a,x), with x being an outcome. The default utility was set to 0.1. The flag delayed is set to true if the agent rescues a victim, and false otherwise. The performance is determined by the average collected reward.
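For illustration, the gray-scale state encoding described above can be built as a NumPy array; the entity coordinates below are hypothetical, and hashing the array (e.g., with tobytes()) is one possible way, not necessarily SpeedyIBL’s, to use it as part of an option key in memory.

```python
import numpy as np

# Sketch of the gray-scale state encoding for the 93 x 50 Minimap grid.
# The pixel values follow the text; the entity coordinates are hypothetical.
AGENT, YELLOW, GREEN, WALL, EMPTY = 240, 150, 200, 100, 0

state = np.full((93, 50), EMPTY, dtype=np.uint8)
state[10, 20] = AGENT      # agent at coordinate (10, 20)
state[5, 30] = YELLOW      # a yellow victim
state[40, 12] = GREEN      # a green victim
state[0, :] = WALL         # a wall along one edge of the map

# One possible way to use the array as part of an option key in memory:
option = (state.tobytes(), "up")
```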

Ms. Pac-Man

The next task considered in the experiment is the Ms. Pac-Man game, a benchmark for evaluating agents in machine learning (e.g., Hasselt et al., 2016). The agent maneuvers Ms. Pac-Man through a maze while she eats the dots (see Fig. 5).

Fig. 5 Ms. Pac-Man game

In this particular maze, there are 174 dots, and each one is worth 10 points. A level is finished when all the dots have been eaten. To make things more difficult, there are also four ghosts in the maze who try to catch Ms. Pac-Man, and if they succeed, she loses a life. Initially, she has three lives and gets an extra life after reaching 10,000 points. There are four power-up items in the corners of the maze, called power dots (worth 40 points). After Ms. Pac-Man eats a power dot, the ghosts turn blue for a short period, slow down, and try to escape from her. During this time, she is able to eat them, which is worth 200, 400, 800, and 1600 points, successively. The point values are reset to 200 each time another power dot is eaten, so the agent would want to eat all four ghosts per power dot. If a ghost is eaten, its remains hurry back to the center of the maze, where the ghost is reborn. At certain intervals, a fruit appears near the center of the maze and remains there for a while. Eating this fruit is worth 100 points.

We use the MsPacman-v0 environment developed by OpenAI Gym, where a state is represented by a color image. Here, we developed an IBL model for this task and created the SpeedyIBL implementation of this model to perform the simulation of 100 runs of 100 episodes. An episode terminates when a 2500-step limit is reached, when Ms. Pac-Man successfully eats all the dots, or when she loses three lives. As in the Minimap task, for each option (s,a), where s is a state and a is an action, we consider all the generated instances of the form (s,a,x), with x being an outcome. The flag delayed is set to true if Ms. Pac-Man receives a positive reward, and false otherwise. The performance is determined by the average collected reward.

Fireman

The Fireman task replicates coordination in a firefighting service, in which agents need to pick up matching items to extinguish a fire. This task was used for examining deep reinforcement learning agents (Palmer et al., 2019). In the experiment, the task is simulated in a gridworld of size 11 × 14, as illustrated in Fig. 6. Two agents, A1 and A2, located within the gridworld are tasked with locating an equipment pickup area and choosing one of the firefighting items. Afterwards, they need to navigate and find the location of the fire (F) to extinguish it. The task is fully cooperative, as both agents are required to extinguish one fire. More importantly, the location of the fire changes in every episode.

Fig. 6 Fireman game

The agents receive a collective reward according to the match between their selected firefighting items, which is determined by the payoff matrix in Table 2. The matrix is derived from a partially stochastic climbing game (Matignon et al., 2012) that has a stochastic reward: if both agents select equipment E2, they get 14 points with probability 0.5, and 0 otherwise. The Fireman task thus has both stochastic and dynamic properties.

Table 2 Payoff matrix

Here we developed an IBL model for this task and created the SpeedyIBL implementation of the model to perform simulations of 100 runs of 100 episodes. An episode terminates when a 2500-trial limit is reached or when the agents successfully extinguish the fire. After each episode, the fire is placed at a new random location and the agents restart from the pre-defined start positions.

As in the search and rescue Minimap task, a state s of agent A1 (resp. A2) is represented by a gray-scale image of the same size as the gridworld, using the following pixel values to represent the entities in the gridworld: s[x][y] = 240 (resp. 200) if agent A1 (resp. A2) is located at coordinate (x,y), 55 if the fire is located at (x,y), 40 if equipment E1 is located at (x,y), 50 if equipment E2 is located at (x,y), 60 if equipment E3 is located at (x,y), 100 if an obstacle is located at (x,y), and 0 otherwise. Moreover, we assume that the agents cannot observe the relative positions of each other, and hence their states do not include the pixel values of the other agent. For each option (s,a), where s is a state and a is an action, we consider all the generated instances of the form (s,a,x), with x being an outcome. The flag delayed is set to true if the agents finish the task, and false otherwise. The performance is determined by the average collected reward.

Cooperative navigation

In this task, three agents (A1, A2, and A3) must cooperate through physical actions to reach a set of three landmarks (L1, L2, and L3) shown in Fig. 7 (see Lowe et al., 2017). The agents can observe the relative positions of the other agents and landmarks, and are collectively rewarded based on the number of landmarks that they cover. For instance, if all the agents cover only one landmark, L2, they receive one point. By contrast, if they cover all three landmarks, they get the maximum of three points. Simply put, the agents want to cover all landmarks, so they need to learn to coordinate which landmark each of them covers.

Fig. 7 Cooperative navigation

Here we developed an IBL model for this task and created the SpeedyIBL implementation of the model to perform simulations of 100 runs of 100 episodes. An episode terminates when a 2500-trial limit is reached or when each of the agents covers one landmark. After each episode, the landmarks are placed at random locations and the agents restart from the pre-defined start positions.

In this task, a state s is also represented by a gray-scale image of the same size as the gridworld, using the following pixel values to represent the entities in the environment: s[x][y] = 240 if agent A1 is located at coordinate (x,y), 200 if agent A2 is located at (x,y), 150 if agent A3 is located at (x,y), 40 if landmark L1 is located at (x,y), 50 if landmark L2 is located at (x,y), 60 if landmark L3 is located at (x,y), and 0 otherwise. For each option (s,a), where s is a state and a is an action, we consider all the generated instances of the form (s,a,x), with x being an outcome. The flag delayed is set to true if the agents receive a positive reward, and false otherwise. The performance is determined by the average collective reward.

General simulation methods

All the experiments were conducted on a PC with an AMD Ryzen 9 3.00-GHz processor, 16 GB of RAM, and 8 cores, running Python 3.7.4 and NumPy 1.19.2. A detailed guide on how to use the SpeedyIBL package is available at https://github.com/DDM-Lab/SpeedyIBL, and the Appendix provides a detailed tutorial, including installation of the SpeedyIBL library and examples of how to replicate our demonstrations in the tasks offered in this paper.

The parameter values configured in the IBL models were identical for the SpeedyIBL and PyIBL implementations. In particular, we used the decay d = 0.5 and noise σ = 0.25. The default utility values were generally set higher than the maximum value obtainable in the task to induce exploration, as suggested in Lejarraga et al. (2012) (see the task descriptions for specific values), and they were set the same for PyIBL and SpeedyIBL.

For each of the six tasks, we compared the performance of PyIBL and SpeedyIBL implementations in terms of (i) running time measured in seconds and (ii) performance. The performance measure is identified within each task.

We conducted 1000 runs of the models, with each run performing 100 episodes, for the Binary choice and Insider attack tasks. Given the running time required for PyIBL, we only ran 100 runs of 100 episodes for the remaining tasks. We note that an episode of the Binary choice and Insider attack tasks has one step (trial), while the remaining tasks have up to 2500 steps within each episode.

The credit assignment mechanisms in IBL are studied in Nguyen and Gonzalez (2020a). In this paper, we used an equal credit assignment mechanism for all tasks. This mechanism assigns the current outcome to all the actions that took place from the current state back to the last state at which the agent started or at which the flag delayed was true.
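The following is a minimal sketch of such an equal credit assignment update, assuming the model buffers the options chosen since the last point at which the flag delayed was true; the function and the memory layout follow the illustrative dictionary structure of Eq. 5 and are not the SpeedyIBL implementation.

```python
def equal_credit_update(memory, trajectory, outcome):
    """Illustrative equal credit assignment: assign the delayed outcome to every
    option chosen since the last point at which the flag delayed was true.
    memory follows the option -> {outcome: [timestamps]} layout of Eq. 5;
    trajectory is a list of (option, timestamp) pairs buffered by the model."""
    for option, timestamp in trajectory:
        memory.setdefault(option, {}).setdefault(outcome, []).append(timestamp)
    trajectory.clear()    # start a new segment once the feedback has been applied
```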

Results

In this section, we present the results of the SpeedyIBL and PyIBL models across all the considered tasks. The comparison of these implementations is first provided in terms of average running time and performance, and then in terms of their learning curves.

Average running time and performance

Table 3 shows the overall average computational time, and Table 4 the average performance, across the runs of 100 episodes. The ratio in Table 3 indicates the speed improvement from running the model in SpeedyIBL over PyIBL.

Table 3 Average running time in seconds of a run
Table 4 Average performance of a run of 100 episodes

The ratio of PyIBL running time to SpeedyIBL running time in Table 3 shows that the benefit of SpeedyIBL over PyIBL increases significantly with the complexity of the task. In a simple task such as binary choice, SpeedyIBL performs 1.14 times faster than PyIBL. However, the speed-up ratio increases in tasks with higher-dimensional state spaces; for example, in Minimap, SpeedyIBL was 279 times faster than PyIBL, and in Ms. Pac-Man, SpeedyIBL was 1450 times faster than PyIBL.

Furthermore, the multi-agent tasks exhibit the largest ratio benefit of SpeedyIBL over PyIBL. For example, in the Cooperative navigation task, PyIBL took about 2.7 h to finish a run, whereas SpeedyIBL took only 2.59 s to accomplish a run.

In all tasks, we observe that the computational time of SpeedyIBL is significantly shorter than running the same task in PyIBL; we also observe that there is no significant difference in the performance of SpeedyIBL and PyIBL (p > 0.05). These results suggest that SpeedyIBL is able to greatly reduce the execution time of an IBL model without compromising its performance.

Learning curves

Figure 8 shows the comparison of average running time (middle column) and average performance (right column) between PyIBL (blue) and SpeedyIBL (green) across episodes for all the six tasks.

Fig. 8 The comparison between SpeedyIBL (green line) and PyIBL (blue line) over time in the considered tasks

In the binary choice task, there is a small difference in execution time across the 100 episodes, with SpeedyIBL performing slightly faster than PyIBL. To illustrate how the benefit of SpeedyIBL over the PyIBL implementation increases significantly as the number of episodes increases, we ran these models over 5000 episodes. The results in Fig. 9 illustrate the curse of exponential growth very clearly: PyIBL’s execution time grows exponentially with more episodes. The benefit of SpeedyIBL over the PyIBL implementation is clear with increased episodes. The PMax curves of SpeedyIBL and PyIBL overlap, again suggesting no difference in their performance.

Fig. 9 The comparison between SpeedyIBL and PyIBL over 5000 episodes of the binary choice task

In the Insider attack game, as shown in Fig. 8a, the relation between SpeedyIBL and PyIBL in terms of computational time again shows an increasing benefit as the number of episodes increases. Their running time is indistinguishable initially, but the difference becomes distinct over the last 60 episodes. Regarding performance (i.e., average reward), again, their performance over time is nearly identical. Learning in this task was more difficult given its design, and we do not observe a clear upward trend in the learning curve, due to the presence of stochastic elements in the task.

In all the remaining tasks (Minimap, Ms. Pac-Man, Fireman, and Cooperative navigation), given the multi-dimensionality of their representations and the number of agents involved in the Fireman and Cooperative navigation tasks, the curse of exponential growth is observed from early on, as shown in Fig. 8b. The processing time of PyIBL grows nearly exponentially over time in all cases. The curve of SpeedyIBL also increases, but it appears nearly constant relative to the exponential growth of PyIBL, given the significant difference between the two when plotted on the same scale.

The performance over time is again indistinguishable between PyIBL and SpeedyIBL. Depending on the dynamics and stochastic elements of the task, the models’ learning curves may fluctuate over time (e.g., Ms. Pac-Man), but when the scenarios are consistent over time, the models show similar learning curves for both PyIBL and SpeedyIBL.

Discussion and conclusions

Cognitive models are used increasingly to make predictions of human behavior and simulate the process by which humans make decisions from experience (Cranford et al., 2020; Nguyen & Gonzalez, 2020b; Nguyen et al., 2021). In particular, many computational models have been developed relying on IBLT (Gonzalez et al., 2003). These IBL models have demonstrated how human decision processes are captured and characterized (Gonzalez & Dutt, 2011), and most importantly, they provide evidence for the application and usefulness of the theory.

In this paper, we present an updated account of IBLT: the current formalization of its theoretical components and a comprehensive and precise presentation of the mechanisms of the theory. We aimed at improving the clarity of IBLT and at describing the mechanisms behind its general process with precise mathematical representations and an algorithmic implementation. Crucially, we demonstrated the generality of the theory and its ability to predict human learning from experience in a wide variety of decision-making tasks. That is, we provided a demonstration of how models grounded in the same theory can be applied to and handle decision-making tasks varying in the number of agents, the number of actions, the number of decision options and states, and the type of feedback delays.

We observed that implementing IBL models for these tasks using an existing library, PyIBL (Morrison & Gonzalez, 2015), comes at a practical cost: it is difficult to deal with the exponential growth of the memory of instances as more observations accumulate over time, which leads directly to an exponential slowdown of the computation when the characteristics of the tasks escalate from single-agent to multi-agent and multi-state settings. Such a problem is referred to as the curse of exponential growth, a common computational problem that emerges in many modeling approaches involving tabular computations. Clearly, resolving the curse of exponential growth becomes even more urgent when IBL models are expected to be increasingly used in interactive, real-time tasks that involve humans and models working together, similar to what has been shown recently in a number of RL initiatives (Carroll et al., 2019; Strouse et al., 2021).

To that end, we have developed a new implementation of IBL cognitive models, called SpeedyIBL, that not only employs a proper data structure for storing memory more efficiently, but also leverages parallel computation through vectorization (Larsen & Amarasinghe, 2000) to speed up IBL models in the presence of the curse of exponential growth. We have assessed the robustness of SpeedyIBL by comparing it with PyIBL, a benchmark implementation of IBL models in Python (Morrison & Gonzalez, 2015), across a taxonomy of decision-making tasks of increasing complexity. We specifically demonstrated that the SpeedyIBL implementation performs considerably faster than PyIBL without compromising task performance. Moreover, the results also indicate that the difference in the running time of SpeedyIBL and PyIBL becomes pronounced in high-dimensional state spaces and in multi-agent domains wherein more agents concurrently collaborate on a task.

Overall, we have introduced the SpeedyIBL implementation, which enables researchers to create multiple IBL agents relying on IBLT with fast processing and response times. SpeedyIBL can not only be used in simulation experiments involving extended learning time, but can also be integrated into browser-based applications in which IBL agents interact with human subjects in real time. Given that the computation time of cognitive models in the literature is often overlooked, we believe that the techniques used in SpeedyIBL will be particularly useful for many other ACT-R cognitive models that are still built upon a heavyweight framework programmed in LISP. In that respect, numerous examples can be cited, including a cognitive multi-agent model (Reitter & Lebiere, 2011), a cognitive model for human–robot interaction (Lebiere et al., 2013), a hybrid model consisting of a deep RL agent and a cognitive model (Mitsopoulos et al., 2021), and many other models in the ACT-R literature. Moreover, given that research on human–machine behavior has attracted much attention lately, we are convinced that SpeedyIBL will bring significant benefits to researchers and demonstrate the usefulness of IBL models in interactive tasks with human players.

Transparency and openness

SpeedyIBL is provided as a free and open-source Python library. All the code, extensive documentation, simulation data, and all scripts used for the analyses presented in this manuscript are available on GitHub (https://github.com/DDM-Lab/SpeedyIBL) and on OSF (https://osf.io/gwqte/). In addition, the Appendix provides a detailed tutorial, including installation of the SpeedyIBL library and examples of how to replicate our demonstrations in the tasks offered in this paper.