This paper presents evidence on the consistency of risk preferences with expected utility theory in a representative population sample. We find that consistency increases with task familiarity and is linked to several personal characteristics such as education, income and asset holdings. Moreover, we investigate the external validity of a laboratory experiment with a student population that implemented the same choice problems as our household panel study. We find that, in line with studies on other biases, deviations from rationality observed in the lab provide a lower bound for deviations in the population at large.

Recently, several studies have made significant progress in understanding risk preferences in populations, making use of innovative survey methods and field experiments (Harrison and List 2004) including game shows with large stakes (Post et al. 2008; Andersen et al. 2008). From the perspective of these studies, the present paper takes one step back by focussing on consistency of risk preferences with expected utility theory in a representative subject pool: well over 1,400 members of the CentERpanel, a representative sample of the Dutch population. We do this by falling back on the oldest consistency test of all, the Allais paradox (Allais 1953). Our results help to understand the reliability and robustness of investigations into the actual distribution of risk preferences in populations.

Our research strategy is threefold. First, we implement three different treatments in the main experiment with the panel. We analyze the original Allais question with payoffs of millions of Euros that, just as when Allais asked Savage, were purely hypothetical. In our second treatment we scaled the payments down but kept them hypothetical. Our third treatment used the same downscaled payoffs but paid them out for real. This enables us to examine to what extent violations are driven by lack of monetary incentives, on the one hand, and non-familiarity with large sums of money on the other.

Second, we are able to exploit the wide range of background information that is available for our subjects in order to study the roots of violations.Footnote 1 Which personal characteristics are correlated with violations? Are violations a matter of insufficient education or limited experience with financial decision making? Can we identify ‘problem groups’ that are, perhaps, more likely to suffer (in particular late in life) from erroneous financial decision making?

Third, we conduct a laboratory experiment with the usual laboratory subject population (students) employing the same design that we used in the panel experiment. Thus, we are able to examine the external validity of a laboratory experiment in a clear and detailed manner. In particular, we can compare whether and how a lab study can tell us something about the population at large.

Pursuing our threefold research strategy we are, thus, able to present very detailed and comprehensive evidence on the Allais paradox. Our results are useful for several practical issues: (1) they point to a number of conditions that make standard theoretical predictions more likely to hold, (2) they identify certain parts of the population that, due to inconsistencies, may have difficulties in making sound financial decisions, and (3) they contribute to a better understanding of what can be reliably learned from laboratory experiments.

Along the first dimension of our research strategy we find that violations in the original paradox are likely to be driven by very high payoffs with which, in real life, virtually nobody has any practical experience. Violations in the original Allais problem are roughly twice as frequent as in either downscaled version. This effect has been observed before with student samples (Conlisk 1989); we show that the pattern extends to the general population and across socioeconomic characteristics. Perhaps this result is not surprising, as it simply stresses that economic theory can be expected to work much better in environments with which agents have experience and to which they are, thus, well adapted. On the other hand, we find no substantial difference between the two downscaled versions. Whether subjects are incentivized or not, violations are much lower in both cases.Footnote 2

Along the second dimension, we are able to identify a whole array of personal characteristics that correlate with inconsistent decision making. Education, occupation, income and asset holdings all correlate with inconsistent decision making, and in each case the direction of the effect is as one would guess. The better educated are more consistent and so are those in employment, those who earn more and those who hold financial assets.

Finally, our methodological contribution reveals that the laboratory results are rather useful in predicting behavior in a general population. First, the relative treatment differences are precisely the same for both populations, panel and lab. Second, as demonstrated in a number of other studies (see Gächter et al. (2008) for a survey) the violations of standard theory observed in the lab provide a lower bound for violations observed in the population at large.Footnote 3

The remainder of the paper is organized as follows. In Section 1, we describe the main characteristics of the CentERpanel and introduce the experimental design. In Section 2 we present our results obtained with the panel. We first give a quick overview of the results and then present a more detailed analysis, based on regression results, that also accounts for the effect of sociodemographic characteristics. In Section 3 we introduce our lab results and compare them to those obtained in the panel. Section 4 concludes.

1 Design and data collection

We administer the original “Allais questions,” which consist of two pairwise lottery choices. Consider the following two choice problems. First, a subject is asked to choose between lotteries A and A* where

$$ A = \text{Certainty of € 1 Million} \quad\text{and}\quad A^{\ast} = \left\{ \begin{array}{l} \text{1/100 Chance of € 0} \\ \text{89/100 Chance of € 1 Million} \\ \text{10/100 Chance of € 5 Million} \end{array} \right. $$

Second, a subject is asked to choose between lotteries B and B* where

$$ B = \left\{ \begin{array}{l} \text{89/100 Chance of € 0} \\ \text{11/100 Chance of € 1 Million} \end{array} \right. \quad\text{and}\quad B^{\ast} = \left\{ \begin{array}{l} \text{90/100 Chance of € 0} \\ \text{10/100 Chance of € 5 Million} \end{array} \right. $$

Of the four possible answers AB, A*B*, AB*, and A*B, only the first two are consistent with expected utility theory (henceforth, EUT) whereas the last two are not.Footnote 4 Many laboratory experiments have shown that violations of EUT are frequent and that a larger share of subjects violating EUT chooses AB* instead of A*B.Footnote 5
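The consistency claim can be verified in one step. The following is a short sketch in notation introduced here for illustration: u denotes an arbitrary utility function over the three payoffs, written 0, 1M and 5M for short.

$$ \begin{aligned} A \succ A^{\ast} &\iff u(1\text{M}) > 0.01\,u(0) + 0.89\,u(1\text{M}) + 0.10\,u(5\text{M}) \iff 0.11\,u(1\text{M}) > 0.01\,u(0) + 0.10\,u(5\text{M}), \\ B \succ B^{\ast} &\iff 0.89\,u(0) + 0.11\,u(1\text{M}) > 0.90\,u(0) + 0.10\,u(5\text{M}) \iff 0.11\,u(1\text{M}) > 0.01\,u(0) + 0.10\,u(5\text{M}). \end{aligned} $$

Both comparisons reduce to the same inequality, so an expected utility maximizer who prefers A to A* must also prefer B to B*, and vice versa. The mixed patterns AB* and A*B cannot be rationalized by any u.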

We have three simple treatments using a between-subjects design. To introduce these treatments, consider the following lotteries over three monetary outcomes with probabilities as above, i.e., A = (0, 1, 0), A* = (.01, .89, .10), B = (.89, .11, 0), B* = (.90, 0, .10). Our three treatments were then as follows:

  • Treatment HighHyp: Original Allais questions with high hypothetical payoffs of € 0, € 1 million, and € 5 million.

  • Treatment LowHyp: Allais questions with low hypothetical payoffs of € 0, € 5, and € 25.

  • Treatment LowReal: Allais questions with low real payoffs of € 0, € 5, and € 25.

Note that the amounts of money we use in these treatments are the same as in Conlisk (1989) with the sole difference that he used dollars instead of euros. For each of the three treatments we had two sub-treatments that reversed the order of the two decisions, giving six treatment cells in total. As we do not find any order effects in the data, we pool the data throughout.
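As an illustration of the design, the sketch below encodes the four lotteries as probability vectors over the three outcomes and the two payoff scales used in the treatments. The function and variable names are hypothetical and chosen only for this example. It computes expected values for each payoff scale and checks numerically, for randomly drawn utility values, that the expected utility gap between A and A* always equals the gap between B and B*.

```python
from fractions import Fraction as F
import random

# Probabilities over (lowest, middle, highest) payoff, as given in the text.
LOTTERIES = {
    "A": (F(0), F(1), F(0)),
    "A*": (F(1, 100), F(89, 100), F(10, 100)),
    "B": (F(89, 100), F(11, 100), F(0)),
    "B*": (F(90, 100), F(0), F(10, 100)),
}

# Payoffs in euros: high stakes (HighHyp) and low stakes (LowHyp, LowReal).
PAYOFFS = {"high": (0, 1_000_000, 5_000_000), "low": (0, 5, 25)}


def expected_utility(name, utilities):
    """Probability-weighted sum of the utilities of the three outcomes."""
    return sum(p * u for p, u in zip(LOTTERIES[name], utilities))


def eu_gap(first, second, utilities):
    """Expected utility of `first` minus expected utility of `second`."""
    return expected_utility(first, utilities) - expected_utility(second, utilities)


def gaps_coincide(trials=1000, seed=0):
    """Check that EU(A) - EU(A*) equals EU(B) - EU(B*) for random utilities."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = tuple(sorted(F(rng.randint(0, 10**6)) for _ in range(3)))
        if eu_gap("A", "A*", u) != eu_gap("B", "B*", u):
            return False
    return True


if __name__ == "__main__":
    # With utility equal to money, expected utility is simply expected value.
    for scale, payoffs in PAYOFFS.items():
        values = {name: float(expected_utility(name, payoffs)) for name in LOTTERIES}
        print(scale, values)
    print("gaps coincide:", gaps_coincide())
```

Under both payoff scales the expected value of A* exceeds that of A and the expected value of B* exceeds that of B, so an expected value maximizer chooses A*B*. Exact fractions are used so that the equality check is not affected by floating-point rounding.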

We collected data from a representative sample of the Dutch population. The experiments were conducted by CentERdata—an institute for applied economic and survey research for the social sciences—that is affiliated with Tilburg University in the Netherlands. CentERdata carries out its survey research mainly by using its own panel called CentERpanel. This panel is Internet based and consists of some 2000 households in the Netherlands which form a representative sample of the Dutch population.Footnote 6 One of the advantages of the CentERpanel is that the researcher has access to background information for each panel member such as demographic and financial data. Every weekend, the panel members complete a questionnaire on the Internet from their home.

After logging on to our experiment, panel members were randomly assigned to one of the six different treatments introduced above. After being informed about the nature of the experiment, subjects decided whether or not to participate, as is common with many modules of the panel. For participating subjects, the next screen introduced an example of a pair of lotteries (which were referred to as “Options”). Subjects were told that their task would be to express a preference for one of the two lotteries and, additionally, how the preferred lottery would be executed.Footnote 7 When subjects indicated that they were ready to start the experiment, they were presented with their two Allais questions on two consecutive screens. Only after both Allais questions had been answered were the two preferred lotteries played out (by the computer) and subjects informed about the outcomes. In the treatments with real monetary payments, subjects were paid according to the outcomes in both of their preferred lotteries.Footnote 8

In total 1676 members of the CentERpanel logged on to our experiment. Of the subjects logging on, 1426 (85.1%) subjects decided to participate in our experiment while 250 (14.9%) subjects decided not to participate. Table 1 shows descriptive statistics of our sample. The column labeled “Participation” in Table 1 shows descriptive statistics of participating subjects in each of the three main treatments as well as statistics of subjects who chose not to participate in the experiment. The data in Table 1 is grouped according to gender, age, education, occupation and income. (The column labeled “Violation” shows statistics for participating subjects violating or not violating EUT, respectively, which we will analyze further below. It also contains tests on the role of socioeconomic characteristics for EUT violation which will also be discussed later.)

Table 1 Descriptive statistics of the samples

Concentrating on descriptive statistics for participating subjects in Table 1, we note that by and large most variables are similarly distributed across treatments. However, in some of the age and income brackets as well as in the category savings account, there is somewhat more variation. A comparison of the descriptive statistics in the columns describing participating subjects with those of non-participating subjects shows that there are no big differences except for the age categories. Basically, older people appear to be a little more reluctant to participate.

Since this causes concern about sample selection problems, we ran Heckman (1976) selection models for all regressions reported below, using the variable “Ratio” as one of the exclusion variables. The variable “Ratio” measures the proportion of questionnaires completed by panel members in the three months preceding our experiment. This variable can be assumed to affect the participation decision but not the decisions taken in the experiment. In none of the regressions did we find evidence for a selection bias.Footnote 9

2 Results

2.1 Descriptives

A summary of the experimental results is given in Table 2. The table shows both the absolute frequency of choices (left part) and the relative frequency of choices (right part). As mentioned in the introduction, we will concentrate our analysis on the incidence of subjects’ EUT violation in all treatments. However, we will also briefly address the question whether violations, once they occur, are systematic.

Table 2 Summary of experimental results in the panel

Violation of EUT

Note that the right-most column in Table 2 indicates that violations of EUT are observed in all treatments. In fact, we observe 49.4%, 19.6% and 25.6% violations of EUT in treatments HighHyp, LowHyp, and LowReal, respectively. Furthermore, in all treatments we observe that the fraction of EUT-violating AB* answers is higher than the fraction of EUT-violating A*B answers. The Z-statistic proposed in Conlisk (1989) indicates that the first fraction is significantly higher than the latter fraction at p < 0.001 in all treatments. An interesting question we can answer with our data is whether the differences we report here for the aggregate data are “general” in the sense of applying across socioeconomic attributes or whether they are driven by only some of those attributes. The answer is provided in Tables 5, 6, 7 and 8 in Appendix B, which are structured as Table 1 and provide, for all data and for the three treatments separately, the relative frequency of choices for subjects with various socioeconomic attributes. We observe that EUT violations occur across all socioeconomic attributes and that the “Allais” pattern of more AB* violations than A*B violations is significant for most socioeconomic attributes in all treatments (see the column labeled “Sign. of Conlisk’s Z-statistic” in Tables 5–8 in Appendix B). We conclude that, as in earlier studies, violations of EUT are observed and that they are systematic in the sense that AB* is chosen more often than A*B, mostly independent of socioeconomic background characteristics. To facilitate comparison, note that Conlisk (1989), using a student sample for his “Basic Version” (which is comparable to our treatment HighHyp), reports the following relative frequencies of AB, A*B*, AB*, and A*B choices: 7.6%, 41.9%, 43.6%, and 6.8%. Thus, he observes EUT violation in 50.4% of the cases, which compares to 49.4% in our panel treatment HighHyp.
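The two summary quantities used in this comparison can be reproduced from the four relative frequencies. The sketch below uses only the Conlisk (1989) figures quoted above; the helper names are hypothetical. It computes the overall violation rate and the share of violators who show the “Allais” pattern AB*. It does not implement Conlisk’s Z-statistic, which additionally requires the sample size.

```python
# Relative frequencies (percent) quoted in the text for Conlisk's "Basic Version".
conlisk_basic = {"AB": 7.6, "A*B*": 41.9, "AB*": 43.6, "A*B": 6.8}

# The two answer patterns that are inconsistent with EUT.
VIOLATING = ("AB*", "A*B")


def violation_rate(freq):
    """Total share (percent) of answers that violate EUT."""
    return sum(freq[pattern] for pattern in VIOLATING)


def allais_share(freq):
    """Fraction of violators choosing AB* rather than A*B."""
    return freq["AB*"] / violation_rate(freq)


if __name__ == "__main__":
    print(round(violation_rate(conlisk_basic), 1))  # 50.4
    print(round(allais_share(conlisk_basic), 3))    # 0.865
```

If violations were pure noise, one would expect the second number to be close to one half; a share well above one half is what is meant by violations being systematic.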

The effect of high versus small hypothetical payoffs

Next consider the effect of high versus small hypothetical payoffs on the extent of EUT violation. For this purpose we compare the rates of EUT violations in treatments HighHyp and LowHyp. Table 2 shows that the rate of EUT violations drops from 49.4% in treatment HighHyp to 19.6% in treatment LowHyp. The D-statistic proposed in Conlisk (1989) indicates that this difference is highly significant at p < 0.0001 (D = 9.115). Inspecting the relative frequencies of choices in Table 2 shows that moving from HighHyp to LowHyp sharply increases the fraction of choices consistent with expected value maximization (A*B*) at the expense of all three other possible responses. In particular, many more subjects prefer the payoff-maximizing choice A* over A when (hypothetical) payoffs become small. A possible explanation of this result is that subjects in treatment LowHyp can be expected to be more familiar with the lower amounts of money, leading them to make fewer mistakes.Footnote 10 Again, with our data we can check whether the result regarding the effect of varying the (hypothetical) stake size just shown for the aggregate data also applies when the data is broken down by socioeconomic characteristics. Column 3 labeled “Significance of Conlisk’s D-statistic HighHyp vs LowHyp” in Table 9 in Appendix B shows that the answer to this question is, with a few exceptions, yes.

The effect of (small) real versus (small) hypothetical payoffs

Finally, consider the effect of (small) real versus (small) hypothetical payoffs on the extent of EUT violation. To analyze this, compare the rates of EUT violation in treatments LowHyp and LowReal. Table 2 shows that the rate of EUT violations is 19.6% in LowHyp whereas it is 25.6% in treatment LowReal. Thus, we see a slight increase in the share of EUT violations when we move from (small) hypothetical to (small) real payoffs. The D-statistic in Conlisk (1989) indicates that this difference is significant (D = − 1.6716, p = 0.047). In contrast, Harrison (1994) and Burke et al. (1996) report that the use of low real instead of low hypothetical payoffs reduces the extent of EUT violation. For a broader overview on how incentives affect behavior in decisions under risk, see Camerer (1995, p. 634f). Note that the result regarding the switch from (small) hypothetical to (small) real payoffs on the extent of EUT violation is usually not significant when one zooms in on socioeconomic characteristics, as shown in column 4 labeled “Significance of Conlisk’s D-statistic LowHyp vs LowReal” in Table 9 in Appendix B.

Note that our results concerning the extent of EUT violation and the effect of high versus small hypothetical payoffs are not entirely new. We show, however, that they extend to a general population and across socioeconomic characteristics. This should be of interest due to the current discussion about the relationship between results obtained in the lab and those obtained in other settings (see, e.g., Levitt and List 2007).

Let us now turn to the first of the two new and main dimensions of our research strategy by inspecting the role of socioeconomic background variables in subjects’ behavioral responses to the Allais questions. Under the heading “Violation,” Table 1 shows descriptive statistics of the subsamples violating and not violating EUT as well as p-levels of χ² tests. (For the latter, see the notes below Table 1.) Regarding gender, Table 1 reveals that women are slightly more likely to violate EUT than men. With respect to age, Table 1 does not suggest a clear effect, although we note that the relative share of the age bracket [35–44] is higher in the panel’s subpopulation not violating EUT. Regarding education levels, those with lower secondary education and those with a university degree stand out somewhat in the panel: the former because they violate EUT more often and the latter because they violate EUT less often. The most noticeable effect regarding occupation is that those employed on a contractual basis have a higher relative share in the subsample not violating EUT. Finally, with respect to household income, Table 1 does not suggest a clear effect.

Moreover, the rightmost column of Table 1, labeled “p-value, χ²,” shows p-levels of χ² tests for differences between the proportions of violating and non-violating subjects in the category listed in column 1.Footnote 11 The χ² tests indicate the strongest differences in violation behavior in the categories of education, occupation and household income.

2.2 Econometrics and the role of socioeconomic characteristics

To test for across-treatment differences controlling for subjects’ sociodemographic characteristics and to check whether any of these characteristics are correlated with behavior, we ran probit regressions with the variable “Violate” as the dependent variable. “Violate” is equal to 1 if a subject’s answer to the Allais questions violates EUT (i.e., answers A*B or AB*), and is equal to 0 otherwise (i.e., answers AB or A*B*). The background variables we include in the regression are the ones shown in Table 1 above. The results are shown in Table 3, which reports marginal effects. Regression (1) includes all data whereas regressions (2) to (4) show results for each of the three treatments separately. Recall from the end of Section 1 that we did not find evidence for a selection bias due to non-response.
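For readers less familiar with this specification, a generic probit model of this kind can be written as follows. The notation is introduced here for illustration only: x_i collects the treatment dummies and background variables of subject i, and β is the coefficient vector. The text does not state whether the marginal effects in Table 3 are evaluated at sample means or averaged over subjects, so the expression below should be read as the textbook definition rather than the exact computation behind the table.

$$ \Pr(\text{Violate}_i = 1 \mid x_i) = \Phi(x_i^{\prime}\beta), \qquad \frac{\partial \Pr(\text{Violate}_i = 1 \mid x_i)}{\partial x_{ik}} = \phi(x_i^{\prime}\beta)\,\beta_k $$

Here Φ and φ denote the standard normal distribution and density functions. For a dummy regressor such as a treatment indicator, the marginal effect is usually reported as the discrete change in Φ when the dummy switches from 0 to 1, holding the other regressors fixed.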

Table 3 Results of probit regressions on violation of EUT

Let us first briefly reconsider across-treatment differences. For this purpose, refer to regression (1) in Table 3, which includes all data and controls for background variables. Importantly, note that in regression (1) the omitted treatment dummy is the one for LowHyp. Inspecting the treatment coefficients, we note that the coefficient for HighHyp is positive, large (0.302) and highly statistically significant, whereas the coefficient of LowReal is also positive (0.053) but rather small and only borderline significant.

To analyze the effect of socioeconomic background variables econometrically, we examine regression (1) in Table 3. We make the following observations.

  • Controlling for other characteristics, gender and age have no significant influence on the extent of EUT violation.Footnote 12

  • Regarding education, we find a strong tendency for violations to be reduced with further education.Footnote 13 Overall, there is a strong effect of higher education that also shows in the separate specifications for both treatments with low payoffs. In LowHyp everything that improves on primary education goes hand in hand with reduced violations. Only in HighHyp is there no effect of education. This suggests an interesting interaction effect of experience with a decision domain and education. In the absence of any experience (as in HighHyp) education on its own does little to improve performance. Only when coupled with experience is education aligned with consistency.

  • Of the various occupational affiliations listed in Table 3, we find that the unemployed and ‘others’ do much worse than the employed, self-employed and freelancers.Footnote 14 This is more pronounced in treatments with hypothetical payoffs.

  • Regarding income, we notice that having a higher gross monthly household income (vis-à-vis the control group with the lowest gross monthly household income) goes along with reduced EUT violations.Footnote 15 Interestingly, this is particularly pronounced in the treatment LowReal when actual money is at stake. (One could have conjectured that it would be the other way round as the marginal utility of making some money and, hence, the incentive to think a little harder might be higher for those on low incomes. Alas, it does not work this way.)

  • Finally, subjects holding assets have significantly lower EUT violations (by about 8%) whereas subjects with a savings account have significantly higher EUT violations (by about 5%). Maybe not surprisingly, subjects holding assets tend to be expected value maximizers (mainly choosing A*B*) while subjects who only have a savings account display “Allais” behavior tending toward the choice of AB*.Footnote 16

In all, a picture emerges that is reminiscent of recent studies by Benjamin et al. (2006), Burks et al. (2009) and Dohmen et al. (2010) who show that a range of behavioral biases are correlated with (or may even stem from) cognitive limitations and low IQ. We find that violations are more prevalent among those who have little education, are unemployed, have low incomes, and have no significant asset holdings. This is, of course, particularly worrying as imprudent financial decision making and bad planning for retirement have the worst consequences in that group.

In Appendix C we complement the above analysis by running multinomial logit regressions using all four answers AB, A*B*, AB*, and A*B, and choosing the answer representing expected value maximization, A*B*, as the base outcome. The results (whose interpretation is less straightforward) are shown in Tables 10, 11, 12 and 13.

3 The lab experiment

As mentioned in the introduction, the third dimension of our research strategy is concerned with the external validity of laboratory experiments that are typically carried out with rather homogeneous subject pools. Of course, the preceding section has shown that there are important sources of heterogeneity in the population at large that simply cannot be detected when the subject pool is restricted to students. The same is, of course, true for any highly selected convenience sample. But what about the questions we analyzed first: the effects of different treatments, the differences between high and low and between real and hypothetical payoffs? Would a lab experiment give us reliable results to analyze such questions (as has been implicitly assumed for a long time in the experimental community, perhaps negligently without much testing)? To shed more light on these issues we conducted an additional lab experiment in the laboratory of Tilburg University using Dutch-speaking student subjects drawn from the normal subject pool.

The lab experiment was conducted in the same way as the experiment using the CentERpanel. That is, student subjects did the experiment using a web browser in the lab and using the same screens as the subjects in the panel. However, there were two small exceptions. First, lab subjects received a 10 Euro show-up fee. (Potential participants were informed about this in the invitation e-mail.) But of course, mirroring the panel design again, only subjects assigned to treatments with real payment had the chance to earn additional money during the experiment. This was not announced prior to the experiment. Second, lab subjects were not offered the choice of not participating in the experiment once they had reported to the lab and the experiment was started. This was done in an effort to mimic the normal procedures in lab experiments where, by reporting to the lab, a subject usually confirms his or her decision to participate. Note that when we move from the panel to the lab sample, both the subject pool and the environment change. We deliberately accepted these two simultaneous changes as our aim was to contrast the results obtained in the panel with those obtained in a normal lab experiment.Footnote 17

After the experiment we asked subjects to fill in a questionnaire in which we elicited some basic background information. Naturally, the information we collected from lab subjects is very limited and cannot be compared in scope and quality to the background information available from members of CentERpanel. The lab experiments were conducted in December 2006 using 223 subjects in total.

As in the panel experiment we did not observe any order effects of presenting the Allais questions, so we present only pooled data in Table 4, which shows the same information for the lab data that Table 2 showed for the panel. We make the following observations. First, as in the panel experiments, we observe EUT violations in all treatments, although to a much lesser degree.Footnote 18 This mirrors the main result in Gächter et al.’s (2008) meta-study: Violations of orthodox theoretical predictions and biases observed in the lab form a lower bound for violations and biases observed in the population at large. Second, as in the panel, moving from high hypothetical payoffs to low hypothetical payoffs reduces the extent of EUT violation significantly (p < 0.001, D = 4.881). Third, moving from low hypothetical payoffs to low real payoffs increases the extent of EUT violation slightly but insignificantly (p = 0.226, D = −0.7525). The similarities between the observations in the panel and in the lab are evident.
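The reported p-values can be related to the D-statistics with a small calculation. The sketch below assumes that D is approximately standard normal under the null hypothesis of equal violation rates and that the reported p-values are one-sided; both are assumptions made here, not statements from the text. Under these assumptions the tail probabilities for the panel and lab D-values quoted in the text match the reported significance levels.

```python
import math


def upper_tail(z):
    """P(Z >= z) for a standard normal random variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# D-statistics as quoted in the text for the panel and the lab.
reported_d = {
    "panel HighHyp vs LowHyp": 9.115,
    "panel LowHyp vs LowReal": -1.6716,
    "lab HighHyp vs LowHyp": 4.881,
    "lab LowHyp vs LowReal": -0.7525,
}

if __name__ == "__main__":
    for label, d in reported_d.items():
        # One-sided tail in the direction of the observed difference.
        p = upper_tail(abs(d))
        print(f"{label}: D = {d:+.4f}, one-sided p = {p:.3g}")
```

For the two LowHyp versus LowReal comparisons this gives approximately 0.047 (panel) and 0.226 (lab), in line with the values reported in the text.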

Table 4 Summary of experimental results in the lab

Figure 1 shows the shares of choices violating EUT in the two subsamples. It appears that the graph indicating the share of EUT violation in the panel can quite accurately be obtained by shifting the graph indicating the share of EUT violation in the lab upwards by about 15 percentage points.Footnote 19 This means that although the share of EUT violations is consistently higher in the panel than in the lab, the comparative statics results of moving from one treatment to another could have been reliably predicted by the lab experiments.

Fig. 1 The share of choices violating EUT in the panel and the lab. Note: HighHyp stands for high hypothetical payoffs, LowHyp for low hypothetical payoffs, and LowReal for low real payoffs

4 Conclusions

Using a representative sample of the Dutch population we revisit the Allais paradox. Our main results are threefold. First, as in previous lab samples, the violations of EUT are systematic in the population at large and much less frequent when stakes are low. Second, there is considerable heterogeneity in the population and violations are particularly prevalent among those with little education, those poor in income and asset holdings, and the unemployed. Third, comparing the panel results with a laboratory experiment we find that the relative treatment differences are identical in the panel and the lab but violation rates in all lab treatments are about 15 percentage points lower than in the corresponding panel treatment.

Our findings appear to imply two general messages. First, laboratory experiments with convenience samples of students might be more useful for studying relative effects than absolute levels (see also Levitt and List (2007) who make a similar point in the context of social preferences). When it comes to the absolute measurement of behavior, it appears that lab results will paint too optimistic a picture. The population at large, it turns out, is less consistent with EUT than student samples are. Second, our results suggest that the predictive power of EUT in a general population is correlated with socioeconomic characteristics. In particular, parts of the population that are more likely to experience economic hardship are less consistent.

Of course, there exists a large literature on non-expected utility theories such as Kahneman and Tversky’s (1979) prospect theory or Machina’s (1982) fanning-out theory (both of which can explain the Allais paradox) or Viscusi’s (1989) prospective reference theory which predicts the paradox. Earlier laboratory experiments (see Camerer (1995) or Starmer (2000) for surveys) have documented the Allais paradox in student samples. Our paper highlights that, if anything, these studies underestimate the true prevalence of the paradox in general populations and indicates how violations are correlated with observable characteristics.