Simple models predict behavior at least as well as behavioral scientists. Dillon Bowen. arXiv, August 3, 2022. https://arxiv.org/abs/2208.01167
Abstract: How accurately can behavioral scientists predict behavior? To answer this question, we analyzed data from five studies in which 640 professional behavioral scientists predicted the results of one or more behavioral science experiments. We compared the behavioral scientists’ predictions to random chance, linear models, and simple heuristics like “behavioral interventions have no effect” and “all published psychology research is false.” We find that behavioral scientists are consistently no better than, and often worse than, these simple heuristics and models. Behavioral scientists’ predictions are not only noisy but also biased. They systematically overestimate how well behavioral science “works”: overestimating the effectiveness of behavioral interventions, the impact of psychological phenomena like time discounting, and the replicability of published psychology research.
Keywords: Forecasting, Behavioral science
3 Discussion
Critical public policy decisions depend on predictions from behavioral scientists. In this
paper, we asked how accurate those predictions are. To answer this question, we compared
the predictions of 640 behavioral scientists to those of simple mathematical models on five
prediction tasks. Our sample included a variety of behavioral scientists: economists, psychologists, and business professionals from academia, industry, and government. The prediction
tasks also covered various domains, including text-message interventions to increase vaccination rates, behavioral nudges to increase exercise, randomized controlled trials, incentives
to encourage effort, and attempts to reproduce published psychology studies. The models
to which we compared the behavioral scientists were deliberately simple, such as random
chance, linear interpolation, and heuristics like “behavioral interventions have no effect” and
“all published psychology research is false.”
We consistently found that behavioral scientists are no better than, and often worse than, these simple heuristics and models. In the exercise, flu, and RCT studies, null models significantly outperformed behavioral scientists. These null models assume that behavioral treatments have no effect: behavioral interventions will not increase weekly gym visits, text messages will not increase vaccination rates, and nudges will not change behavior. As we can
see in Table 1, compared to behavioral scientists, null models are nearly indistinguishable
from the oracle.
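To make this kind of comparison concrete, the minimal sketch below (in Python, with invented effect sizes and forecasts that are not data from any of the five studies) scores a set of hypothetical predictions against a null model that forecasts zero effect for every treatment, taking the oracle to be a predictor that matches each observed effect exactly and therefore scores zero error.

```python
# Minimal sketch: compare hypothetical forecasts to a "no effect" null model.
# All numbers are invented for illustration; they are not data from the paper.
import numpy as np

observed = np.array([0.05, -0.02, 0.10, 0.01])   # observed treatment effects
forecast = np.array([0.40, 0.25, 0.55, 0.30])    # hypothetical (inflated) forecasts
null = np.zeros_like(observed)                   # "behavioral interventions have no effect"

def mae(pred, truth):
    """Mean absolute error; the oracle (pred == truth) scores 0 by definition."""
    return np.mean(np.abs(pred - truth))

print(f"forecaster MAE: {mae(forecast, observed):.3f}")  # 0.340
print(f"null model MAE: {mae(null, observed):.3f}")      # 0.045, far closer to the oracle
```

The error metrics used in the studies themselves may differ; the point is only that when true effects are small, predicting zero is a surprisingly strong baseline.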
In the effort study, linear interpolations performed at least as well as professional economists.
These interpolations assumed that all psychological phenomena are inert: people do not exhibit risk aversion, time discounting, or biases like framing effects.
In the reproducibility study, professional psychologists’ Brier scores were virtually identical to those of a null model, which assumed that all published psychology research is false.
Professional psychologists were significantly worse than both linear regression and random
chance.
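For reference, the Brier score is the mean squared difference between a probabilistic forecast and the binary outcome, so lower is better. The sketch below (Python; the replication outcomes and forecast probabilities are invented for illustration) shows why a null model that assigns probability zero to every replication attempt scores exactly the base replication rate, and how forecasts that are systematically too high can end up in the same neighborhood.

```python
# Minimal sketch of the Brier score: mean squared error of probabilistic forecasts
# against binary outcomes (1 = replicated, 0 = did not). All values are invented.
import numpy as np

outcomes = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])      # hypothetical replication results
psychologists = np.array([0.8, 0.7, 0.6, 0.9, 0.5,
                          0.7, 0.6, 0.8, 0.7, 0.6])       # hypothetical, systematically high
null_model = np.zeros(len(outcomes))                      # "all published research is false"

def brier(prob, outcome):
    return np.mean((prob - outcome) ** 2)

print(f"psychologists: {brier(psychologists, outcomes):.3f}")  # 0.289
# An all-zero forecast has a Brier score equal to the base replication rate:
print(f"null model:    {brier(null_model, outcomes):.3f}")     # 0.300 == outcomes.mean()
```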
Notably, the linear regression model used data from the reproducibility study that were not accessible to psychologists during their participation. While this is not a fair comparison, we believe it is a useful one: the linear regression model can serve as a benchmark for future attempts to predict reproducibility.
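As a rough illustration of what such a benchmark might look like (the study features, outcomes, and leave-one-out evaluation below are assumptions made for the sketch, not the specification of the model used in our analysis), one could regress replication outcomes on features of the original studies and score held-out predictions with the Brier score:

```python
# Rough sketch of a regression benchmark for predicting reproducibility.
# Features (original p-value and effect size) and outcomes are invented;
# the actual model in the analysis may use different inputs.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X = np.array([[0.001, 0.8], [0.040, 0.2], [0.030, 0.3],
              [0.002, 0.6], [0.049, 0.1], [0.010, 0.5]])  # [p-value, effect size] per study
y = np.array([1, 0, 0, 1, 0, 1])                          # 1 = replicated, 0 = did not

# Leave-one-out predictions avoid scoring each study with a model trained on it.
pred = np.clip(cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut()), 0, 1)
print(f"benchmark Brier score: {np.mean((pred - y) ** 2):.3f}")
```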
Why is it so hard for behavioral scientists to outperform simple models? One possible
answer is that human predictions are noisy while model predictions are not [Kahneman et al.,
2021]. Indeed, there is likely a selection bias in the prediction tasks we analyzed. Recall that
most of the prediction tasks asked behavioral scientists to predict the results of ongoing
or recently completed studies. Behavioral scientists presumably spend time researching
questions that have not been studied exhaustively and do not have obvious answers. In this
case, the prediction tasks were likely exceptionally challenging, and behavioral scientists’
expertise would be of little use.
However, behavioral scientists’ predictions are not only noisy but also biased. Previous
research noted that behavioral scientists overestimate the effectiveness of nudges [DellaVigna
and Linos, 2022, Milkman et al., 2021]. Our research extends these findings, suggesting that
behavioral scientists believe behavioral science generally “works” better than it does. Behavioral scientists overestimated the effectiveness of behavioral interventions in the exercise, flu, and RCT studies. In the exercise study, behavioral scientists significantly overestimated
the effectiveness of all 53 treatments, even after correcting for multiple testing. Economists
overestimated the impact of psychological phenomena in the effort study, especially for motivational crowding out, time discounting, and social preferences. Finally, psychologists
significantly overestimated the replicability of published psychology research in the reproducibility study. In general, behavioral scientists overestimate not only the effect of nudges,
but also the impact of psychological phenomena and the replicability of published behavioral
science research.
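Regarding the multiple-testing correction mentioned above for the 53 exercise-study treatments, the sketch below shows one standard procedure (the Holm step-down method via statsmodels) applied to a handful of invented p-values; it is a generic illustration rather than necessarily the specific correction used in the analysis.

```python
# Sketch of a multiple-testing correction across many treatment-level tests.
# The p-values are invented; the analysis may use a different correction procedure.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.0001, 0.003, 0.0004, 0.02, 0.0008])   # hypothetical per-treatment p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")

for p, pa, r in zip(pvals, p_adj, reject):
    print(f"raw p = {p:.4f}  adjusted p = {pa:.4f}  significant after correction: {r}")
```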
Behavioral scientists’ bias can have serious consequences. A recent study found that
policymakers were less supportive of an effective climate change policy (carbon taxes) when
a nudge solution was also available [Hagmann et al., 2019]. However, accurately disclosing
the nudge’s impact shifted support back towards carbon taxes and away from the nudge
solution. In general, when behavioral scientists exaggerate the effectiveness of their work,
they may drain support and resources from potentially more impactful solutions.
Our results raise many additional questions. For example, is it only behavioral scientists
who are biased, or do people, in general, overestimate how well behavioral science works?
The general public likely has little exposure to RCTs, social science experiments, and academic psychology publications, so there is no reason to expect that they are biased in either
direction. Then again, the little exposure they have had likely gives an inflated impression
of behavioral science’s effectiveness. For example, a TED talk with 64 million views as of May 2022 touted the benefits of power posing, whereby one can gain improved self-confidence and become more likely to succeed in life by adopting a powerful pose for one
minute [Carney et al., 2010, Cuddy, 2012]. However, the power posing literature was based
on p-hacked results [Simmons and Simonsohn, 2017], and researchers have since found that
power posing yields no tangible benefits [Jonas et al., 2017].
Additionally, people may generally overestimate effects due to the “What you see is
all there is” (WYSIATI) bias [Kahneman, 2011]. For example, the exercise study asked behavioral scientists to consider, among other treatments, how much more people would
exercise if researchers told them they were “gritty.” After the initial “gritty diagnosis,”
dozens of other factors determined how often participants in that condition went to the gym
during the following four-week intervention period. Work schedule, personal circumstances,
diet, mood changes, weather, and many other factors also played key roles. These other
factors may not have even crossed the behavioral scientists’ minds. The WYSIATI bias
may have caused them to focus on the treatment and ignore the noise of life that tempers
the treatment’s signal. Of course, this bias is likely to cause everyone, not only behavioral
scientists, to overestimate the effectiveness of behavioral interventions and the impact of
psychological phenomena.
If people generally overestimate how well behavioral science works, are they more or
less biased than behavioral scientists? Experimental economics might suggest that behavioral scientists are less biased because people with experience tend to be less biased in their
domain of expertise. For example, experienced sports card traders are less susceptible to
the endowment effect [List, 2004], professional traders exhibit less ambiguity aversion than
novices [List and Haigh, 2010], experienced bidders are immune to the winner’s curse [Harrison and List, 2008], and CEOs who regularly make high-stakes decisions are less susceptible
to possibility and certainty effects [List and Mason, 2011]. Given that most people have zero experience with behavioral science, this argument implies that they should be more biased than behavioral scientists.
Then again, there are at least three reasons to believe that behavioral scientists should be
more biased than the general population: selection bias, selective exposure, and motivated
reasoning. First, behavioral science might select people who believe in its effectiveness. On
the supply side, students who apply to study psychology for five years on a measly PhD
stipend are unlikely to believe that most psychology publications fail to replicate. On the
demand side, marketing departments and nudge units may be disinclined to hire applicants
who believe their work is ineffective. Indeed, part of the experimental economics argument
is that markets filter out people who make poor decisions [List and Millimet, 2008]. The opposite may be true of behavioral science: the profession might filter out people with an
accurate assessment of how well behavioral science works.
Second, behavioral scientists are selectively exposed to research that finds large and statistically significant effects. Behavioral science journals and conferences are more likely to
accept papers with significant results. Therefore, most of the literature behavioral scientists
read promotes the idea that behavioral interventions are effective and psychological phenomena substantially influence behavior. However, published behavioral science research often
fails to replicate. Lack of reproducibility plagues not only behavioral science [Collaboration, 2012, 2015, Camerer et al., 2016, Mac Giolla et al., 2022] but also medicine [Freedman
et al., 2015, Prinz et al., 2011], neuroscience [Button et al., 2013], and genetics [Hewitt, 2012,
Lawrence et al., 2013]. Scientific results fail to reproduce for many reasons, including publication bias, p-hacking, and fraud [Simmons et al., 2011, Nelson et al., 2018]. Indeed, most
evidence that behavioral scientists overestimate how well behavioral science works involves
asking them to predict the results of nudge studies. However, there is little to no evidence
that nudges work after correcting for publication bias [Maier et al., 2022]. Even when a
study successfully replicates, the effect size in the replication study is often much smaller
than that reported in the original publication [Camerer et al., 2016, Collaboration, 2015].
For example, the RCT study paper estimates that the academic literature overstates nudges’
effectiveness by a factor of six [DellaVigna and Linos, 2022].
Finally, behavioral scientists might be susceptible to motivated reasoning [Kunda, 1990,
Epley and Gilovich, 2016]. As behavioral scientists, we want to believe that our work is meaningful, effective, and true. Motivated reasoning may also drive selective exposure [Bénabou
and Tirole, 2002]. We want to believe our work is effective, so we disproportionately read
about behavioral science experiments that worked.
Our analysis finds mixed evidence on the relationship between experience and bias in behavioral science. The RCT study informally examined the relationship between experience
and bias for behavioral scientists predicting nudge effects and concluded that more experienced scientists were less biased. While we also estimate that more experienced scientists are
less biased, we do not find statistically significant pairwise differences between the novice,
moderately experienced, and most experienced scientists.
Even if the experimental economics argument is correct that behavioral scientists are
less biased than the general population, why are behavioral scientists biased at all? The experimental economics literature identifies two mechanisms to explain why more experienced
people are less biased [List, 2003, List and Millimet, 2008]. First, markets filter out people
who make poor decisions. Second, experience teaches people to think and act more rationally. We have already discussed that the first mechanism might not apply to behavioral
science. And, while our results are consistent with the hypothesis that behavioral scientists
learn from experience, they still suggest that even the most experienced behavioral scientists
overestimate the effectiveness of nudges. The remaining bias for the most experienced scientists is larger than the gap between the most experienced scientists and novices. Why has
experience not eliminated this bias entirely? Perhaps the effect of experience competes with
the forces of “What you see is all there is,” selection bias, selective exposure, and motivated
reasoning such that experience mitigates but does not eliminate bias in behavioral science.
Finally, how can behavioral scientists better forecast behavior? One promising avenue is
to use techniques that help forecasters predict political events [Chang et al., 2016, Mellers
et al., 2014]. For example, the best political forecasters begin with base rates and then adjust
their predictions based on information specific to the event they are forecasting [Tetlock and
Gardner, 2016]. Behavioral scientists’ predictions would likely improve by starting with the
default assumptions that behavioral interventions have no effect, psychological phenomena
do not influence behavior, and published psychology research has a one in three chance of
replicating [Collaboration, 2012]. Even though these assumptions are wrong, they are much
less wrong than what behavioral scientists currently believe.
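As a minimal sketch of this base-rate-and-adjust strategy (the anchoring weight and the task-specific signals below are illustrative assumptions, not estimates from our data), a forecaster could start at the suggested defaults and move only part of the way toward their study-specific impression:

```python
# Minimal sketch of base-rate-anchored forecasting. The base rates follow the
# defaults suggested in the text; the weight and "signal" values are invented.
def anchored_forecast(base_rate: float, signal: float, weight: float = 0.25) -> float:
    """Start at the base rate and move a fraction `weight` toward the task-specific signal."""
    return base_rate + weight * (signal - base_rate)

# Replication forecast: a one-in-three base rate, adjusted toward a (hypothetical)
# impression that this particular finding looks fairly solid.
print(anchored_forecast(base_rate=1 / 3, signal=0.8))    # ~0.45, well below the raw impression

# Effect-size forecast: default to "no effect", adjusted toward a small expected lift.
print(anchored_forecast(base_rate=0.0, signal=0.10))     # 0.025
```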