Saturday, September 10, 2022

Behavioral scientists are consistently no better than, and often worse than, simple heuristics and models at predicting behavior; why have markets & experience not eliminated their biases entirely?

Simple models predict behavior at least as well as behavioral scientists. Dillon Bowen. arXiv, August 3, 2022. https://arxiv.org/abs/2208.01167

Abstract: How accurately can behavioral scientists predict behavior? To answer this question, we analyzed data from five studies in which 640 professional behavioral scientists predicted the results of one or more behavioral science experiments. We compared the behavioral scientists’ predictions to random chance, linear models, and simple heuristics like “behavioral interventions have no effect” and “all published psychology research is false.” We find that behavioral scientists are consistently no better than - and often worse than - these simple heuristics and models. Behavioral scientists’ predictions are not only noisy but also biased. They systematically overestimate how well behavioral science “works”: overestimating the effectiveness of behavioral interventions, the impact of psychological phenomena like time discounting, and the replicability of published psychology research.

Keywords: Forecasting, Behavioral science

3 Discussion
Critical public policy decisions depend on predictions from behavioral scientists. In this paper, we asked how accurate those predictions are. To answer this question, we compared the predictions of 640 behavioral scientists to those of simple mathematical models on five prediction tasks. Our sample included a variety of behavioral scientists: economists, psychologists, and business professionals from academia, industry, and government. The prediction tasks also covered various domains, including text-message interventions to increase vaccination rates, behavioral nudges to increase exercise, randomized controlled trials, incentives to encourage effort, and attempts to reproduce published psychology studies. The models to which we compared the behavioral scientists were deliberately simple, such as random chance, linear interpolation, and heuristics like “behavioral interventions have no effect” and “all published psychology research is false.”

We consistently found that behavioral scientists are no better than - and often worse than - these simple heuristics and models. In the exercise, flu, and RCT studies, null models significantly outperformed behavioral scientists. These null models assume that behavioral treatments have no effect: behavioral interventions will not increase weekly gym visits, text messages will not increase vaccination rates, and nudges will not change behavior. As we can see in Table 1, compared to behavioral scientists, null models are nearly indistinguishable from the oracle. In the effort study, linear interpolations performed at least as well as professional economists. These interpolations assumed that all psychological phenomena are inert: people do not exhibit risk aversion, time discounting, or biases like framing effects. In the reproducibility study, professional psychologists’ Brier scores were virtually identical to those of a null model, which assumed that all published psychology research is false. Professional psychologists were significantly worse than both linear regression and random chance. Notably, the linear regression model used data from the reproducibility study, which were not accessible to psychologists during their participation. While this is not a fair comparison, we believe it is a useful one, as the linear regression model can serve as a benchmark for future attempts to predict reproducibility.

Why is it so hard for behavioral scientists to outperform simple models? One possible answer is that human predictions are noisy while model predictions are not [Kahneman et al., 2021]. Indeed, there is likely a selection bias in the prediction tasks we analyzed. Recall that most of the prediction tasks asked behavioral scientists to predict the results of ongoing or recently completed studies. Behavioral scientists presumably spend time researching questions that have not been studied exhaustively and do not have obvious answers. In this case, the prediction tasks were likely exceptionally challenging, and behavioral scientists’ expertise would be of little use.

However, behavioral scientists’ predictions are not only noisy but also biased. Previous research noted that behavioral scientists overestimate the effectiveness of nudges [DellaVigna and Linos, 2022, Milkman et al., 2021]. Our research extends these findings, suggesting that behavioral scientists believe behavioral science generally “works” better than it does. Behavioral scientists overestimated the effectiveness of behavioral interventions in the exercise, flu, and RCT studies. In the exercise study, behavioral scientists significantly overestimated the effectiveness of all 53 treatments, even after correcting for multiple testing. Economists overestimated the impact of psychological phenomena in the effort study, especially for motivational crowding out, time discounting, and social preferences. Finally, psychologists significantly overestimated the replicability of published psychology research in the reproducibility study. In general, behavioral scientists overestimate not only the effect of nudges, but also the impact of psychological phenomena and the replicability of published behavioral science research.

Behavioral scientists’ bias can have serious consequences. A recent study found that policymakers were less supportive of an effective climate change policy (carbon taxes) when a nudge solution was also available [Hagmann et al., 2019]. However, accurately disclosing the nudge’s impact shifted support back towards carbon taxes and away from the nudge solution. In general, when behavioral scientists exaggerate the effectiveness of their work, they may drain support and resources from potentially more impactful solutions.

Our results raise many additional questions. For example, is it only behavioral scientists who are biased, or do people, in general, overestimate how well behavioral science works? The general public likely has little exposure to RCTs, social science experiments, and academic psychology publications, so there is no reason to expect that they are biased in either direction. Then again, the little exposure they have had likely gives an inflated impression of behavioral science’s effectiveness. For example, a TED talk with 64 million views as of May 2022 touted the benefits of power posing, whereby one can improve self-confidence and become more likely to succeed in life by adopting a powerful pose for one minute [Carney et al., 2010, Cuddy, 2012]. However, the power posing literature was based on p-hacked results [Simmons and Simonsohn, 2017], and researchers have since found that power posing yields no tangible benefits [Jonas et al., 2017].

Additionally, people may generally overestimate effects due to the “What you see is all there is” (WYSIATI) bias [Kahneman, 2011]. For example, the exercise study asked behavioral scientists to consider, among other treatments, how much more people would exercise if researchers told them they were “gritty.” After the initial “gritty diagnosis,” dozens of other factors determined how often participants in that condition went to the gym during the following four-week intervention period. Work schedule, personal circumstances, diet, mood changes, weather, and many other factors also played key roles. These other factors may not have even crossed the behavioral scientists’ minds. The WYSIATI bias may have caused them to focus on the treatment and ignore the noise of life that tempers the treatment’s signal. Of course, this bias is likely to cause everyone, not only behavioral scientists, to overestimate the effectiveness of behavioral interventions and the impact of psychological phenomena.

If people generally overestimate how well behavioral science works, are they more or less biased than behavioral scientists? Experimental economics might suggest that behavioral scientists are less biased because people with experience tend to be less biased in their domain of expertise. For example, experienced sports card traders are less susceptible to the endowment effect [List, 2004], professional traders exhibit less ambiguity aversion than novices [List and Haigh, 2010], experienced bidders are immune to the winner’s curse [Harrison and List, 2008], and CEOs who regularly make high-stakes decisions are less susceptible to possibility and certainty effects [List and Mason, 2011]. Given that most people have zero experience with behavioral science, they should be more biased than behavioral scientists.

Then again, there are at least three reasons to believe that behavioral scientists should be more biased than the general population: selection bias, selective exposure, and motivated reasoning. First, behavioral science might select people who believe in its effectiveness. On the supply side, students who apply to study psychology for five years on a measly PhD stipend are unlikely to believe that most psychology publications fail to replicate. On the demand side, marketing departments and nudge units may be disinclined to hire applicants who believe their work is ineffective. Indeed, part of the experimental economics argument is that markets filter out people who make poor decisions [List and Millimet, 2008]. The opposite may be true of behavioral science: the profession might filter out people with an accurate assessment of how well behavioral science works.

Second, behavioral scientists are selectively exposed to research that finds large and statistically significant effects. Behavioral science journals and conferences are more likely to accept papers with significant results. Therefore, most of the literature behavioral scientists read promotes the idea that behavioral interventions are effective and psychological phenomena substantially influence behavior. However, published behavioral science research often fails to replicate. Lack of reproducibility plagues not only behavioral science [Collaboration, 2012, 2015, Camerer et al., 2016, Mac Giolla et al., 2022] but also medicine [Freedman et al., 2015, Prinz et al., 2011], neuroscience [Button et al., 2013], and genetics [Hewitt, 2012, Lawrence et al., 2013]. Scientific results fail to reproduce for many reasons, including publication bias, p-hacking, and fraud [Simmons et al., 2011, Nelson et al., 2018]. Indeed, most evidence that behavioral scientists overestimate how well behavioral science works involves asking them to predict the results of nudge studies. However, there is little to no evidence that nudges work after correcting for publication bias [Maier et al., 2022]. Even when a study successfully replicates, the effect size in the replication study is often much smaller than that reported in the original publication [Camerer et al., 2016, Collaboration, 2015]. For example, the RCT study paper estimates that the academic literature overstates nudges’ effectiveness by a factor of six [DellaVigna and Linos, 2022].

Finally, behavioral scientists might be susceptible to motivated reasoning [Kunda, 1990, Epley and Gilovich, 2016]. As behavioral scientists, we want to believe that our work is meaningful, effective, and true. Motivated reasoning may also drive selective exposure [Bénabou and Tirole, 2002]. We want to believe our work is effective, so we disproportionately read about behavioral science experiments that worked.

Our analysis finds mixed evidence of the relationship between experience and bias in behavioral science. The RCT study informally examined the relationship between experience and bias for behavioral scientists predicting nudge effects and concluded that more experienced scientists were less biased. While we also estimate that more experienced scientists are less biased, we do not find statistically significant pairwise differences between the novice, moderately experienced, and most experienced scientists.

Even if the experimental economics argument is correct that behavioral scientists are less biased than the general population, why are behavioral scientists biased at all? The experimental economics literature identifies two mechanisms to explain why more experienced people are less biased [List, 2003, List and Millimet, 2008]. First, markets filter out people who make poor decisions. Second, experience teaches people to think and act more rationally. We have already discussed that the first mechanism might not apply to behavioral science. And, while our results are consistent with the hypothesis that behavioral scientists learn from experience, they still suggest that even the most experienced behavioral scientists overestimate the effectiveness of nudges. The remaining bias for the most experienced scientists is larger than the gap between the most experienced scientists and novices. Why has experience not eliminated this bias entirely? Perhaps the effect of experience competes with the forces of “What you see is all there is,” selection bias, selective exposure, and motivated reasoning such that experience mitigates but does not eliminate bias in behavioral science.

Finally, how can behavioral scientists better forecast behavior? One promising avenue is to use techniques that help forecasters predict political events [Chang et al., 2016, Mellers et al., 2014]. For example, the best political forecasters begin with base rates and then adjust their predictions based on information specific to the event they are forecasting [Tetlock and Gardner, 2016]. Behavioral scientists’ predictions would likely improve by starting with the default assumptions that behavioral interventions have no effect, psychological phenomena do not influence behavior, and published psychology research has a one in three chance of replicating [Collaboration, 2012]. Even though these assumptions are wrong, they are much less wrong than what behavioral scientists currently believe.
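
To make the null-model and base-rate comparisons above concrete, here is a minimal Python sketch that scores three forecasting strategies with Brier scores on hypothetical replication outcomes. The outcomes and probabilities are invented for illustration; the paper's actual data, elicitation format, and scoring details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes: roughly one in three of 100 published findings replicates.
outcomes = rng.binomial(1, 1 / 3, size=100)

def brier(p, y):
    """Mean squared difference between forecast probabilities p and binary outcomes y (lower is better)."""
    return float(np.mean((np.asarray(p, dtype=float) - y) ** 2))

forecasts = {
    "null model: nothing replicates (p = 0)": np.zeros(outcomes.size),
    "base rate (p = 1/3)": np.full(outcomes.size, 1 / 3),
    "overconfident expert (p near 0.8)": np.clip(rng.normal(0.8, 0.1, size=outcomes.size), 0, 1),
}

for label, p in forecasts.items():
    print(f"{label:40s} Brier = {brier(p, outcomes):.3f}")
```

With outcomes like these, the base-rate forecaster scores best and the overconfident forecaster worst, with the blanket null model in between, which is the qualitative pattern that makes deliberately boring baselines so hard to beat.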

Both laypersons & police officers were worse at detecting deception when judging handcuffed suspects than when judging non-handcuffed suspects, though handcuffing did not affect their judgement bias; police officers were also overconfident in their judgements

Looking guilty: Handcuffing suspects influences judgements of deception. Mircea Zloteanu, Nadine L. Salman, Eva G. Krumhuber, Daniel C. Richardson. Journal of Investigative Psychology and Offender Profiling, September 7, 2022. https://doi.org/10.1002/jip.1597

Abstract: Veracity judgements are important in legal and investigative contexts. However, people are poor judges of deception, often relying on incorrect behavioural cues when these may reflect the situation more than the sender's internal state. We investigated one such situational factor relevant to forensic contexts: handcuffing suspects. Judges—police officers (n = 23) and laypersons (n = 83)—assessed recordings of suspects providing truthful and deceptive responses in an interrogation setting where half were handcuffed. Handcuffing was predicted to undermine efforts to judge veracity by constraining suspects' gesticulation and by priming stereotypes of criminality. It was found that both laypersons and police officers were worse at detecting deception when judging handcuffed suspects compared to non-handcuffed suspects, although handcuffing did not affect their judgement bias; police officers were also overconfident in their judgements. The findings suggest that handcuffing can negatively impact veracity judgements, highlighting the need for research on situational factors to better inform forensic practice.

7 DISCUSSION

The present research explored whether a situational factor related to interrogation procedures (i.e., the use of handcuffs on suspects) can negatively impact veracity judgements. Confirming our hypothesis, the handcuffing manipulation affected both laypersons' and police officers' ability to detect deception (i.e., H2 was supported; moderate effect size). Statements made by handcuffed suspects were harder to classify for both police officers and laypersons. Converting the handcuffing effect size (ξ = 0.37) to more intuitive estimates (as recommended by Fritz et al., 2012), we obtain a Number Needed to Treat (NNT) of 5.01, meaning that for roughly every five suspects interviewed in handcuffs, we would expect one additional misclassification of veracity. Equivalently, the Common Language (CL) effect size indicates a 64.3% probability that a suspect selected at random from the handcuffed condition is misclassified in terms of statement veracity relative to a suspect from the non-handcuffed condition. This decrease in accuracy was attributable to the study's manipulation affecting veracity discriminability rather than a shift in judgement response tendencies (H1 was not supported), as all judges remained truth-biased overall (H3 was not supported; NNT = 10.54, CL = 56.7%). For both judge groups, truths were easier to detect than lies (NNT = 12.02, CL = 55.9%; replicating the veracity effect; Levine et al., 1999).
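
For readers who want to recompute these conversions, the snippet below implements standard normal-model translations from a standardized mean difference d to the Common Language effect size (CL = Φ(d/√2)) and to an NNT via Furukawa's method with an assumed 50% baseline misclassification rate, in the spirit of Fritz et al. (2012). Treating the reported ξ as d/√2 approximately reproduces the figures above, but that scaling and the 50% baseline are our assumptions rather than the authors' documented procedure.

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf        # standard normal CDF
phi_inv = NormalDist().inv_cdf

def cl_and_nnt(d, control_event_rate=0.5):
    """Common Language effect size and Number Needed to Treat for a standardized
    mean difference d, assuming normal, equal-variance groups and a given baseline
    event rate (Furukawa-style conversion)."""
    cl = phi(d / sqrt(2))  # P(a random case from one group exceeds a random case from the other)
    nnt = 1 / (phi(d + phi_inv(control_event_rate)) - control_event_rate)
    return cl, nnt

# Assumption for illustration: treat the reported xi = 0.37 as d / sqrt(2).
d = 0.37 * sqrt(2)
cl, nnt = cl_and_nnt(d)
print(f"CL = {cl:.1%}, NNT = {nnt:.2f}")  # close to the reported 64.3% and 5.01
```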

Unsurprisingly, police officers did not perform better at judging veracity than laypersons (see Aamodt & Custer, 2006), and judging handcuffed suspects made this process even harder. However, the manipulation did not affect officers' response bias (H5 was not supported). This contrasts with research arguing for a veracity detection reversal in professionals (i.e., police officers showing higher lie detection, but lower truth detection, compared to laypersons; Meissner & Kassin, 2002). The similarity in response patterns with laypersons indicates that police officers were not overall more suspicious of suspects. This could, however, be due to the relatively junior sample of officers recruited (see Table 1) or, potentially, to the “suspects” being naïve students, which may have mitigated lie bias towards them; we note, however, that the instructions never mentioned the status of the suspects.

A more worrying result, in line with our prediction, is that police officers displayed higher confidence while being no more accurate than laypersons (i.e., H4 was supported; moderate-to-large effect size; NNT = 3.66, CL = 70.2%), even showing a trend towards lower accuracy (e.g., below-chance lie detection; NNT = 5.88, CL = 62.2%). This parallels findings that professionals tend to be overconfident in their veracity judgements (Aamodt & Custer, 2006; DePaulo & Pfeifer, 1986; Masip et al., 2016). While the police officers' level of experience may not have been sufficient to bias their judgements in the direction of a lie, it was able to increase their confidence in catching liars (e.g., Masip et al., 2016).

Overall, judges performed worse at discriminating veracity when viewing handcuffed suspects, supporting our assertions that situational factors can negatively impact the discriminability between deceptive and honest suspects (for a more detailed breakdown of the honesty scale data, see SI). Such effects may have serious ramifications for the forensic domain (Verschuere et al., 2016), especially when considering the already poor deception detection rates in the absence of the handcuffing manipulation. Interestingly, both laypersons and police officers were less confident in their judgements when they watched the handcuffed (vs. non-handcuffed) videos (NNT = 5.32, CL = 63.6%). Judges may have found deception detection more difficult when suspects were handcuffed, tempering their confidence.
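
One standard way to separate discriminability from response bias in this kind of design is a signal detection analysis. The sketch below computes d′ (sensitivity) and the criterion c (truth/lie bias) from hit and false-alarm rates for two hypothetical conditions; the rates are invented, and the paper's own analysis relies on nonparametric effect sizes and the honesty-scale data rather than this exact decomposition.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # probit (inverse normal CDF)

def sdt_indices(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion c). A 'hit' is correctly calling a lie a lie;
    a 'false alarm' is calling a truthful statement a lie. Positive c reflects a
    truth bias (reluctance to call statements lies)."""
    d_prime = z(hit_rate) - z(false_alarm_rate)
    c = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, c

# Invented rates illustrating lower discriminability but a similar truth bias when handcuffed.
conditions = {
    "non-handcuffed": (0.55, 0.35),  # (hit rate, false-alarm rate)
    "handcuffed": (0.48, 0.42),
}
for name, (hit, fa) in conditions.items():
    d_prime, c = sdt_indices(hit, fa)
    print(f"{name:15s} d' = {d_prime:.2f}, c = {c:.2f}")
```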

These results illustrate that situational elements can impact the perception and judgement of both laypersons and police officers. Reducing the impact of such artificial factors could improve forensic practices and deception detection procedures, whilst reducing the risk of potential miscarriages of justice. Such effects are especially pertinent in situations of judgement under uncertainty, where external and contextual information often influence the perception of ambiguous or ambivalent information (Masip et al., 2009; Mobbs et al., 2006). In line with research on investigative interviewing, it seems advisable that the space and circumstances under which an interrogation takes place be comfortable and not restrict the individual (Goodman-Delahunty et al., 2014; Kelly et al., 2013).

7.1 Future directions

The current work sought to highlight the effects of situational factors on veracity judgements, particularly in forensic contexts. Future research could elaborate on the different ways in which handcuffing affects senders and judges by separating its influence on suspect perceptions (e.g., handcuffs as a visual cue of criminality; Stiff et al., 1992) from its effect on suspects' ability to gesticulate (within-sender features). For this, handcuffed and non-handcuffed suspects' movements could be restricted by asking them, for example, to place their hands flat on a table throughout the interrogation. This would equalise nonverbal behaviour across conditions whilst leaving the presence/absence of handcuffs as the only factor differing between conditions. Alternatively, the videos could be edited to show the same suspect with or without handcuffs, revealing whether any impressions brought about by being handcuffed are due to the presence of external visual cues.

Considerations should also be given to the content of the stimuli themselves. An analysis of the videos may reveal verbal, paraverbal, and/or nonverbal cues which may aid in understanding the current findings. Such an investigation could uncover whether behavioural differences between the liars and truth-tellers are indeed reduced by handcuffing and whether the manipulation brings about differences in impression management (e.g., handcuffed suspects may “compensate” for their restricted gesticulation by modifying their speech and, by extension, their verbal cues may differ; see Verschuere et al., 2021).

Additionally, given the within-sender variability typically seen in deception research (Levine, 2010; Zloteanu, Bull, et al., 2021), the current stimulus set could be expanded to include a larger number of senders, which would provide more precise effect size estimates and reduce uncertainty (Levine et al., 2022). Future research should also employ a more in-depth statistical approach (i.e., multi-level modelling) that accounts for both sender and decoder variability, as sketched below. This may be especially relevant for understanding whether handcuffing interacts with senders' demeanour and judges' expectations. The manipulation may not affect all individuals to the same degree or in the same manner (see the DAG in the SI for the potential influence of within/between-subject and stimulus variance on the judgement process).
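
As one way to implement the multi-level approach suggested above, the sketch below fits a crossed random-effects model, with judges and senders as crossed variance components, to trial-level honesty ratings using statsmodels. The file name, column names, and the choice of a continuous Gaussian outcome are assumptions made for illustration; a binary accuracy outcome would instead call for a logistic mixed model (e.g., lme4's glmer in R).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed long-format data: one row per judge x video trial, with columns
# honesty_rating (continuous), handcuffed (0/1), veracity ("truth"/"lie"),
# judge_id, and sender_id. The file name is a placeholder.
df = pd.read_csv("judgement_trials.csv")

# statsmodels fits crossed random effects by treating the data as a single group
# and declaring each random factor as a variance component.
df["all_obs"] = 1
model = smf.mixedlm(
    "honesty_rating ~ handcuffed * veracity",
    data=df,
    groups="all_obs",
    re_formula="0",  # no random intercept for the dummy grouping variable
    vc_formula={"judge": "0 + C(judge_id)", "sender": "0 + C(sender_id)"},
)
result = model.fit()
print(result.summary())
```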

Subsequent work may also explore the effect of handcuffing on the relationship quality between suspect and interrogator (also, see SI). Due to the interactive nature of the interrogation task, handcuffs may have affected the rapport between the interrogator and suspect, which in turn could shape the behaviour of suspects (Kassin et al., 2003; Paton et al., 2018). The present manipulation demonstrates that deception detection does not happen in isolation. Future studies investigating veracity judgements should expand the range of factors being considered, both within the lab and in the real world.

7.2 Limitations

The issue of generalisability in the deception field is rarely addressed; nonetheless, a few elements of the current research must be considered. First, the lies told by suspects concerned personal information that they misrepresented. It can be argued that differences in performance and judgement may emerge if other types of lies (e.g., lies about transgressions) are employed (Levine, Kim, & Blair, 2010; cf. Hartwig & Bond, 2014; Hauch et al., 2014). Second, although some have argued that using students instead of real suspects may affect detection rates (see O’Sullivan et al., 2009), both empirical investigations and meta-analyses report that deception detection is unaffected by whether the sender is a student (Hartwig & Bond, 2014; Zhang et al., 2013), and police officers do not show better accuracy rates even in naturalistic high-stakes settings (Hartwig, 2004; Meissner & Kassin, 2002). However, using different types of senders may influence perceptions and judgements.

Presently, it is difficult to separate the effect of handcuffing on judges' perception (i.e., purely external features) from its effect on sender performance (i.e., within-sender features), as our manipulation may have affected either or both. For example, handcuffing could attenuate behavioural differences between liars and truth-tellers, resulting in poorer overall veracity discrimination. However, considering the dynamics between the interrogator and the suspects, being handcuffed could also have alerted senders to the added scrutiny and behavioural restriction, prompting them to compensate through increased impression management and produce a more convincing performance (Buller & Burgoon, 1996; Burgoon et al., 1996). The interplay between the interviewee and the interviewer is an important unknown, as some response variability may be due to the interrogator himself, given that rapport strongly influences interviewing outcomes (Abbe & Brandon, 2013).

The interrogation style used should also be considered. While we did not find any effect of probing, this element could not be explored in depth due to a lack of variability in the interrogator's use of the three probes (see SI). The literature is equivocal on whether probing impacts veracity judgements (Buller et al., 1991); nonetheless, it may affect rapport building and disclosure (Paton et al., 2018). Different probes may change the dynamics between the interrogator and the suspect, as well as the impressions of subsequent judges (e.g., biasing impressions based on the valence of the probe used during questioning). Future research could consider manipulating (e.g., standardising) the probing element to investigate how it interacts with handcuffing (e.g., Granhag & Strömwall, 2001); specific probes may bolster (e.g., negative) or attenuate (e.g., positive) the effects of handcuffing.

Finally, a more pronounced limitation is the relatively small and unbalanced sample. Underpowered studies are less likely to find true effects (i.e., Type II error), have a higher chance that the effects they do find are statistical artefacts (i.e., Type I error), inflate estimates of true effects (i.e., Type M error), and have lower replicability (Fraley & Vazire, 2014; Gelman & Carlin, 2014). For instance, the CIs around the handcuffing effect indicate that the data are compatible with a wide range of effect sizes, from large and of potential interest (ξ = 0.58) to small and potentially unimportant (ξ = 0.10). Thus, we advise readers to interpret the results with care. Still, considering the forensic-relevant sample alongside the implications of our findings (especially for miscarriages of justice), we consider, on balance, that the value of the research outweighs its drawbacks (Eckermann et al., 2010; Sterling et al., 1995).

To increase usability, we report all necessary measures of uncertainty and variability (Calin-Jageman & Cumming, 2019), permitting future hypothesis generation and integration into meta-analyses (Cumming, 2014; Fritz et al., 2012). For example, replications can use the reported effect sizes and their confidence intervals to estimate future results (e.g., prediction intervals; Cumming, 2008) and to calculate the statistical power needed to reproduce the effect (e.g., considering ξ33%; see Simonsohn, 2015).
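
As a sketch of the ξ33%-style power planning mentioned here (Simonsohn's 2015 "small telescopes" logic, translated to Cohen's d for a two-group design), the snippet below finds the effect size the original study could detect with only 33% power and then the per-group sample size a replication would need to detect that effect with 80% power. The original per-group n is a placeholder, and the translation from ξ to d is an assumption for illustration.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_original = 30  # placeholder: per-group sample size of the original study

# d33%: the effect size the original design detects with only 33% power.
d33 = analysis.solve_power(nobs1=n_original, alpha=0.05, power=1 / 3,
                           ratio=1.0, alternative="two-sided")

# Per-group n a replication needs to detect d33% with 80% power.
n_replication = analysis.solve_power(effect_size=d33, alpha=0.05, power=0.80,
                                     ratio=1.0, alternative="two-sided")

print(f"d33% = {d33:.2f}; replication needs about {n_replication:.0f} participants per group")
```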