Calibration of scientific reasoning ability. Caitlin Drummond Otten, Baruch Fischhoff. Journal of Behavioral Decision Making, November 4, 2022. https://doi.org/10.1002/bdm.2306
Abstract: Scientific reasoning ability, the ability to reason critically about the quality of scientific evidence, can help laypeople use scientific evidence when making judgments and decisions. We ask whether individuals with greater scientific reasoning ability are also better calibrated with respect to that ability, comparing calibration for skill with the more widely studied calibration for knowledge. In three studies, participants (Study 1: N = 1022; Study 2: N = 101; and Study 3: N = 332) took the Scientific Reasoning Scale (SRS; Drummond & Fischhoff, 2017), comprising 11 true–false problems, and provided confidence ratings for each problem. Overall, participants were overconfident, reporting mean confidence levels that were 22.4–25% higher than their percentages of correct answers; calibration improved with score. Study 2 found similar calibration patterns for the SRS and another skill, the Cognitive Reflection Test (CRT), measuring the ability to avoid intuitive but incorrect answers. SRS and CRT scores were both associated with success at avoiding negative decision outcomes, as measured by the Decision Outcomes Inventory; confidence on the SRS, above and beyond scores, predicted worse outcomes. Study 3 added an alternative measure of calibration, asking participants to estimate the number of items answered correctly. Participants were less overconfident by this measure. SRS scores predicted correct usage of scientific information in a drug facts box task and holding beliefs consistent with the scientific consensus on controversial issues; confidence, above and beyond SRS scores, predicted worse drug facts box performance but stronger science-consistent beliefs. We discuss the implications of our findings for improving science-relevant decision-making.
5 GENERAL DISCUSSION
Across three studies, we find that people with greater ability to evaluate scientific evidence, as measured by scores on the SRS, have greater metacognitive ability to assess that skill. Using a confidence elicitation paradigm common to studies of confidence in knowledge, we found that individuals with low SRS scores greatly overestimate their skill, while those with high scores slightly underestimate theirs, a pattern previously found with confidence in beliefs (e.g., Kruger & Dunning, 1999; Lichtenstein et al., 1982; Lichtenstein & Fischhoff, 1977; Moore & Healy, 2008).
Study 2 replicated these patterns with confidence in SRS skills and with calibration for another skill-based task, the CRT, which assesses the ability to avoid immediately appealing but incorrect “fast lure” answers and then find correct ones (Attali & Bar-Hillel, 2020; Frederick, 2005; Pennycook et al., 2017). As a test of external validity, Study 2 also found that people with better SRS scores had better scores on the DOI, a self-report measure of avoiding negative decision outcomes (Bruine de Bruin et al., 2007); confidence on the SRS, controlling for knowledge, was associated with worse DOI scores.
Study 3 replicated the results of Studies 1 and 2, asking participants how confident they were in each SRS answer; we refer to this measure as “local calibration.” It also assessed “global calibration,” derived from asking participants how many items they thought they had answered correctly. Overconfidence was much smaller with the global measure, as found elsewhere (e.g., Ehrlinger et al., 2008; Griffin & Buehler, 1999; Stone et al., 2011). This finding suggests that global calibration may be more appropriate for situations where individuals reflect and act on a set of tasks, rather than act on tasks one by one. However, in an experimental setting, it may also convey a demand characteristic, with an implicit challenge to be less confident (if interpreted as, “How many do you think that you really got right?”), artifactually reducing performance estimates and overconfidence.
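To make the two measures concrete, the minimal sketch below (in Python, with made-up numbers for a single hypothetical respondent; it is not the authors' scoring code) computes local overconfidence as mean per-item confidence minus percentage correct, and global overconfidence as the whole-test self-estimate minus the actual number correct, rescaled to the same percentage scale.

    # Minimal sketch of the two calibration measures described above (not the
    # authors' scoring code); all values are made-up data for one respondent.

    # Per-item confidence ratings (50-100%) and correctness on an 11-item
    # true-false scale such as the SRS.
    confidence = [90, 75, 60, 100, 80, 55, 70, 95, 65, 85, 50]  # percent
    correct = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0]                 # 1 = correct

    n_items = len(confidence)

    # "Local" overconfidence: mean per-item confidence minus percentage correct.
    mean_confidence = sum(confidence) / n_items
    percent_correct = 100 * sum(correct) / n_items
    local_overconfidence = mean_confidence - percent_correct    # about 20.5 points

    # "Global" overconfidence: the respondent's single whole-test estimate of
    # the number of items answered correctly, minus the actual number correct,
    # expressed in percentage points.
    estimated_correct = 7                                        # hypothetical estimate
    global_overconfidence = 100 * (estimated_correct - sum(correct)) / n_items  # about 9.1 points

    print(local_overconfidence, global_overconfidence)

In this illustrative case, the global measure yields a smaller overconfidence estimate than the local one, mirroring the pattern reported above.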
Study 3 also included additional measures of construct and external validity. As a test of construct validity, we found that global confidence, controlling for scores, was unrelated to a self-report measure of intellectual humility (Leary et al., 2017), and local confidence, controlling for scores, was unexpectedly positively related to self-reported intellectual humility. These findings may reflect the limitations of self-report measures, including a desirability bias in reporting.
Study 3 further found that SRS scores predicted performance on two science-related tasks: extracting information from a drug facts box (Woloshin & Schwartz, 2011) and holding beliefs consistent with the scientific consensus (as in previous work, Drummond & Fischhoff, 2017). However, confidence, controlling for knowledge, played different roles for these outcomes: It was negatively associated with scores on the drug facts box test, but positively associated with holding beliefs consistent with science on controversial issues. These findings suggest that while those with greater confidence in their scientific reasoning ability may also be more confident in their beliefs on scientific issues, confidence that is out of step with knowledge may hinder decision-making. Neither scores nor confidence was related to self-reported adoption of pandemic health behaviors, perhaps reflecting partisan divisions that reduce the role of individual cognition (e.g., Bruine de Bruin et al., 2020). Future work could examine the role of confidence, above and beyond knowledge, in other science-relevant judgments and decisions, including susceptibility to pseudoscientific claims or products.
Individuals' metacognitive understanding of the extent of their knowledge has been related to life events in many domains (Bruine de Bruin et al., 2007; Parker et al., 2018; Peters et al., 2019; Tetlock & Gardner, 2015). Overall, we find that unjustified confidence (Parker & Stone, 2014) in scientific reasoning ability, as reflected in self-reported confidence in the correctness of one's answers adding predictive value beyond SRS scores (Drummond & Fischhoff, 2017), is associated with reduced avoidance of negative outcomes and worse performance on tasks that require using scientific information, but greater acceptance of the scientific consensus on controversial issues. Unlike Peters et al. (2019), who found that mismatches between skill and confidence were associated with worse outcomes, we found that unjustified confidence (measured both locally and globally) was associated similarly with outcomes at all levels of reasoning ability. These findings may reflect differences between numeracy and scientific reasoning, differences between the studies' measures of confidence and outcomes, or interactions too weak to be detected with the statistical power of the present research. Our findings may also reflect our measures' range restrictions: Here, confidence was elicited as expected performance, thus restricting the extent to which participants with very low or high performance could display underconfidence or overconfidence, respectively. Future work could seek other measures that could further separate the respective contributions of scientific reasoning ability and metacognition about it, such as a subjective scientific reasoning ability scale similar to the Subjective Numeracy Scale (Fagerlin et al., 2007).
Overall, we observed patterns of metacognition for cognitive skills similar to those observed for beliefs, using conventional confidence elicitation methods with known artifacts. Prior research has proposed a variety of methods for measuring overconfidence, with varying strengths and limitations. We discuss several key limitations below; for further discussion of these measurement issues, we refer readers to Lichtenstein and Fischhoff (1977), Erev et al. (1994), Moore and Healy (2008), Fiedler and Unkelbach (2014), Parker and Stone (2014), and Yates (1982).
The dramatically poor performance of the lowest quartile, for both the SRS and the CRT, is notable. As the groups were identified based on SRS and CRT scores, some of the spread is artifactual (as noted by Lichtenstein & Fischhoff, 1977, and others). One known artifact is the truncated 50–100% response mode, which precludes perfect calibration for participants who answer fewer than 50% of the SRS questions correctly (N = 444 [43% of respondents] in Study 1; N = 34 [34%] in Study 2; and N = 148 [45%] in Study 3). In a post hoc analysis, we treated these respondents as though they had answered 50% of the questions correctly. Even with this adjustment, they remained overconfident, by an average of 29.8% in Study 1 (SD = 10), 31.5% in Study 2 (SD = 11), and 30.5% in Study 3 (SD = 10).
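A brief sketch (Python, with illustrative numbers only, not study data) of the truncation artifact and the post hoc adjustment just described: on a 50–100% confidence scale, a respondent whose accuracy falls below 50% cannot be perfectly calibrated, so accuracy is treated as if it were 50% before computing overconfidence.

    # Illustrative sketch of the truncation artifact and the post hoc
    # adjustment described above; the numbers are hypothetical, not study data.

    mean_confidence = 78.0  # mean confidence on the truncated 50-100% scale
    percent_correct = 36.0  # actual accuracy, below the scale's 50% floor

    # Raw overconfidence partly reflects the artifact: even a respondent who
    # reported the scale minimum (50%) on every item could not match an
    # accuracy below 50%.
    raw_overconfidence = mean_confidence - percent_correct                  # 42.0 points

    # Post hoc adjustment: treat accuracy as if it were 50%, the lowest value
    # a perfectly calibrated respondent could express on this scale.
    adjusted_overconfidence = mean_confidence - max(percent_correct, 50.0)  # 28.0 points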
One limitation of these results is that the SRS or CRT tests might have demand effects atypical of real-life tests of cognitive skills, such that participants assume that an experimental task would not be as difficult as these proved to be or want to appear knowledgeable in this setting (Fischhoff & Slovic, 1980). A second possible limitation is that reliance on imperfectly reliable empirical measures somehow affects the patterns of correlations and not just the differences between the groups (Erev et al., 1994; Lichtenstein & Fischhoff, 1977). Attempts to correct for such unreliability have had mixed results (Ehrlinger et al., 2008; Krueger & Mueller, 2002; Kruger & Dunning, 2002). Third, task incentives were entirely intrinsic; conceivably, if clearly explained, calibration-based material rewards might have improved performance. Here, too, prior results have been mixed (Ehrlinger et al., 2008; Mellers et al., 2014). Fourth, our measure of science education, whether participants had taken a college science course, may have been too coarse to detect a latent relationship. Fifth, for Study 2, some participants may have seen the CRT items before (Haigh, 2016; Thomson & Oppenheimer, 2016), potentially increasing their scores (Bialek & Pennycook, 2018), with uncertain effects on confidence and calibration. Finally, our version of the CRT, which asked participants to choose between the fast lure and the correct answer, produced higher scores than the usual open-ended response mode and hence might not generalize to other CRT research.
If our results regarding the similarity between calibration for cognitive skill and knowledge prove robust, future work might seek to improve public understanding of science (e.g., Bauer et al., 2007; Miller, 1983, 1998, 2004) by addressing separately the ability to think critically and the need to stop and think critically. If people are as overconfident in their scientific reasoning ability as many participants were here, it may not be enough to correct erroneous beliefs through science communication and education (e.g., Bauer et al., 2007; Miller, 1983, 1998, 2004). People may also need help in reflecting on the limits to their ability to evaluate evidence and their potential vulnerability to manipulation by misleading arguments as well as by misleading evidence. Mental models approaches to science communication offer one potential strategy, by affording an intuitive feeling for how complex processes work (e.g., Bruine de Bruin & Bostrom, 2013; Downs, 2014). The inoculation approach to combating misinformation (Cook et al., 2017; van der Linden et al., 2017) offers another potential strategy, refuting misinformation in advance, so that people have a better feeling for when and how to think about the issues and when and how they can be deceived. Developing effective interventions requires research examining the separate contributions of scientific reasoning ability and metacognition to improving science-relevant judgments and decisions.