A
discipline-wide investigation of the replicability of Psychology
papers over the past two decades. Wu Youyou, Yang Yang, and Brian
Uzzi. Proceedings of the National Academy of Sciences, January 30,
2023, 120 (6) e2208863120. https://doi.org/10.1073/pnas.2208863120
Significance:
The number of manually replicated studies falls well below the
abundance of important studies that the scientific community would
like to see replicated. We created a text-based machine learning
model to estimate the replication likelihood for more than 14,000
published articles in six subfields of Psychology since 2000.
Additionally, we investigated how replicability varies with respect
to different research methods, authors' productivity, citation
impact, and institutional prestige, and a paper’s citation growth
and social media coverage. Our findings help establish large-scale
empirical patterns on which to prioritize manual replications and
advance replication research.
Abstract:
Conjecture about the weak replicability in social sciences has made
scholars eager to quantify the scale and scope of replication failure
for a discipline. Yet small-scale manual replication methods alone
are ill-suited to deal with this big data problem. Here, we conduct a
discipline-wide replication census in science. Our sample (N = 14,126
papers) covers nearly all papers published in the six top-tier
Psychology journals over the past 20 y. Using a validated machine
learning model that estimates a paper’s likelihood of replication,
we found evidence that both supports and refutes speculations drawn
from a relatively small sample of manual replications. First, we find
that a single overall replication rate of Psychology poorly captures
the varying degree of replicability among subfields. Second, we find
that replication rates are strongly correlated with research methods
in all subfields. Experiments replicate at a significantly lower rate
than do non-experimental studies. Third, we find that authors’
cumulative publication number and citation impact are positively
related to the likelihood of replication, while other proxies of
research quality and rigor, such as an author’s university prestige
and a paper’s citations, are unrelated to replicability. Finally,
contrary to the ideal that media attention should cover replicable
research, we find that media attention is positively related to the
likelihood of replication failure. Our assessments of the scale and
scope of replicability are important next steps toward broadly
resolving issues of replicability.
Discussion
This research uses a machine learning model that quantifies the text of a scientific manuscript to predict its replication likelihood. The model enables us to conduct the first replication census of nearly all of the papers published in Psychology’s top six subfield journals over a 20-y period. The analysis focused on estimating replicability for an entire discipline, with an interest in how replication rates vary by subfield, by experimental versus non-experimental methods, and by other characteristics of research papers. To remain grounded in human expertise, we verified the results against available manual replication data whenever possible. Together, the results provide insights that can advance replication theories and practices.
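To make the approach concrete, the following minimal sketch shows one way a text-based replication-likelihood estimator could be built. It is an illustration under assumed choices (TF-IDF features, logistic regression, and invented training texts and labels), not the published model.

```python
# Minimal sketch of a text-based replication-likelihood estimator.
# All texts, labels, and modeling choices here are illustrative
# assumptions; they are not the published model or its training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training set: manuscript texts paired with manual
# replication outcomes (1 = replicated, 0 = failed to replicate).
train_texts = [
    "We conducted a preregistered experiment with a large national sample ...",
    "An exploratory study of a small convenience sample suggested ...",
]
train_outcomes = [1, 0]

# Bag-of-words features feeding a probabilistic classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_outcomes)

# Estimated probability that a new, as-yet-unreplicated paper would replicate.
new_paper = ["We report three experiments testing a brief priming manipulation ..."]
replication_score = model.predict_proba(new_paper)[0, 1]
print(f"Estimated replication likelihood: {replication_score:.2f}")
```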
A central advantage of our approach is its scale and scope. Prior speculations about the extent of replication failure are based on relatively small, selective samples of manual replications (21). Analyzing more than 14,000 papers across multiple subfields, we showed that replication success rates differ widely by subfield. Hence, no single replication failure rate estimated from one replication project is likely to characterize all branches of a diverse discipline like Psychology. Furthermore, our results showed that subfield rates of replication success are associated with research methods. We found that experimental work replicates at significantly lower rates than non-experimental work in all subfields, and that subfields with less experimental work replicate relatively better. This finding is worrisome, given that Psychology’s strong scientific reputation is built, in part, on its proficiency with experiments.
Analyzing replicability alongside other metrics of a paper, we found that while replicability is positively correlated with researchers’ experience and competence, other proxies of research quality, such as an author’s university prestige and a paper’s citations, showed no association with replicability in Psychology. These findings highlight the need for both academics and the public to be cautious about treating pre- and post-publication metrics as proxies for research quality when evaluating research and scholars.
We also correlated media attention with a paper’s replicability. The media plays a significant role in shaping the public’s image of science and in democratizing knowledge, but it is often incentivized to report counterintuitive and eye-catching results. Ideally, media coverage would have a positive (or at least null) relationship with replication success rates in Psychology. Contrary to this ideal, however, we found a negative association between media coverage of a paper and that paper’s likelihood of replication success. Judging a paper’s merit by its media coverage is therefore unwise. It would be valuable for the media to remind audiences that new scientific results are only food for thought until future replications confirm their robustness.
We envision two possible applications of our approach. First, the machine learning model could be used to estimate replicability for studies that are difficult or impossible to replicate manually, such as longitudinal investigations or studies of special or difficult-to-access populations. Second, predicted replication scores could help prioritize manual replications of certain studies over others in the face of limited resources. Every year, individual scholars and organizations like the Psychological Science Accelerator (67) and the Collaborative Replication and Education Project (68) face the problem of choosing, from an abundance of Psychology studies, which ones to replicate. Isager and colleagues (69) proposed that, to maximize the gain from replication, the community should prioritize replicating studies that are valuable and whose outcomes are uncertain. The value of a study can readily be approximated by citation impact or media attention, but its outcome uncertainty has yet to be adequately measured across a large literature base. We suggest that our machine learning model could provide such a quantitative measure of replication uncertainty.
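As a concrete illustration of this prioritization idea, the sketch below ranks hypothetical candidate studies by a simple value-times-uncertainty score, using citations as the value proxy and the variance of a binary outcome, p(1 - p), as the uncertainty implied by a model-estimated replication score. The scoring rule and the numbers are our own illustrative assumptions, not the formula of Isager and colleagues.

```python
# Illustrative prioritization of candidate studies for manual replication,
# following the "valuable and uncertain" idea. The scoring rule
# (value x binary-outcome variance) and the numbers are assumptions,
# not Isager and colleagues' exact procedure.
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    citations: int            # proxy for the study's value
    replication_score: float  # model-estimated probability of replication

def priority(c: Candidate) -> float:
    # Uncertainty of a binary outcome peaks at p = 0.5 (variance p * (1 - p)).
    uncertainty = c.replication_score * (1.0 - c.replication_score)
    return c.citations * uncertainty

candidates = [
    Candidate("Highly cited, likely to replicate", citations=500, replication_score=0.9),
    Candidate("Highly cited, uncertain outcome", citations=500, replication_score=0.5),
    Candidate("Rarely cited, uncertain outcome", citations=10, replication_score=0.5),
]

# Studies that are both valuable and uncertain rank first.
for c in sorted(candidates, key=priority, reverse=True):
    print(f"{priority(c):8.1f}  {c.title}")
```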
We note that our findings were limited in several ways. First, all papers for which we made predictions came from top-tier journals. Future research could examine papers from lower-ranked journals and how their replicability associates with pre- and post-publication metrics (70). Second, the estimates of replicability are only approximate. At the subfield level, five of the six subfields in our analysis were represented by only one top journal, and a single journal does not capture the scope of an entire subfield. Future research could expand the coverage to multiple journals per subfield or cross-check the subfield pattern using other methods (e.g., prediction markets). Third, the training sample used to develop the model drew on nearly all of the manual replication data available, yet it still lacked direct manual replications for certain psychology subfields. While we conducted a series of transfer learning analyses to ensure the model’s applicability beyond the scope of the training sample, applying the model in the subfields of Clinical Psychology and Developmental Psychology, where actual manual replication studies are scarce, should be done judiciously. For example, when estimating a paper’s replicability, we advise users to also review the paper’s other indicators of replicability, such as the original study statistics, aggregated expert forecasts, or prediction markets. Nevertheless, our model can continue to be improved as more manual replication results become available.
Future research could go in several directions: 1) our replication scores could be combined with other methods like prediction markets (16) or non-text-based machine learning models (27, 28) to further refine estimates for Psychology studies (see the illustrative sketch below); 2) the design of the study could be repeated to conduct replication censuses in other disciplines; and 3) the replication scores could be further correlated with other metrics of interest.
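As a minimal illustration of the first direction, the sketch below pools a text-based replication score with an estimate from another source, such as a prediction-market probability, through a weighted average. The sources, weights, and numbers are placeholders rather than values proposed in this paper.

```python
# Sketch of pooling a text-based replication score with estimates from
# other sources (e.g., a prediction market). The weights and numbers
# are placeholders, not values suggested by the paper.
def combined_estimate(estimates: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of replication-probability estimates."""
    total = sum(weights[name] for name in estimates)
    return sum(estimates[name] * weights[name] for name in estimates) / total

estimates = {
    "text_model": 0.62,         # score from the text-based model
    "prediction_market": 0.55,  # market-implied replication probability
}
weights = {"text_model": 1.0, "prediction_market": 2.0}

print(f"Combined replication estimate: {combined_estimate(estimates, weights):.2f}")
```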
The replicability of science, which in the social sciences is particularly constrained by variability, is ultimately a collective enterprise improved by an ensemble of methods. In his book The Logic of Scientific Discovery, Popper argued that “we do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them” (1). However, as true as Popper’s insight about repetition and repeatability is, it must be recognized that replication tests come at a cost to exploration. Machine learning methods paired with human acumen present an effective approach for developing a better understanding of replicability; the combination balances the costs of testing against the rewards of exploration in scientific discovery.