Comparing meta-analyses and preregistered multiple-laboratory replication projects. Amanda Kvarven, Eirik Strømland & Magnus Johannesson. Nature Human Behaviour, December 23 2019. https://www.nature.com/articles/s41562-019-0787-z
Abstract: Many researchers rely on meta-analysis to summarize research evidence. However, there is a concern that publication bias and selective reporting may lead to biased meta-analytic effect sizes. We compare the results of meta-analyses to large-scale preregistered replications in psychology carried out at multiple laboratories. The multiple-laboratory replications provide precisely estimated effect sizes that do not suffer from publication bias or selective reporting. We searched the literature and identified 15 meta-analyses on the same topics as multiple-laboratory replications. We find that meta-analytic effect sizes are significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost three times as large as replication effect sizes. We also implement three methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.
From the open version on OSF.io (17 studies at that stage; 15 studies in the final version):
Discussion
To summarize our findings, we find that there is a significant difference between the meta-analytic effect size and the replication effect size for 12 of the 17 studies (70.6%), and suggestive evidence for a difference in two additional studies. These differences are systematic: the meta-analytic effect size is larger than the replication effect size for all these studies, and on average across all 17 studies the estimated effect sizes are about 3 times as large in the meta-analyses. Interestingly, the relative difference in estimated effect sizes is of at least the same magnitude as that observed between replications and original studies in the RP:P and other similar systematic replication projects5,6,10. Publication bias and selective reporting in original studies have been suggested as possible reasons for the low reproducibility in the RP:P and other replication projects, and our results suggest that these biases are not eliminated by the use of meta-analysis.
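As an illustration of the kind of pairwise comparison summarized above, the sketch below applies a simple z-test to the difference between a meta-analytic and a replication effect size, assuming the two estimates are independent and expressed as Cohen's d with known standard errors. The numbers are placeholders and the paper's own statistical procedure may differ.

```python
from math import sqrt
from scipy.stats import norm

def compare_effects(d_meta, se_meta, d_rep, se_rep):
    """Two-sided z-test for the difference between a meta-analytic
    and a replication effect size (assumed independent)."""
    diff = d_meta - d_rep
    se_diff = sqrt(se_meta ** 2 + se_rep ** 2)
    z = diff / se_diff
    p = 2 * (1 - norm.cdf(abs(z)))
    return diff, z, p

# Hypothetical numbers for one meta-analysis/replication pair
diff, z, p = compare_effects(d_meta=0.50, se_meta=0.08, d_rep=0.15, se_rep=0.03)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p:.4f}")
```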
To test further whether meta-analyses reduce the influence of publication bias or selective reporting, we compare the average unweighted effect size of the original studies to the meta-analyses. We were able to obtain effect sizes of the original studies converted to Cohen's d for all original studies except one, for which the standard deviation was unavailable41. We were additionally able to compute a valid standard error for 14 out of 17 original studies. The average unweighted effect size of these 14 original studies is 0.561, which is about 42% higher than the average unweighted effect size of 0.395 of the same 14 studies in the meta-analyses. These point estimates are consistent with meta-analyses somewhat reducing the effect sizes estimated in original studies, and in formal meta-analytic models the estimated difference between the original effect and the summary effect in the meta-analysis varies between 0.089 and 0.166. These estimated differences are not statistically significant, but they are suggestive of a difference in all three cases using our criterion for statistical significance (see Supplementary Table 3 for details). Further work on larger samples is needed to test more conclusively whether meta-analytic effect sizes differ from original effect sizes.
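A quick back-of-the-envelope check of the relative difference quoted above; the two averages are taken from the text, and the calculation is purely illustrative rather than the formal meta-analytic models mentioned above:

```python
# Averages quoted in the text for the 14 studies with valid standard errors
d_original_mean = 0.561   # average unweighted original-study effect size
d_meta_mean = 0.395       # average unweighted meta-analytic effect size

relative_increase = d_original_mean / d_meta_mean - 1
raw_difference = d_original_mean - d_meta_mean
print(f"Original studies exceed meta-analyses by {relative_increase:.0%}")  # ~42%
print(f"Unweighted difference in averages: {raw_difference:.3f}")           # 0.166
```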
In a previous related study in medicine, 12 large randomized controlled trials published in four leading medical journals were compared to 19 meta-analyses published previously on the same topics24. They compared several clinical outcomes between the studies and found a significant difference between the meta-analyses and the large clinical trials for 12% of the comparisons. They did not provide any results for the pooled overall difference between meta-analyses and large clinical trials, but from graphically inspecting the results there does not appear to be a sizeable systematic difference. Those previous results for medicine are thus different from our findings. This could reflect a genuine difference between psychology and medicine, but it could also reflect that even large clinical trials in medicine are subject to selective reporting or publication bias, or that large clinical trials with null results are published in less prestigious journals.
Although we believe the most plausible interpretation of our results is that meta-analyses overestimate effect sizes on average in our sample of studies, there are other possible explanations. In testing a specific scientific hypothesis in an experiment there can be heterogeneity in the true effect size due to several sources. The true effect size can vary between different populations (sample heterogeneity) and between different experimental designs used to test the hypothesis (design heterogeneity). If the exact statistical test used or the inclusion/exclusion criteria for the observations included in the analysis differ, this yields a third source of heterogeneity in estimated effect sizes (test heterogeneity). In the multiple-lab replications included in our study the design and the statistical tests used are held constant across the labs, whereas the samples vary across labs. The effect sizes across labs will therefore vary due to sample heterogeneity, but not due to design or test heterogeneity. In the meta-analyses the effect sizes can vary across the included studies due to sample, design and test heterogeneity. Sample, design or test heterogeneity could potentially explain our results.
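One way to make the three sources explicit is a schematic decomposition of the true effect size in study (or lab) i; the notation below is introduced here for illustration and is not taken from the paper:

```latex
\theta_i \;=\; \mu \;+\; s_i \;+\; d_i \;+\; t_i
```

Here mu is the average true effect, s_i captures sample heterogeneity, d_i design heterogeneity and t_i test heterogeneity. In the multiple-laboratory replications the design and the statistical test are held fixed, so d_i and t_i are constant across labs and observed effect sizes vary only through s_i plus sampling error; in a meta-analysis all three terms can differ across the included studies.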
For sample heterogeneity to explain our results, the replications need to have been conducted in samples with, on average, lower true effect sizes than the samples included in the studies in the meta-analyses. We find this explanation for our results implausible. The Many Labs studies estimate the sample heterogeneity and only find small or moderate heterogeneity in effect sizes7-9. In the recent Many Labs 2 study the average heterogeneity, measured as the standard deviation in the true effect size across labs (Tau), was 0.048. This can be compared to the measured difference in meta-analytic and replication effect sizes in our study of 0.232-0.28 for the three methods.
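To see the scale of this comparison, the observed gap can be expressed in units of the estimated between-lab standard deviation; a back-of-the-envelope calculation using the numbers quoted above:

```python
tau_ml2 = 0.048                    # between-lab SD of the true effect (Many Labs 2)
gap_low, gap_high = 0.232, 0.28    # meta-analytic minus replication effect size

print(f"The gap corresponds to {gap_low / tau_ml2:.1f} to "
      f"{gap_high / tau_ml2:.1f} between-lab standard deviations")
# Roughly 4.8 to 5.8 SDs: the replication samples would have to be extreme
# outliers for sample heterogeneity alone to account for differences this large.
```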
For design or test heterogeneity to explain our results, it must be the case that replication studies select experimental designs or tests producing lower true effect sizes than the average design and test used to test the same hypotheses in the meta-analyses. For this to explain our results, the design and test heterogeneity in meta-analyses would have to be substantial and the “replicator selection” of weak designs would need to be strong. This potential explanation would imply a high correlation between design and test heterogeneity in the meta-analyses and the observed difference in the meta-analytic and replication effect sizes, as larger design and test heterogeneity increases the scope for “replicator selection”. To further shed
some light on this possibility we were able to obtain information about the standard deviation
in true effect sizes across studies (Tau) for ten of the meta-analyses in our sample; Tau was
reported directly for two of these meta-analyses and sufficient information was provided in the
other eight meta-analyses so that we could estimate Tau. The mean Tau was 0.30 in these ten
meta-analyses with a range from 0.00 to 0.735. This is likely to be an upper bound on the design
and test heterogeneity as the estimated Tau also includes sample heterogeneity. While this is
consistent with a sizeable average design and test heterogeneity in the meta-analyses, it also
needs to be coupled with strong “replicator selection” to explain our results. To test for this, we
estimated the correlation between the Tau of these ten meta-analyses and the difference in the
meta-analytic and replication effect sizes. The Spearman correlation was -0.1879 (p=0.6032) and the Pearson correlation was -0.3920 (p=0.2626), showing no sign that the observed differences in effect sizes are related to the scope for “replicator selection”. In fact, the estimated correlation is in the opposite direction to that predicted by the “replicator selection” mechanism. This tentative finding departs from a recent meta-research paper that attributes reproducibility failures in psychology to heterogeneity in the underlying effect sizes25. Further work with larger samples is needed to test more rigorously for “replicator selection”. It should also be noted that the pooled replication rate across Many Labs 1-3 is 53%, which is in line with the replication rate observed in three large-scale systematic replication projects that should not be prone to “replicator selection” (the Reproducibility Project: Psychology10, the Experimental Economics Replication Project5 and the Social Sciences Replication Project6). This suggests no substantial “replicator selection” in the Many Labs studies that form the majority of our sample.
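The correlation test described above can be sketched as follows; the arrays are placeholders standing in for the ten Tau estimates and the corresponding differences in effect sizes, which are reported in the paper's supplementary materials:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Placeholder data for the ten meta-analyses with an available Tau estimate:
# tau  = estimated between-study SD of the true effect in each meta-analysis
# diff = meta-analytic effect size minus replication effect size
tau = np.array([0.00, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.735])
diff = np.array([0.30, 0.25, 0.10, 0.35, 0.05, 0.20, 0.15, 0.40, 0.00, 0.10])

rho, p_spearman = spearmanr(tau, diff)
r, p_pearson = pearsonr(tau, diff)
print(f"Spearman rho = {rho:.4f} (p = {p_spearman:.4f})")
print(f"Pearson r    = {r:.4f} (p = {p_pearson:.4f})")
# Under "replicator selection", larger Tau (more scope to pick weak designs)
# should go with larger differences, i.e. a positive correlation; the paper
# reports negative correlations instead.
```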
Another caveat about our results concerns the representativeness of our sample. The inclusion of studies was limited by the number of pre-registered multiple-lab replications carried out so far, and by whether we could find a matching meta-analysis for each of them. Our sample of 17 studies should thus not be viewed as representative of meta-analyses in psychology or in other fields. In particular, the relative effect size of the original studies compared with the replication studies in our sample is somewhat larger than that observed in previous replication projects5,6,10, indicating that our sample could be a select sample of psychological studies in which selective reporting is particularly prominent. In the future the number of studies using our methodology can be extended as more pre-registered multiple-lab replications become available and as the number of meta-analyses continues to increase. We also encourage others to test our methodology for evaluating meta-analyses on an independent sample of studies.
We conclude that meta-analyses produce substantially larger effect sizes than replication studies in our sample. This difference is largest for replication studies that fail to reject the null hypothesis, which is in line with recent arguments about a high false-positive rate of meta-analyses in the behavioral sciences20. Our findings suggest that meta-analysis is ineffective in fully adjusting inflated effect sizes for publication bias and selective reporting. A potentially effective policy for reducing publication bias and selective reporting is pre-registering analysis plans prior to data collection. There is currently a strong trend towards increased pre-registration in psychology22. This has the potential to increase the credibility both of original studies and of meta-analyses, making meta-analysis a more valuable tool for aggregating research results. Future meta-analyses may thus produce effect sizes that are closer to the effect sizes in replication studies.