Of differing methods, disputed estimates and discordant interpretations: the meta-analytical multiverse of brain volume and IQ associations. Jakob Pietschnig, Daniel Gerdesmann, Michael Zeiler and Martin Voracek. Royal Society Open Science, May 11 2022. https://doi.org/10.1098/rsos.211621
Abstract: Brain size and IQ are positively correlated. However, multiple meta-analyses have led to considerable differences in summary effect estimations, thus failing to provide a plausible effect estimate. Here we aim at resolving this issue by providing the largest meta-analysis and systematic review so far of the brain volume and IQ association (86 studies; 454 effect sizes from k = 194 independent samples; N = 26 000+) in three cognitive ability domains (full-scale, verbal, performance IQ). By means of competing meta-analytical approaches as well as combinatorial and specification curve analyses, we show that most reasonable estimates for the brain size and IQ link yield r-values in the mid-0.20s, with the most extreme specifications yielding rs of 0.10 and 0.37. Summary effects appeared to be somewhat inflated due to selective reporting, and cross-temporally decreasing effect sizes indicated a confounding decline effect, with three quarters of the summary effect estimations according to any reasonable specification not exceeding r = 0.26, thus contrasting with the effect sizes that were observed in some prior related, but individual, meta-analytical specifications. Brain size and IQ associations yielded r = 0.24, with the strongest effects observed for more g-loaded tests and in healthy samples; these associations generalize across participant sex and age bands.
4. Discussion
In this quantitative research synthesis, we show that positive associations of in vivo brain volume with IQ are highly reproducible. This link is consistently observable regardless of which empirical studies are included in a formal meta-analysis and how they are analysed. Results of our analyses convergently indicate that the effect strength must be assumed to be small-to-moderate in size, with the best available estimates for healthy participants in full-scale IQ ranging from r = 0.24 (uncorrected; approximately 6% explained variance) to 0.29 (corrected; approximately 8% explained variance). Effects for full-scale IQ appear to be stronger and more systematically related to moderators compared to verbal and performance IQ. However, these three intelligence domains are highly intercorrelated and their correlations with IQ test results are to be seen as manifestations of a largely similar true effect across domains. We, therefore, focus on full-scale IQ findings of healthy samples in our discussion, unless indicated otherwise.
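The explained-variance percentages quoted above follow from squaring the correlation coefficient. A minimal sketch of that arithmetic is given here; the function name is illustrative and not taken from the paper.

```python
def explained_variance_pct(r: float) -> float:
    """Percentage of variance shared by two variables correlated at r."""
    return 100.0 * r ** 2


# Uncorrected and corrected full-scale IQ estimates for healthy samples
for r in (0.24, 0.29):
    print(f"r = {r:.2f} -> {explained_variance_pct(r):.1f}% explained variance")
```

The two values come out at 5.8% and 8.4%, which round to the approximately 6% and 8% reported in the text.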
4.1. Comparisons with previous meta-analyses
The strengths of the observed summary effects in the present meta-analysis correspond closely to those identified by Pietschnig et al. [24], although the number of participants in this updated analysis is more than three times larger. The observed association for full-scale IQ in healthy samples (i.e. corresponding to the selection criteria of the meta-analyses from [25] and [23]) resulted in an estimate of r = 0.24 (95% CI [0.22; 0.27]), thus indicating considerably lower associations than those reported by Gignac & Bates [25] and McDaniel [23]. Key characteristics of the available meta-analyses are summarized in table 5.
Table 5.
Characteristics of available meta-analyses on the in vivo brain volume and intelligence link. Note. k = number of independent samples in analysis; summary effect = best estimate according to authors of meta-analysis; when both Hedges & Olkin- as well as Hunter & Schmidt-typed analyses were performed, both estimates are provided, respectively.
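The two families of estimators named in the table note differ mainly in how correlations are pooled. The sketch below contrasts a fixed-effect Hedges & Olkin-type estimate (Fisher z transformation with weights n - 3) with a bare-bones Hunter & Schmidt-type estimate (sample-size-weighted mean of raw correlations). It is a simplified illustration under stated assumptions: the function names and input values are hypothetical, and it omits the random-effects modelling and artefact corrections that a full analysis would include.

```python
import math


def hedges_olkin_fixed(rs, ns, z_crit=1.96):
    """Fixed-effect pooled correlation via Fisher's z with weights n - 3.

    Returns (pooled_r, ci_lower, ci_upper), back-transformed to the r metric.
    """
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]
    total_w = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total_w
    se = 1.0 / math.sqrt(total_w)
    return (
        math.tanh(z_bar),
        math.tanh(z_bar - z_crit * se),
        math.tanh(z_bar + z_crit * se),
    )


def hunter_schmidt_bare_bones(rs, ns):
    """Sample-size-weighted mean of raw correlations (no artefact corrections)."""
    return sum(n * r for r, n in zip(rs, ns)) / sum(ns)


# Hypothetical illustration values, NOT data from the meta-analysis.
rs = [0.35, 0.22, 0.18]
ns = [40, 120, 300]
print(hedges_olkin_fixed(rs, ns))
print(hunter_schmidt_bare_bones(rs, ns))
```

Both functions weight larger samples more heavily, so a few small studies with large correlations shift the pooled value only slightly.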
It could be argued that these inconsistencies are to a certain extent due to the differing methodological focus of the used analyses because both meta-analyses of Gignac & Bates [25] and McDaniel [23] reported values that were corrected for direct range restriction. However, when we respecified our analyses to apply identical methods, full-scale IQ associations for healthy samples once more led to a lower estimate, yielding r = 0.29. This indicates that the reported estimates of prior Hunter & Schmidt-based syntheses were inflated (i.e. even before accounting for dissemination bias).
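For orientation, one common correction for direct range restriction is Thorndike's Case II formula, which scales an observed correlation by the ratio of unrestricted to restricted standard deviations. The sketch below shows that formula only; whether the cited meta-analyses applied exactly this form is not stated here, and the function name and the ratio used in the example are assumptions.

```python
import math


def correct_direct_range_restriction(r: float, sd_ratio: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    sd_ratio is U = SD_unrestricted / SD_restricted (assumed known).
    U > 1 raises the correlation, U = 1 leaves it unchanged.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    return (sd_ratio * r) / math.sqrt(1.0 + r ** 2 * (sd_ratio ** 2 - 1.0))


# Hypothetical ratio for illustration only, not a value from the paper.
print(correct_direct_range_restriction(0.24, 1.25))
```

Because the correction is multiplicative in U, corrected estimates depend on the assumed degree of restriction, which is one reason corrected and uncorrected summary effects are reported separately.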
This idea is supported by our analyses of individual data subsets that used the very same specifications as these prior studies. For instance, Gignac & Bates [25] showed that IQ assessments with higher g-ness (i.e. reflecting abilities that are more closely related to psychometric g, thus providing a better representation of cognitive abilities) yielded larger associations than less g-loaded assessments. They concluded that the most salient estimate of the brain volume and IQ association averages r = 0.40 (i.e. corresponding to about 16% of explained variance), based on a specific subset of effect sizes that should provide the most credible results (i.e. using healthy samples, tests with excellent g-ness and attenuation-corrected effect sizes only).
None of the reasonable specifications that were included in our specification curve analysis yielded a summary effect that was larger than r = 0.37. Importantly, this most extreme upper value of all possible specifications was based on the very same inclusion criteria as the specification that is supposed to represent the best operationalization of this association according to Gignac & Bates [25] (healthy samples, excellent g-ness, range departure corrected, Hunter & Schmidt estimator), excepting sample age (this uppermost value was based on children/adolescents only; the same specification with all ages yielded r = 0.34, corresponding to 11% of explained variance). This is important for a number of reasons.
First, it shows that the specification that was chosen by Gignac & Bates [25] leads to estimates in the extreme upper tail of the distribution of reasonable summary effects. Besides yielding uncharacteristically large values, these estimates have large confidence intervals (i.e. representing higher effect volatility), because they are based on comparatively small sample numbers. Results from our combinatorial meta-analyses showed that at least 75% (i.e. the bottom three quartiles) of results yielded values below r = 0.26.
Second, these findings suggest that the estimate reported in Gignac & Bates [25] must be considered to have been inflated, even if one were to assume that this extreme specification yields the most salient estimate for the brain volume and IQ association (i.e. the summary effect in [25] exceeds the upper threshold of any estimate of the present summary effect distribution). Third, the lower summary effects in the present analyses compared to the earlier estimate of Gignac & Bates [25], when identical specifications were used, indicate that the studies that were added in the present update of the literature reported lower correlations, thus conforming to a decline effect [21,22].
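The logic of a specification curve can be sketched as follows: every combination of inclusion criteria defines one specification, each specification yields one summary effect, and the sorted summary effects form the curve. The example below is a deliberately reduced illustration with two factors and hypothetical effect sizes. The actual analysis crossed more factors (e.g. estimator, correction, age), and none of the names or numbers here come from the paper.

```python
from itertools import product

# Hypothetical effect sizes, NOT data from the meta-analysis.
# Each record: (r, n, sample type, g-ness of the IQ test)
effects = [
    (0.40, 30, "healthy", "excellent"),
    (0.31, 80, "healthy", "excellent"),
    (0.22, 150, "healthy", "fair"),
    (0.18, 200, "patient", "fair"),
    (0.12, 90, "patient", "excellent"),
]


def weighted_mean_r(records):
    """Sample-size-weighted mean correlation of the given records."""
    total_n = sum(n for _, n, _, _ in records)
    return sum(r * n for r, n, _, _ in records) / total_n


# "any" means the factor is not used as an inclusion criterion.
sample_options = ["any", "healthy", "patient"]
gness_options = ["any", "excellent", "fair"]

curve = []
for sample, gness in product(sample_options, gness_options):
    subset = [
        e for e in effects
        if (sample == "any" or e[2] == sample)
        and (gness == "any" or e[3] == gness)
    ]
    if subset:  # skip specifications without any eligible effect size
        curve.append(((sample, gness), weighted_mean_r(subset)))

# Sorting by estimate gives the specification curve, lowest to highest.
curve.sort(key=lambda item: item[1])
for spec, estimate in curve:
    print(spec, round(estimate, 3))
```

In this toy example the most restrictive "healthy, excellent g-ness" specification sits at the top of the curve and rests on the fewest participants, mirroring the pattern described above.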
Consistent with this interpretation, publication years of primary studies predicted brain volume and IQ associations negatively, indicating decreasing effect sizes over time. Cross-temporally declining effect sizes have been demonstrated to be prevalent in psychological science in general and intelligence research in particular, especially when initial study sample sizes are small [22]. This means that early primary study reports with small n (i.e. imprecise estimates) more often than not overestimate the brain size and IQ association, and have thus led to inflated meta-analytic summary effects. The presently observed effect declines and comparatively large effect estimates of early small-n studies (e.g. [5]) are consistent with the decline effect and its assumed drivers.
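A decline effect of this kind is typically examined with a precision-weighted meta-regression of effect sizes on publication year, where a negative slope indicates shrinking effects over time. The sketch below computes only the weighted least-squares slope on Fisher z-transformed correlations; the function name and all input values are hypothetical, and a full analysis would add a random-effects component and a significance test.

```python
import math


def weighted_slope(xs, ys, ws):
    """Slope of a weighted least-squares regression of ys on xs."""
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    num = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    return num / den


# Hypothetical illustration values, NOT data from the meta-analysis.
years = [1995, 2002, 2010, 2018]
rs = [0.40, 0.33, 0.25, 0.20]
ns = [25, 60, 150, 400]

zs = [math.atanh(r) for r in rs]   # Fisher z-transformed correlations
ws = [n - 3 for n in ns]           # inverse-variance weights for Fisher z
slope = weighted_slope(years, zs, ws)
print(f"change in Fisher z per publication year: {slope:.4f}")
```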
4.2. Moderators
It is unsurprising that effects were typically stronger in healthy than in patient samples because the included patients suffered from different conditions that are likely to impair cognitive functioning (e.g. autism, brain traumas, schizophrenia) which is bound to introduce statistical noise into the data. Therefore, effects of moderators were substantially weaker and less unequivocal for patients than for healthy samples.
Consistent with Gignac & Bates [25], there were stronger associations with highly g-loaded tests compared to fairly g-loaded ones in healthy participants (uncorrected rs = 0.31 versus 0.19; Q(2) = 23.69; p < 0.001), but not in patient samples. These results were supported by the findings from our regression analyses where larger g-ness positively predicted effect sizes of healthy participants.
Within any examined subgroup, correlations that had been reported within publications were numerically larger than those that had been obtained through personal communications or from the grey literature. This suggests that correlations were selectively reported in the published literature although only differences in full-scale IQ associations of healthy samples reached nominal significance. This observation is consistent with effect inflation because larger associations are more likely than smaller ones to be numerically reported in the literature (numerically stronger effects are more likely to become significant—depending on sample sizes and accuracy—and therefore more likely to be published), thus potentially leading to inadequate assumptions of the readers about the effect strength. This finding is supported by results from our regression analyses that showed weaker effects of unpublished than published effect sizes. This suggests that the reported effects in the brain size and intelligence literature are more often inflated than not, thus conforming to results from Pietschnig et al. [24].
In a similar vein, publication years were negatively related to effect sizes, thus indicating a confounding decline effect [21] and conforming to cross-temporally decreasing effect sizes as reported in an earlier meta-analysis [24].
The only further moderator with consistent directions in terms of the observed association appeared to be measurement type which consistently yielded larger estimates for intracranial than for total brain volume, although these differences did not reach nominal significance (except for verbal IQ in patient samples). There were no consistent patterns in regard to age or sex in subgroup or regression analyses, thus conforming to a previous account that indicated that brain volume and IQ associations generalize over participant age bands and sex ([24]; but see [23], for conflicting findings).
4.3. Dissemination bias
Three of our formal methods for detecting dissemination bias yielded significant bias indications for both full-scale and performance IQ (Sterne & Egger's regression, Trim-and-Fill analysis, Copas & Shi's method), while only one method (Trim-and-Fill analysis) indicated bias in verbal IQ. The evidence for bias was stronger for full-scale than for performance IQ. It should be noted that both Sterne & Egger's regression and the Trim-and-Fill analysis are funnel plot asymmetry-based methods and are consequently particularly sensitive to small-sample effects. This means that the detected bias seems to be rooted in the correspondingly large error variance of underpowered (i.e. small sample size) studies and is consistent with previously raised concerns about suboptimal power in neuroscientific research [201]. Viewed from this perspective, declining effect sizes over time appear to be somewhat reconciliatory, because this may well mean that average study power has increased in this field (or at least in studies addressing this research question).
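Funnel plot asymmetry tests of the Egger type regress effect sizes on their standard errors, weighting by inverse variance. A positive slope means that less precise (smaller) studies report larger effects. The sketch below computes that slope for Fisher z-transformed correlations. It illustrates the general idea rather than the exact procedure used in the paper, it omits the significance test, and the function name and input values are hypothetical.

```python
import math


def egger_asymmetry_slope(rs, ns):
    """Weighted regression slope of Fisher z on its standard error.

    Weights are inverse variances (1 / SE^2). A slope near zero suggests a
    symmetric funnel; a positive slope suggests small-study effects.
    """
    zs = [math.atanh(r) for r in rs]
    ses = [1.0 / math.sqrt(n - 3) for n in ns]
    ws = [1.0 / se ** 2 for se in ses]
    w_sum = sum(ws)
    se_bar = sum(w * se for w, se in zip(ws, ses)) / w_sum
    z_bar = sum(w * z for w, z in zip(ws, zs)) / w_sum
    num = sum(w * (se - se_bar) * (z - z_bar) for w, se, z in zip(ws, ses, zs))
    den = sum(w * (se - se_bar) ** 2 for w, se in zip(ws, ses))
    return num / den


# Hypothetical illustration values, NOT data from the meta-analysis:
# the smallest samples report the largest correlations.
rs = [0.45, 0.35, 0.26, 0.22]
ns = [20, 45, 150, 500]
print(egger_asymmetry_slope(rs, ns))
```

Because the standard error of Fisher z depends only on sample size, this regression is effectively a test of whether small samples yield systematically larger correlations, which is why such methods are sensitive to small-sample effects.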
The low observed replicability indices for all three domains further corroborate the evidence for effect inflation. Similarly, results of our effect estimations by means of p-value-based methods support the evidence for confounding dissemination bias, as previously observed in regard to this research question [24]. This interpretation is consistent with larger effects from published sources than from those that were obtained from the grey literature or personal communications, although these differences only reached nominal significance in meta-regressions, but not subgroup analyses.
The present findings contrast with the conclusions of Gignac & Bates [25], who did not identify bias evidence in their analysis. This discrepancy may be due to two different causes.
On the one hand, Gignac & Bates [25] included unpublished results in the publication bias detection analyses (i.e. results that [24] had obtained from the grey literature or through personal communications with authors), which (i) prevent potential bias from being detected and (ii) are conceptually unsuitable to be used in p-curve and p-uniform analyses [50,51]. On the other hand, different methods of dissemination bias detection are not equally sensitive to different forms of bias, thus necessitating a triangulation of methods for bias estimation according to current recommendations [42]. Relying on comparatively few and conceptually similar detection methods (i.e. publication bias tests of two p-value-based methods, p-curve and p-uniform, and the Henmi-Copas approach) may have contributed to the non-detection of bias evidence in this past meta-analysis [25], particularly because these methods are not suitable to detect small-sample effects.
Although the present findings indicate the presence of confounding publication bias, this should not be interpreted as evidence against a brain volume and IQ link. As pointed out above, these associations appear to generalize across numerous potential moderators and replicate well in terms of the identified direction. However, confounding dissemination bias suggests that the effects obtained in many primary studies (and even the summary effects of some meta-analyses) represent inflated estimates of the true association. It needs to be acknowledged, though, that the future development of more reliable methods for assessing IQ on the one hand, or in vivo brain volume on the other, may lead to larger correlation estimates in primary studies. Nonetheless, the strength of the brain volume and IQ association must be considered to be small-to-medium-sized at best.
4.4. Significance of the observed effect
On the one hand, the strength of the observed summary effect suggests that effects of mere neuron numbers, glial cells, or brain reserve are unlikely candidates for the explanation of between-individuals intelligence differences. On the other hand, the effect is clearly non-trivial and has turned out to be remarkably reproducible in terms of its positive direction across a large number of primary studies. Consequently, brain volume should not be seen as a supervenient (i.e. one-to-one) but rather an isomorphic (i.e. many-to-one) proxy of human intelligence. This may mean that brain volume in its own right is too coarse a measure to reliably predict intelligence differences. It seems likely that examining the role of functional aspects (e.g. white matter integrity) and more fine-grained structural elements (e.g. cortical thickness; see [2]) may help in further clarifying the neurobiological bases of human intelligence.