Wednesday, July 22, 2020

Longitudinal studies do not appear to support substantive long-term links between aggressive game content & youth aggression; links appear better explained by methodological weaknesses & researcher expectancy effects

Do longitudinal studies support long-term relationships between aggressive game play and youth aggressive behaviour? A meta-analytic examination. Aaron Drummond, James D. Sauer and Christopher J. Ferguson. Royal Society Open Science, July 22, 2020. https://doi.org/10.1098/rsos.200373

Abstract: Whether video games with aggressive content contribute to aggressive behaviour in youth has been a matter of contention for decades. Recent re-evaluation of experimental evidence suggests that the literature suffers from publication bias, and that experimental studies are unable to demonstrate compelling short-term effects of aggressive game content on aggression. Long-term effects may still be plausible, if less-systematic short-term effects accumulate into systematic effects over time. However, longitudinal studies vary considerably in regard to whether they indicate long-term effects or not, and few analyses have considered what methodological factors may explain this heterogeneity in outcomes. The current meta-analysis included 28 independent samples including approximately 21 000 youth. Results revealed an overall effect size for this population of studies (r = 0.059) with no evidence of publication bias. Effect sizes were smaller for longer longitudinal periods, calling into question theories of accumulated effects, and effect sizes were lower for better-designed studies and those with less evidence for researcher expectancy effects. In exploratory analyses, studies with more best practices were statistically indistinguishable from zero (r = 0.012, 95% confidence interval: −0.010, 0.034). Overall, longitudinal studies do not appear to support substantive long-term links between aggressive game content and youth aggression. Correlations between aggressive game content and youth aggression appear better explained by methodological weaknesses and researcher expectancy effects than true effects in the real world.


4. Discussion

Experimental investigations of the short-term effects of aggressive game content on player aggression produce inconsistent results [2]. As can now be seen, both an initial meta-analysis without much consideration of methodological moderators [9] and the current, updated meta-analysis suggest that effects fall below the r = 0.10 benchmark for a small effect. Publication bias indicators yielded no evidence of publication bias. Thus, current research is unable to support the hypothesis that violent video games have a meaningful long-term predictive impact on youth aggression. However, a number of findings merit more explicit consideration.
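To make concrete what a publication bias check of this kind typically involves, here is a minimal sketch of Egger's regression test on hypothetical study-level correlations and sample sizes (not the estimates analysed in the paper); numpy and statsmodels are assumed to be available, and an intercept near zero would be consistent with the "no evidence of publication bias" pattern reported here.

import numpy as np
import statsmodels.api as sm

# Hypothetical per-study correlations and sample sizes (illustration only)
r = np.array([0.04, 0.10, 0.02, 0.08, 0.06, 0.03, 0.09, 0.05])
n = np.array([500, 300, 1200, 250, 800, 1500, 400, 900])

z = np.arctanh(r)            # Fisher z transform of each correlation
se = 1.0 / np.sqrt(n - 3)    # standard error of Fisher z
snd = z / se                 # standardized effect (standard normal deviate)
precision = 1.0 / se

# Egger's test: regress the standardized effect on precision;
# an intercept reliably different from zero suggests funnel-plot asymmetry.
fit = sm.OLS(snd, sm.add_constant(precision)).fit()
print("Egger intercept:", fit.params[0], "p-value:", fit.pvalues[0])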

4.1. How to interpret weak effects

First, as noted, the overall effect of aggressive game content on behavioural aggression was below our preregistered cut-off for a practically meaningful effect (and the traditional cut-off for an effect to be considered small). This brings us to acknowledge one weakness of meta-analysis in general, namely the focus on statistical significance. We observe that, for years, scholars have acknowledged that ‘statistical significance’ is a poor benchmark for theory support [37], yet psychologists often naively rely upon it when making decisions. We argue that, particularly in highly powered analyses such as meta-analysis, the concept of statistical significance becomes irrelevant as almost everything is statistically significant. Small effects, even where statistically significant, are often explained through methodological noise such as common method variance, demand characteristics or single-responder bias. Indeed, in our study we find that effect sizes are largely inflated through issues such as poorly standardized and validated measures of both aggression and violent game content. As such, relying on ‘statistical significance’ can give scholars an inflated sense of confidence in their hypotheses and render the concept of ‘effect size’ little more than window dressing, where any effect size, no matter how small, can be interpreted as supporting the hypothesis.
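As a back-of-the-envelope illustration of this point (our own, using only the sample size and pooled estimate quoted above): with roughly 21 000 participants, even correlations far below the r = 0.10 benchmark comfortably clear the conventional p < 0.05 bar, which is precisely why significance alone is uninformative at this scale.

import numpy as np
from scipy import stats

def p_for_r(r, n):
    """Two-sided p-value for a Pearson correlation r in a sample of size n."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

for r in (0.02, 0.059, 0.10):
    print(f"r = {r}: p = {p_for_r(r, 21000):.2g}")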
We acknowledge that our adoption of the r = 0.10 standard is likely to stimulate debate, which we believe to be important and welcome. Although we adopted the 0.10 standard suggested by Przybylski & Weinstein [10], one of the authors has previously suggested that an even higher standard of 0.20 may be necessary for greater confidence in the validity of effects [38], though the origins of such concerns about over-reliance on statistical significance and over-interpretation of weak effects stretch back decades. As expressed by Lykken [39, p. 153], ‘the effects of common method are often as strong as or stronger than those produced by the actual variables of interest’. This raises the question of the degree to which we can have confidence that observed effect sizes reflect the relationship of interest as opposed to research artefacts. To be fair, some scholars do argue for interpretation of much lower effect sizes, such as r = 0.05 [40], though it is important to note a key phrase in this argumentation: ‘Our analysis is based on a presumption that the effect size in question is, in fact, reliably estimated’ [40, p. 163]. Our observation is that this assumption appears to have been demonstrated to be false for this field of research and, with that in mind, a higher threshold of scrutiny is warranted. Funder and Ozer's argument also relies on effects accumulating over time, whereas our analysis found the opposite: longer time-intervals were associated with smaller effect sizes. Our concerns are less about the issue that some effects may be of trivial importance (though there is that), but rather that some observed effect sizes do not index genuine effects of interest at all, being instead the product of systematic methodological limitations. Naturally, we do not suggest that our r = 0.10 threshold is the end of this debate. Further data may suggest that this number needs to be revised either upwards or downwards (we suspect the former is more likely than the latter). Standards may need to be flexible given differences in rigour across fields, or even across prior assumptions about the size of effect one expects to see. For example, there is a rough precedent for r = 0.10 from another meta-analysis of aggression and empathy wherein weak effect sizes (r = 0.11) were interpreted as not supporting the hypothesis [41]. The authors note this could be owing to either weaknesses in the theory or measurement problems (or both), and we agree those are both worthwhile issues to consider. Our observation that standardized and validated measures tend to produce weaker effects in this field, however, would appear to diminish the plausibility of measurement attenuation as a driving factor behind the observed weak effect sizes. This cautious interpretation of weak effect sizes also has precedent among meta-analyses of violent games. For instance, an earlier meta-analysis found larger effect sizes (r = 0.15) but, based on methodological and theoretical issues identified in the field, interpreted this as non-convincing [42].
The adoption of the r = 0.10 standard also appears consistent with the ‘smallest effect size of interest’ (SESOI) approach. From this perspective, an SESOI can be developed based on multiple criteria, including what is theoretically relevant, what prior literature has suggested is an important effect size, what effect sizes are considered important based on established (though ultimately arbitrary) criteria, and the degree to which resources may be expended on an effect without seeing tangible outcomes [43]. From this approach, we can see the r = 0.10 standard is defensible. Both Orben and Przybylski and, earlier, Cohen [21] apply the 0.10 standard (though we acknowledge other scholars endorse either higher or lower standards). Further, previous meta-analyses have suggested effects should be in the range of 0.20–0.30 [44], so any observed effect under 0.10 would represent an appreciable decline in effect size. Lastly, as we observe, significant methodological issues have the potential to inflate effect size estimates; as also noted by Przybylski and Weinstein, setting an interpretive threshold can help reduce misinterpretation of weak, possibly misleading results. We note that our confidence interval does not cross the 0.10 threshold and, as such, we are confident that the threshold for interpreting the longitudinal data as practically meaningful has not been met.
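A minimal sketch of how such a SESOI check can be run, assuming a pooled correlation and a standard error on the Fisher z scale (the standard error below is an assumed value chosen only so that the example reproduces the reported best-practices interval of r = 0.012, 95% CI [−0.010, 0.034]; it is not a figure taken from the paper):

import numpy as np
from scipy import stats

def ci_below_sesoi(r_pooled, se_z, sesoi=0.10, alpha=0.05):
    """Back-transform a Fisher-z confidence interval to r and test it against a SESOI."""
    z = np.arctanh(r_pooled)
    crit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh(z - crit * se_z), np.tanh(z + crit * se_z)
    return (lo, hi), hi < sesoi   # True if the whole interval sits below the SESOI

# se_z is an assumed value, chosen to reproduce the reported interval
print(ci_below_sesoi(0.012, se_z=0.0112))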
These debates regarding the interpretation of small effect sizes exist in other realms as well. This is particularly true when large samples may result in many ‘statistically significant’ relationships between variables that bear little actual relationship to each other. For instance, one recent study linked emotional diversity to mental and physical health in two samples totalling 37 000, relying on effect sizes generally in the range of r = 0.10 and lower (some as low as r = 0.02) [45]. Reflecting our concerns here, this interpretation was criticized by other scholars who argued such weak findings were more likely the product of statistical artefacts than genuine effects of interest [46]. Regarding the potential perils of misleading results in large samples, the first author of that critique states (N. Brown 2020, personal communication): ‘A large sample size is a good thing, but only if used to improve the precision of effect estimates, not to prove that cats equal dogs, p < 0.05’. We agree with this assessment. Naturally, our critique is not of the use of large samples, which we wholeheartedly endorse, but rather of the lack of consideration for potential statistical ‘noise’ (demand characteristics, single-responder bias, common method variance, researcher expectancy effects, mischievous responding, etc.), and how these can cause misleading results (for a specific, discovered example, see [47]).
Alternatively, the issue could be considered from the ‘crud factor’ perspective of Meehl [48]. From this perspective, tiny effects are real, though only in the sense that every low-level variable is correlated with every other low-level variable to some degree (i.e. r = 0.00 is rarely strictly true). This alternative explanation returns the dialogue to that of triviality. If every variable is correlated with every other variable to a tiny degree, and in a way that will become statistically significant in large samples, it is still valuable to understand which relationships rise above this ‘crud’ and are worthy of investigation or policy intervention. Otherwise, the argument that video games might be restricted to promote youth mental health may be no more compelling than, quite literally, arguing for the restriction of potatoes or eyeglasses for the same reason [49].
We welcome debate on this issue and challenges to our own position. We believe that this is a discussion worth having and one which extends beyond video game research.

4.2. Other issues

Second, we find no evidence for the assertion that these small effects might accumulate over time. Indeed, we found that longer longitudinal periods were associated with smaller effect sizes, not larger, speaking directly against theories of accumulated effects. This is consistent with older meta-analyses of experimental studies which, likewise, found that longer exposure times were associated with weaker effects [42]. It does, however, differ from a previous meta-analysis that found some evidence for a positive association between longitudinal time and effect size [9]. The Prescott et al. meta-analysis found this effect only for fixed-effects analyses, not for random effects, and random effects would probably have been the more appropriate model given heterogeneity in the data. Further, longitudinal time was treated as a three-level categorical variable rather than the more appropriate continuous variable, which may have introduced statistical artefacts. Our analysis also includes several newer longitudinal studies not included in Prescott et al. As to why longitudinal length is associated with reduced effect size, we can think of two categories of explanation. First, there is a genuine, small effect of interest, but it is relatively short-lived and does not accumulate. Second, there is no genuine effect of interest, and methodological issues such as demand characteristics or mischievous responding tend to have greater impact on short-term outcomes than on long-term ones. Given our observations about widespread methodological limitations in this field and their impact on effect size, we suspect the latter is more likely. As such, for this area at least, we recommend against invoking the accumulation narrative, and suspect it should be used more cautiously in other areas as well unless directly demonstrated through empirical studies.
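For readers unfamiliar with the modelling choice at issue, the sketch below shows what a random-effects meta-regression with longitudinal lag treated as a continuous moderator can look like. It uses hypothetical study-level data and a simplified method-of-moments estimate of between-study variance; it illustrates the general approach, not the analyses actually reported here.

import numpy as np

# Hypothetical per-study correlations, sample sizes and longitudinal lags (months)
r   = np.array([0.09, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01])
n   = np.array([300, 500, 600, 800, 900, 1200, 2000, 1500])
lag = np.array([3.0, 6.0, 12.0, 12.0, 18.0, 24.0, 36.0, 48.0])

y = np.arctanh(r)        # Fisher z transform
v = 1.0 / (n - 3)        # within-study variance of z

# Method-of-moments estimate of between-study variance (tau^2),
# computed from the intercept-only model for simplicity.
w = 1.0 / v
y_bar = np.average(y, weights=w)
q = np.sum(w * (y - y_bar) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / c)

# Random-effects meta-regression: weighted least squares of z on lag,
# treating lag as a continuous moderator rather than a categorical one.
w_star = 1.0 / (v + tau2)
X = np.column_stack([np.ones_like(lag), lag])
W = np.diag(w_star)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print("intercept (z scale):", beta[0], "slope per month (z scale):", beta[1])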
Third, we demonstrate that study quality issues do matter; in particular, the use of standardized and well-validated measures matters. Specifically, the use of high-quality measures was associated with reduced effect sizes. This observation also undercuts claims in a previous meta-analysis of ethnic differences in video game effects [9]. In particular, studies with Latino participants were mostly conducted with a population from Laredo, Texas, and used highly validated measures such as the Child Behaviour Checklist. Thus, Latino ethnicity was conflated with highly validated, standardized aggression measures. Only one other study involved Hispanic, Caucasian American and African American participants, but that study used a non-standard video game assessment and an unstandardized aggression measure, and switched the psychometrics of the aggression measure at the final time point, making longitudinal control difficult [50]. As such, it is not of sufficient quality to examine ethnic differences. Given there are no obvious reasons to think Latinos are immune to game effects whereas non-Latinos are vulnerable, it is more parsimonious to conclude that differences in the quality of the measures used were responsible for the observed ethnic differences.
Studies of higher quality (scoring above the median in best practices) returned an effect size statistically indistinguishable from zero. This suggests that apparent effects may be driven by lower-quality practices that inflate effect size estimates. It is worth noting, too, that issues such as the use of standardized and validated outcomes and other best practices tend to correspond with less citation bias. As such, concerns about best practices and researcher allegiances tend to overlap.
Lastly, studies evidencing citation bias had higher effect sizes than those that did not demonstrate citation bias. This may be an indication of researcher expectancy effects. As such, we recommend increased use of preregistration in empirical studies.
It is worth noting that our effect size of r = 0.059 may not appear very different from the controlled effect size of r = 0.078 obtained by Prescott and colleagues. However, it represents a reduction in explained variance from 0.608% to 0.348%, a relative reduction of approximately 43% (1 − 0.348/0.608 ≈ 0.43). Granted, this reduction would seem more dramatic had the original figure of r = 0.078 been larger. As meta-analyses on violent video games have been repeated over time, there appears to be a consistent downward tendency in their point estimates, declining from 0.15 in early estimates to 0.10 in uncontrolled and 0.078 in controlled longitudinal estimates. Our results show a further reduction towards an effect size of 0 in a preregistered longitudinal meta-analysis employing theoretically relevant controls. Despite disagreement about where the precise line of the smallest effect of interest may lie, downward trends in meta-analytic point estimates over time suggest that we need, as a field, to grapple with precisely where an effect becomes too small to be considered practically meaningful, or risk overstating the importance of our findings.
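The arithmetic behind these figures, for readers who want to verify it (using only the two point estimates quoted above):

# Explained variance (r squared) for the two meta-analytic point estimates
r_new, r_old = 0.059, 0.078
var_new, var_old = r_new ** 2, r_old ** 2
print(f"{var_new:.3%} vs {var_old:.3%}")                  # ~0.348% vs ~0.608%
print(f"relative reduction: {1 - var_new / var_old:.0%}") # ~43%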
Further, we observe that few studies were preregistered. Preregistration can be one means by which research expectancy effects can be reduced. Consistent with observations about upward bias in meta-analyses in other realms, it appears that, across study types, preregistered analyses have been much less likely to find results in support of the video game violence hypotheses than non-preregistered studies. Our meta-analysis is, to our knowledge, the first preregistered meta-analysis in this realm. Preregistration is seldom perfect, of course, and we recognize there will always be some debate and subjectivity in terms of extracting the ideal effect size that best represents a hypothesis; but preregistration of both individual studies and meta-analyses can help make decisions clearer and reduce researcher subjectivity at least partially.
At this juncture, we observe that meta-analytic studies now routinely find that the long-term impacts of violent games on youth aggression are near zero, with larger effect sizes typically associated with methodological quality issues. In some cases, overreliance on statistical significance in meta-analysis may have masked this poor showing for longitudinal studies. We call on both individual scholars and professional guilds such as the American Psychological Association to be more forthcoming about the extremely small observed relationship in longitudinal studies between violent games and youth aggression.
