Thursday, November 26, 2020

Questions are raised regarding the plausibility of certain reports with effect sizes comparable to, or in excess of, the effect sizes found in maximal positive controls

Maximal positive controls: A method for estimating the largest plausible effect size. Joseph Hilgard. Journal of Experimental Social Psychology, Volume 93, March 2021, 104082. https://doi.org/10.1016/j.jesp.2020.104082

Highlights

• Some reported effect sizes are too big for the hypothesized process.

• Simple, obvious manipulations can reveal which effects are too big.

• A demonstration is provided examining an implausibly large effect.

Abstract: Effect sizes in social psychology are generally not large and are limited by error variance in manipulation and measurement. Effect sizes exceeding these limits are implausible and should be viewed with skepticism. Maximal positive controls, experimental conditions that should show an obvious and predictable effect, can provide estimates of the upper limits of plausible effect sizes on a measure. In this work, maximal positive controls are conducted for three measures of aggressive cognition, and the effect sizes obtained are compared to studies found through systematic review. Questions are raised regarding the plausibility of certain reports with effect sizes comparable to, or in excess of, the effect sizes found in maximal positive controls. Maximal positive controls may provide a means to identify implausible study results at lower cost than direct replication.

Keywords: Violent video games; Aggression; Aggressive thought; Positive controls; Scientific self-correction


5. General discussion

Maximal positive controls can provide a cost-effective way to establish the upper bound of plausible effect sizes in a measure. These upper bounds can be useful in detecting errors in previously published literature. Although implausibly large effect sizes may indicate errors in data collection, errors in analysis, or even possible misconduct, it has been my experience that journals are reluctant to issue expressions of concern for implausibly large results. This reluctance may be caused by the difficulty in determining which results are “too big”—a subjective decision that depends on the judgments and expectations of individual researchers and editors. These individual judgments may be better aligned through the empirical support provided by the collection of maximal positive controls. In this way, maximal positive controls might help identify erroneous reports by providing an empirical estimate of how big is too big.
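To make this logic concrete, here is a minimal Python sketch of the comparison the method implies. All numbers, study names, and sample sizes below are invented for illustration; the sketch simply computes Cohen's d for a simulated maximal positive control and flags reported effects that meet or exceed that empirical ceiling.

import numpy as np

def cohens_d(group1, group2):
    # Cohen's d using the pooled sample standard deviation.
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * np.var(group1, ddof=1) +
                  (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(group1) - np.mean(group2)) / np.sqrt(pooled_var)

# Simulated maximal-positive-control data: an extreme, obvious
# manipulation applied to the same measure used in the literature.
rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 80)
maximal = rng.normal(2.5, 1.0, 80)
d_max = cohens_d(maximal, control)  # empirical answer to "how big is too big?"

# Flag reported effects at or above the maximal benchmark.
for study, d in {"Study A": 0.8, "Study B": 2.8}.items():
    verdict = "exceeds maximal control" if d >= d_max else "within plausible range"
    print(f"{study}: d = {d:.2f} ({verdict})")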

The three examples provided here revealed some possibly erroneous reports. Study 1 suggests that even the largest effect sizes observed on the story completion task should nevertheless be smaller than those repeatedly reported by Hasan et al. (2012, 2013, 2015). This indicates some manner of confound or error in those studies. Because of this likely error, it is not clear that the inferences from Hasan et al. (2013) are correct: Violent video games might not increase hostile-world beliefs and aggressive behavior, hostile-world beliefs might not mediate effects of violent games on aggressive behavior, and effects of violent video games (if any) might not accumulate from day to day. To my knowledge, the only other such long-term experiment was that of Kühn et al. (2019), who observed that two months of Grand Theft Auto V caused an increase in word completion task scores, but no significant increase on a measure of aggressive world view, an aggressive-cognition lexical decision task, or the Buss-Perry aggression questionnaire. New research will be necessary to test these claims.

Study 3 similarly suggests that even the strongest aggressive-emotion Stroop effect should not exceed about 400 ms. A review of the literature finds a few aggressive-emotion Stroop differences of comparable or greater magnitude (Smeijers et al., 2018; Sun et al., 2019). There may be value in double-checking the accuracy of these reports.

In Study 2, by contrast, no studies using the word completion task approached the large effects found using maximal positive controls. Although individual differences in verbal skill may still represent a source of nuisance variance in this task, such differences do not seem to substantially limit the effect sizes one could obtain on this measure.

Researchers using these tasks may benefit from considering the effect size estimates in this study as benchmarks. For example, in the story completion task, if the difference between a peaceful architect and a mass murderer is d = 2.5, and the difference between that architect and an extreme sports enthusiast is d = 1.3, researchers should expect to find smaller effect sizes when using subtler manipulations and asking about the task's usual generic characters like “Todd” and “Jane.” Similarly, in the aggressive-emotion Stroop, researchers should expect to find emotion Stroop effects of no more than 400 ms. When researchers estimate how many trials per participant or participants per study they should collect, reference to these estimates may help to inform power analyses by suggesting firm upper limits on even the most optimistic of effect size estimates. In the future, researchers may be able to develop heuristics about the typical ratio between effect sizes observed in a maximal positive control and in primary research.
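As a worked example of using these benchmarks in planning, the following sketch uses the statsmodels power module to cap an optimistic planning effect size at the maximal-positive-control estimate before solving for sample size. Only the d = 1.3 ceiling comes from the study described above; the d = 0.4 planning value is an invented placeholder for a subtle manipulation.

from statsmodels.stats.power import TTestIndPower

d_ceiling = 1.3   # architect vs. extreme sports enthusiast (Study 1)
d_expected = 0.4  # hypothetical optimistic guess for a subtle manipulation
d_planning = min(d_expected, d_ceiling)  # never plan above the ceiling

# Per-group n for a two-sided independent-samples t-test.
n_per_group = TTestIndPower().solve_power(effect_size=d_planning,
                                          alpha=0.05, power=0.80,
                                          alternative='two-sided')
print(f"Planning d = {d_planning}: about {n_per_group:.0f} per group")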

One last practical suggestion can be made regarding the administration of the story completion task. Researchers can benefit from considering the influence of the different story stems, which elicited different mean scores. Although it is desirable to use multiple task stimuli to improve the task's generalizability, failing to model the effects of stimulus will leave those effects as error variance, reducing the effect size and degrading study power. The Condition × Scenario interaction suggests that the car accident scenario may be more sensitive than the other scenarios, perhaps by avoiding a floor effect.
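One way to implement this suggestion is to model the story stem as a fixed effect, along with its interaction with condition, so that stimulus variance is estimated rather than left in the error term. The sketch below, built on simulated data with invented column names and effect values, shows the idea with statsmodels' mixedlm; a random intercept per participant absorbs between-person variance.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a within-subjects story completion task: every participant
# responds to three story stems; stems differ in mean aggressiveness.
rng = np.random.default_rng(7)
stem_means = {"car_accident": 3.0, "restaurant": 2.2, "party": 2.5}
rows = []
for pid in range(60):
    condition = "violent" if pid % 2 else "control"
    for scenario, base in stem_means.items():
        score = base + (0.5 if condition == "violent" else 0.0) + rng.normal(0, 1)
        rows.append({"participant": pid, "condition": condition,
                     "scenario": scenario, "score": score})
data = pd.DataFrame(rows)

# Fixed effects for condition, scenario, and their interaction;
# random intercept for each participant.
fit = smf.mixedlm("score ~ condition * scenario", data,
                  groups=data["participant"]).fit()
print(fit.summary())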

Researchers are encouraged to use maximal positive controls to inspect the plausibility of effect sizes reported in their literatures. Maximal positive controls may be collected at lower cost than direct replications. Because maximal positive controls are deliberately dissimilar from original studies, they may also avoid some concerns common to direct replications such as omitted moderators (Stroebe & Strack, 2014), contextual sensitivity of effects (Van Bavel, Mende-Siedlecki, Brady, & Reinero, 2016), or the presence or absence of researcher “flair” (Baumeister, 2016). These concerns may be avoided when there is a strong logical case that the maximal positive control should yield an effect strictly larger than the original work. Through the use of this method, researchers may learn more about the properties of their measurements, the range of plausible effect sizes, and the quality of research data, thereby facilitating faster scientific self-correction and improving the quality of data used in theory development.
