Cognitive Training: A Field in Search of a Phenomenon. Fernand Gobet, Giovanni Sala. Perspectives on Psychological Science, August 8, 2022. https://doi.org/10.1177/17456916221091830
Abstract: Considerable research has been carried out in the last two decades on the putative benefits of cognitive training on cognitive function and academic achievement. Recent meta-analyses summarizing the extant empirical evidence have resolved the apparent lack of consensus in the field and led to a crystal-clear conclusion: The overall effect of far transfer is null, and there is little to no true variability between the types of cognitive training. Despite these conclusions, the field has maintained an unrealistic optimism about the cognitive and academic benefits of cognitive training, as exemplified by a recent article (Green et al., 2019). We demonstrate that this optimism is due to the field neglecting the results of meta-analyses and largely ignoring the statistical explanation that apparent effects are due to a combination of sampling errors and other artifacts. We discuss recommendations for improving cognitive-training research, focusing on making results publicly available, using computer modeling, and understanding participants’ knowledge and strategies. Given that the available empirical evidence on cognitive training and other fields of research suggests that the likelihood of finding reliable and robust far-transfer effects is low, research efforts should be redirected to near transfer or other methods for improving cognition.
Keywords: cognitive training, meta-analysis, methodology, working memory training
As is clear from the empirical evidence reviewed in the previous sections, the likelihood that cognitive training provides broad cognitive and academic benefits is very low indeed; therefore, resources should be devoted to other scientific questions—it is not rational to invest considerable sums of money in a scientific question that has essentially been answered in the negative. In a recent article, Green et al. (2019) made the exact opposite recommendation—they strongly argued that funding agencies should increase funding for cognitive training. This obviously calls for comment.
The aim of Green et al.’s (2019) article was to provide methodological recommendations and a set of best practices for research on the effect of behavioral interventions aimed at cognitive improvement. Among other things, the issues addressed include the importance of distinguishing between different types of studies (feasibility, mechanistic, efficacy, and effectiveness studies), the types of control groups used, and expectation effects. Many of the points addressed in detail by Green et al. reflected sound and well-known research practices (e.g., the necessity of running studies with sufficient statistical power, the need to define the terminology used, and the importance of replications; see also Simons et al., 2016).
However, the authors made disputable decisions concerning central questions. These include whether superordinate terms such as “cognitive training” and “brain training” should be defined, whether a discussion of methods is legitimate while ignoring the empirical evidence for or against the existence of a phenomenon, the extent to which meta-analyses can compare studies obtained with different methodologies and cognitive-enhancement methods, and whether multiple measures should be used for a latent construct such as intelligence.
Lack of definitions
Although Green et al. (2019) emphasized that “imprecise terminology can easily lead to imprecise understanding and open the possibility for criticism of the field,” they opted not to provide an explicit definition of “cognitive training” (p. 4). Nor did they define the phrase “behavioral interventions for cognitive enhancement,” used throughout their article. Because they specifically excluded activities such as video-game playing and music (p. 3), we surmised that they used “cognitive training” to refer to computer tasks and games that aim to improve or maintain cognitive abilities such as WM. The term “brain training” is sometimes used to describe these activities, although it should be mentioned that Green et al. objected to the use of that term.
Note that researchers investigating the effects of activities implicitly or explicitly excluded by Green et al. (2019) have emphasized that the aim of those activities is to improve cognitive abilities and/or academic achievement, for example, chess (Jerrim et al., 2017; Sala et al., 2015), music (Gordon et al., 2015; Schellenberg, 2006), and video-game playing (Bediou et al., 2018; Feng et al., 2007). For example, Gordon et al.’s (2015) abstract concluded by stating that “results are discussed in the context of emerging findings that music training may enhance literacy development via changes in brain mechanisms that support both music and language cognition” (p. 1).
Green et al. (2019) provided a rationale for not providing a definition. Referring to “brain training,” they wrote:
We argue that such a superordinate category label is not a useful level of description or analysis. Each individual type of behavioral intervention for cognitive enhancement (by definition) differs from all others in some way, and thus will generate different patterns of effects on various cognitive outcome measures. (p. 4)
They also noted that even using subcategories such as “working-memory training” is questionable, although they acknowledged that “there is certainly room for debate” (p. 4) about whether to focus on each unique type of intervention or to group interventions into categories.
In line with common practice (e.g., De Groot, 1969; Elmes et al., 1992; Pedhazur & Schmelkin, 1991), we take the view that definitions are important in science. Therefore, in this article, we have proposed a definition of “cognitive training” (see “Defining Terms” section above), which we have used consistently in our research.
Current state of knowledge and meta-analyses
A sound discussion of methodology in a field depends on the current state of knowledge in that field. Whereas Green et al. (2019) used information gleaned from previous and current cognitive-training research to recommend best practices (e.g., the use of previous studies to estimate the sample size needed for well-powered experiments), they also explicitly stated that they would not discuss previous controversies. We believe that this is a mistake because, as just noted, the choice of methods is conditional on the current state of knowledge. In our case, a crucial ingredient of this state is whether cognitive-training interventions are successful—specifically, whether they lead to far transfer. One of the main “controversies” concerns precisely this question, and thus it is unwise to ignore it.
Green et al. (2019) were critical of meta-analyses and argued that studies cannot be compared:
For example, on the basic research side, the absence of clear methodological standards has made it difficult-to-impossible to easily and directly compare results across studies (either via side-by-side contrasts or in broader meta-analyses). This limits the field’s ability to determine what techniques or approaches have shown positive outcomes, as well as to delineate the exact nature of any positive effects – e.g., training effects, transfer effects, retention of learning, etc. (p. 3)
These comments wholly underestimate what can be concluded from meta-analyses. Like many other researchers in the field, Green et al. (2019) assumed that (a) the literature is mixed and, consequently, (b) the inconsistent results depend on differences in methodologies between researchers. However, assuming that there is some between-studies inconsistency and then speculating about where this inconsistency stems from is not scientifically apposite (see “The Importance of Sampling Error and Other Artifacts” section above). Rather, quantifying the between-studies true variance (τ²) should be the first step.
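To make this first step concrete, the following minimal sketch shows one common way of quantifying τ², the DerSimonian-Laird estimator. The effect sizes and variances are hypothetical placeholders, not data from any of the meta-analyses discussed here, and the actual analyses in this literature typically rely on more sophisticated random-effects or multilevel models.

```python
import numpy as np

def dersimonian_laird_tau2(effects, variances):
    """Estimate the between-studies true variance (tau^2) with the
    DerSimonian-Laird method, given per-study effect sizes and their
    sampling variances."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances                       # inverse-variance (fixed-effect) weights
    k = len(effects)
    pooled = np.sum(w * effects) / np.sum(w)  # fixed-effect pooled estimate
    q = np.sum(w * (effects - pooled) ** 2)   # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (k - 1)) / c)        # truncated at zero

# Hypothetical standardized mean differences (g) and their sampling variances
g = [0.30, 0.05, -0.10, 0.20, 0.00]
v = [0.04, 0.03, 0.05, 0.04, 0.03]
print(dersimonian_laird_tau2(g, v))  # a value near zero indicates little true heterogeneity
```

A τ² estimate near zero, as in this toy example, means that the observed spread of effect sizes is no larger than what sampling error alone would produce.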
Using latent factors
In the section “Future Issues to Consider With Regard to Assessments,” Green et al. (2019, pp. 16–17) raised several issues with using multiple measures for a given construct such as WM. This practice has been recommended by authors such as Engle et al. (1999) to reduce measurement error. Several of Green et al.’s arguments merit discussion.
A first argument is that using latent factors—as in confirmatory factor analysis—might hinder the analysis of more specific effects. This argument is incorrect because the relevant information is still available to researchers (see Kline, 2016; Loehlin, 2004; Tabachnick & Fidell, 1996). By inspecting factor loadings, one can examine whether the preassessment/postassessment changes (if any) affect the latent factor or only specific tests (this is a longitudinal-measurement-invariance problem). Green et al. (2019) seemed to equate multi-indicator composites (e.g., summed z scores) with latent factors. Composite measures are the result of averaging or summing across a number of observed variables and cannot tell us much about any task-specific effect. A latent factor, by contrast, is a mathematical construct derived from a covariance matrix within a structural model that includes a set of parameters linking the latent factor to the observed variables. That being said, using multi-indicator composites would still be an improvement over the current standards in the field.
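To make the distinction explicit, here is a minimal sketch (in our notation, not Green et al.'s) for k standardized indicator scores z_1, ..., z_k:

```latex
% Multi-indicator composite: an unweighted aggregate of the observed scores
C = \sum_{i=1}^{k} z_i
% Latent-factor measurement model: each observed score loads on the factor \eta
z_i = \lambda_i \eta + \varepsilon_i, \qquad i = 1, \dots, k
% Model-implied covariance matrix of the indicators
% (\lambda = vector of loadings, \phi = factor variance, \Theta = residual covariances)
\Sigma = \phi \, \lambda \lambda^{\top} + \Theta
```

Inspecting the estimated loadings λ_i, and testing whether they (and the intercepts) remain equal from preassessment to postassessment, is what allows researchers to tell whether a change is located at the level of the factor or at the level of specific tests; a composite C, by construction, cannot separate the two.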
A second argument is that large batteries of tests induce motivational and/or cognitive fatigue in participants, especially in particular populations such as older adults. Although this may be true, large batteries have been used in several cognitive-training studies, and participants were able to complete a wide variety of tests (e.g., Guye & von Bastian, 2017). Nevertheless, instead of assessing many different constructs, it may be preferable to focus on one or two constructs at a time (e.g., fluid intelligence and WM). Such a practice would help reduce the number of tasks and the amount of fatigue.
Another argument concerns carryover and learning effects. The standard solution is to randomize the presentation order of the tasks. This procedure, which ensures that bias gets close to zero as the number of participants increases, is generally effective when there is no reason to expect an interaction between treatment and order (Elmes et al., 1992). Under the same condition, another approach can be used: counterbalancing the order of the tasks. However, complete counterbalancing is difficult with large numbers of tasks, and in this case, one often has to be content with incomplete counterbalancing using a Latin square (for a detailed discussion, see Winer, 1962).
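As an illustration, a minimal sketch of generating a cyclic Latin square of task orders (the task names are hypothetical placeholders). Note that a simple cyclic square balances serial position but not immediate carryover; balancing first-order carryover would require a Williams design.

```python
def latin_square_orders(tasks):
    """Generate a cyclic Latin square of task orders: each task appears
    exactly once in every serial position across the set of orders."""
    n = len(tasks)
    return [[tasks[(i + j) % n] for j in range(n)] for i in range(n)]

# Hypothetical battery of four outcome tasks
orders = latin_square_orders(["n-back", "Raven", "operation span", "Stroop"])
for order in orders:
    print(order)
# Participants are then assigned to the n orders in rotation.
```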
A final point made by Green et al. (2019) is that using large batteries of tasks increases the rate of Type I errors. Although this point is correct, it is not an argument against multi-indicator latent factors. Rather, it is an argument in their favor, because latent factors do not suffer from this bias. In addition, latent factors aside, there are many methods designed for correcting α (i.e., the significance threshold) for multiple comparisons (e.g., Bonferroni, Holm, false-discovery rate). Increased Type I error rates are a concern only when researchers ignore the problem and do not apply any correction.
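For instance, here is a minimal sketch of Holm's step-down procedure, one of the corrections mentioned above; the p values are hypothetical.

```python
def holm_correction(pvals, alpha=0.05):
    """Holm's step-down procedure: returns booleans indicating which
    hypotheses are rejected while controlling the familywise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by p value
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # compare p_(j) with alpha / (m - j + 1)
            reject[i] = True
        else:
            break  # once one test fails, all larger p values also fail
    return reject

# Hypothetical p values from a battery of transfer measures
print(holm_correction([0.001, 0.04, 0.03, 0.20]))  # [True, False, False, False]
```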
One reasonable argument is that latent-factor analysis requires large numbers of participants. A solution is offered by multilab trials. The ACTIVE trial—the largest experiment carried out in the field of cognitive training—was, indeed, a multisite study (Rebok et al., 2014). Another multisite cognitive-training experiment is currently ongoing (Mathan, 2018).
To conclude this section, we emphasize two points. First, it is well known that single tests generally possess low reliability. Second, multiple measures are needed to understand whether improvements occur at the level of the test (e.g., n-back) or at the level of the construct (e.g., WM).
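The first point can be illustrated with the Spearman-Brown prophecy formula: under the idealized assumption of k parallel tests, each with reliability ρ, the reliability of their aggregate is

```latex
\rho_{kk} = \frac{k\rho}{1 + (k - 1)\rho}
```

so, for example, four parallel tests with individual reliability .60 yield an aggregate reliability of roughly .86, which is one reason multiple measures are preferable to single tests.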
Some methodological recommendations
We are not so naive as to believe that our analysis will deter researchers in the field from carrying out much more research on the putative far-transfer benefits of cognitive training, despite the lack of any supporting empirical evidence. We thus provide some advice about the directions that should be taken so that not all resources are spent in search of a chimera.
Making methods and results accessible, avoiding piecemeal publication, and reporting results objectively
We broadly agree with the methodological recommendations made by Green et al. (2019), such as reporting not only p values but also effect sizes and confidence intervals, and the need for well-powered studies. We add a few important recommendations (for a summary of the recommendations made throughout this article, see Table 3). To begin with, it is imperative to put the data, analysis code, and other relevant information online. In addition to providing a supplementary backup, this allows other researchers to closely replicate the studies and to carry out additional analyses (including meta-analyses)—important requirements in scientific research. By the same token, and in the spirit of Open Science, researchers should reply to requests from meta-analysts asking for summary data and/or the original data. In our experience, the response rate is currently 20% to 30% at best (e.g., Sala et al., 2018). Although we understand that it may be difficult to answer such requests positively when the data were collected 20 years or more ago, there is no excuse for data collected more recently.
Just like other questionable research practices, piecemeal publication should be avoided (Hilgard et al., 2019). If dividing the results of a study into several articles cannot be avoided, the articles should clearly and unambiguously indicate the fact that this has been done and should reference the articles sharing the results.
There is one point made by Green et al. (2019) with which we wholeheartedly agree: the necessity of reporting results correctly and objectively without hyperbole and incorrect generalization. The field of cognitive training is littered with exaggerations and overinterpretations of results (see Simons et al., 2016). A fairly common practice is to focus on the odd statistically significant result even though most of the tests turn out nonsignificant. This is obviously capitalizing on chance and should be avoided at all costs.
In a similar vein, there is a tendency to overinterpret the results of studies using neuroscience methods. A striking example was recently offered by Schellenberg (2019), who showed that in a sample of 114 journal articles published in the last 20 years on the effects of music training, causal inferences were often made although the data were only correlational; neuroscientists committed this logical fallacy more often than psychologists. There was also a rigid focus on learning and the environment and a concurrent neglect of alternative explanations, such as innate differences. Another example consists of inferring far transfer when neuroimaging effects are found but behavioral effects are not; such an inference is illegitimate.
The need for detailed analyses and computational models
As a way forward, Green et al. (2019) recommended well-powered studies with large numbers of participants. In a similar vein, and focusing on n-back training, Pergher et al. (2020) proposed large-scale studies isolating promising features. We believe that such an atheoretical approach is unlikely to succeed. There is an indefinite space of possible interventions (e.g., varying the type of training task, the cover story used in a game, the perceptual features of the material, the pace of presentation, and so on ad infinitum), which means that searching this space blindly and nearly randomly would require a prohibitive amount of time. Strong theoretical constraints are needed to narrow down the search space.
There is thus an urgent need to understand which cognitive mechanisms might lead to cognitive transfer. As we showed above in the section on meta-analysis, the available evidence indicates that the true effect size of cognitive training on far transfer is zero. Prima facie, this outcome indicates that theories based on general mechanisms, such as brain plasticity (Karbach & Schubert, 2013), primitive elements (Taatgen, 2013), and learning to learn (Bavelier et al., 2012), are incorrect when it comes to far transfer. We reach this conclusion by a simple application of modus tollens: (a) Theories based on general mechanisms such as brain plasticity, primitive elements, and learning to learn predict far transfer. (b) The empirical evidence shows that there is no far transfer. Therefore, (c) theories based on general mechanisms such as brain plasticity, primitive elements, and learning to learn are incorrect.
Thus, if one believes that cognitive training leads to cognitive enhancement—most likely limited to near transfer—one has to come up with other theoretical mechanisms than those currently available in the field. We recommend two approaches to identify such mechanisms, which we believe should be implemented before large-scale randomized controlled trials are carried out.
Fine analyses of the processes in play
The first approach is to use experimental methods enabling the identification of cognitive mechanisms. Cognitive psychology has a long history of refining such methods, and we limit ourselves to just a few pointers. A useful source of information is fine-grained data, such as eye movements, response times, and even mouse location and mouse clicks. Together with hypotheses about the processes carried out by participants, these data make it possible to rule out some mechanisms while making others more plausible. Another method is to design experiments that specifically test particular theoretical mechanisms. Note that this goes beyond establishing that a cognitive intervention leads to some benefit compared with a control group; the aim is also to understand the specific mechanisms that lead to this superiority.
It is highly likely that the strategies used by the participants play a role in the training, pretests, and posttests used in cognitive-training research (Sala & Gobet, 2019; Shipstead et al., 2012; von Bastian & Oberauer, 2014). It is essential to understand these strategies and the extent to which they differ between participants. Are they linked to a specific task or a family of tasks (near transfer), or are they general across many different tasks (far transfer)? If it turns out that such general strategies exist, can they be taught? What do they tell researchers about brain plasticity and changing basic cognitive abilities such as general intelligence?
Two studies that investigated the effects of strategies deserve mention here. Laine et al. (2018) found that instructing participants to employ a visualization strategy when performing n-back training improved performance. In a replication and extension of this study, Forsberg et al. (2020) found that the taught visualization strategy improved some of the performance measures in novel n-back tasks. However, older adults benefited less, and there was no improvement in WM tasks structurally different from n-back tasks. In the uninstructed participants, n-back performance correlated with the type of spontaneous strategies used and their level of detail. The types of strategies also differed as a function of age.
A final useful approach is to carry out a detailed task analysis (e.g., Militello & Hutton, 1998) of the activities involved in a specific regimen of cognitive training and in the pretests and posttests used. What are the overlapping components? What are the critical components and those that are not likely to matter in understanding cognitive training? These components can be related to information about eye movements, response times, and strategies and can be used to inspire new experiments. The study carried out by Baniqued et al. (2013) provides a nice example of this approach. Using task analysis, they categorized 20 web-based casual video games into four groups (WM, reasoning, attention, and perceptual speed). They found that performance in the WM and reasoning games was strongly associated with memory and fluid-intelligence abilities, measured by a battery of cognitive tasks.
Cognitive modeling as a method
The second approach we propose consists of developing computational models of the postulated mechanisms, which of course should be consistent with what is known generally about human cognition (for a similar argument, see Smid et al., 2020). To enable an understanding of the underlying mechanisms and be useful in developing cognitive-training regimens, the models should be able to simulate not only the tasks used as pretests and posttests but also the training tasks. This is what Taatgen’s (2013) model does: It first simulates improvement in a complex verbal WM task over 20 training sessions and then simulates how WM training reduces interference in a Stroop task compared with a control group. (We would, of course, query whether this far-transfer effect is genuine.) By contrast, Green, Pouget, and Bavelier’s (2010) neural-network and diffusion-to-bound models simulate the transfer tasks (a visual-motion-direction discrimination task and an auditory-tone-location discrimination task) but not the training task, action video-game playing. Ideally, a model of the effect of an action video game should simulate the actual training (e.g., playing Call of Duty 2), processing the actual stimuli involved in the game. To our knowledge, no such model exists. Note that given current developments in technology, modeling such a training task is not unrealistic.
The models should also be able to explain data at a micro level, including eye movements and verbal protocols (to capture strategies). There is also a need for the models to use exactly the same stimuli as those used in the human experiments. For example, the chunk hierarchy and retrieval structures model of chess expertise (De Groot et al., 1996; Gobet & Simon, 2000) receives as learning input the kind of board positions that players are likely to meet in their practice. When simulating experiments, the same stimuli are used as those employed with human players, and close comparison is made between predicted and actual behavior along a number of dimensions, including percentage of correct responses, number and type of errors, and eye movements. In the field of cognitive training, Taatgen’s (2013) model is a good example of the proper level of granularity for understanding far transfer. Note that, ideally, the models should be able to predict possible confounds and how modifications to the design of training would circumvent them. Indeed, we recommend that considerable resources be invested in this direction of research with the aim of testing interventions in silico before testing them in vivo (Gobet, 2005). Only those interventions that lead to benefits in simulations should be tested in trials with human participants. In addition to embodying sound principles of theory development and testing, such an approach would also lead to considerable savings of research money in the medium and long terms.
Searching for small effects
Green et al. (2019, p. 20) recognized the possibility that large effects are unlikely and that one should be content with small effects. They were also open to the possibility of using unspecific effects, such as expectation effects. It is known that many educational interventions produce modest effects (Hattie, 2009), and thus the question arises as to whether cognitive-training interventions are more beneficial than alternative ones. We argue that many other interventions are cheaper and/or have specific benefits when they directly match educational goals. For example, games related to mathematics are more likely to improve one’s mathematical knowledge and skills than n-back tasks, and they can be cheaper and more fun.
If cognitive training leads only to small and unspecific effects, two implications follow, one practical and one theoretical. Practically, the search for effective training features has to operate blindly, which is very inefficient: As noted above, the current leading theories in the field are incorrect, and thus there is no theoretical guidance. Effectiveness studies are therefore unlikely to yield positive results. Theoretically, if the effectiveness of training depends on small details of the training and the pretest/posttest measures, then the prospects of generalization beyond specific tasks are slim to null. This is unsatisfactory scientifically because science progresses by uncovering general laws and finding order in apparent chaos (e.g., the state of chemistry before and after Mendeleev’s discovery of the periodic table of elements).
A straightforward explanation can be proposed for the pattern of results found in our meta-analyses with respect to far transfer—small-to-zero effect sizes and low or null true between-studies variance. Positive effect sizes are just what can be expected from chance, features of design (e.g., active vs. passive control groups), regression to the mean, and sometimes publication bias. (If you believe that explanations based on chance are not plausible, consider Galton’s board: It perfectly illustrates how a large number of small effects can lead to a normal distribution. Likewise, in cognitive training, multiple variables and mechanisms lead to some experiments having a positive effect and others a negative effect, with most experiments centered around the mean of the distribution.) Thus, the search for robust and replicable effects is unlikely to be successful.
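The Galton-board argument can be illustrated with a toy simulation; the number and size of the chance influences are purely illustrative assumptions, not a model of any actual study.

```python
import random

def simulate_study_effects(n_studies=1000, n_influences=50, seed=1):
    """Toy illustration of the Galton-board argument: each simulated study's
    observed effect is the sum of many small, direction-less influences
    (sampling error, design features, regression to the mean, ...).
    The resulting distribution is approximately normal and centered on zero,
    so some studies look 'positive' purely by chance."""
    random.seed(seed)
    effects = []
    for _ in range(n_studies):
        effect = sum(random.uniform(-0.02, 0.02) for _ in range(n_influences))
        effects.append(effect)
    return effects

effects = simulate_study_effects()
mean_effect = sum(effects) / len(effects)
share_positive = sum(e > 0.1 for e in effects) / len(effects)
print(round(mean_effect, 3), round(share_positive, 3))
# The mean is near zero, yet a minority of simulated studies show
# an apparently meaningful positive effect.
```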
Note that the issue with cognitive training is not the lack of replications and the lack of reproducibility, which plague large swathes of psychology: The main results have been replicated often and form a highly coherent pattern when results are put together in (meta-)meta-analyses. Pace Pergher et al. (2020), we do not believe that variability of methods is an issue. On the contrary, the main outcomes are robust to experimental variations. Indeed, results obtained with many different training and evaluation methods converge (small-to-zero effect sizes and low true heterogeneity) and thus satisfy a fundamental principle in scientific research: the principle of triangulation (Mathison, 1988).
Funding agencies
Although Green et al.’s (2019) article is explicitly about methodology, it does make recommendations for funding agencies and lobbies for more funding: “We feel strongly that an increase in funding to accommodate best practice studies is of the utmost importance” (p. 17). On the one hand, this move is consistent with the aims of their article in that several of the suggested practices, such as using large samples and performing studies that would last for several years, would require substantial amounts of money to be carried out. On the other hand, lobbying for an increase in funding is made without any reference to results showing that cognitive training might not provide the hoped-for benefits. The authors only briefly discussed the inconsistent evidence for cognitive training, concluding that “our goal here is not to adjudicate between these various positions or to rehash prior debates” (p. 3). However, in general, rational decisions about funding require an objective evaluation of the state of the research. Obviously, if the research is about developing methods for cognitive enhancement, funders must take into consideration the extent to which the empirical evidence supports the hypothesis that the proposed methods provide domain-general cognitive benefits. As we showed in the “Meta-Analytical Evidence” section, there is little to null support for this hypothesis. Thus, our advice for funders is to base their decisions on the available empirical evidence and on the conclusions reached by meta-analyses.
As discussed earlier, our meta-analyses clearly show that cognitive training does not lead to any far transfer in any of the cognitive-training domains that have been studied. In addition, using second-order meta-analysis made it possible to show that the between-meta-analyses true variance is due to second-order sampling error and thus that the lack of far transfer generalizes to different populations and different tasks. Taking a broader view suggests that our conclusions are not surprising and are consistent with previous research. In fact, they were predictable. Over the years, it has been difficult to document far transfer in experiments (Singley & Anderson, 1989; Thorndike & Woodworth, 1901), industrial psychology (Baldwin & Ford, 1988), education (Gurtner et al., 1990), and research on analogy (Gick & Holyoak, 1983), intelligence (Detterman, 1993), and expertise (Bilalić et al., 2009). Indeed, theories of expertise emphasize that learning is domain-specific (Ericsson & Charness, 1994; Gobet & Simon, 1996; Simon & Chase, 1973). When putting this substantial set of empirical evidence together, we believe that it is possible to conclude that the lack of training-induced far transfer is an invariant of human cognition (Sala & Gobet, 2019).
Obviously, this conclusion conflicts with the optimism displayed in the field of cognitive training, as exemplified by Green et al.’s (2019) article discussed above. However, it is in line with skepticism recently expressed about cognitive training (Moreau, 2021; Moreau et al., 2019; Simons et al., 2016). It also raises the following critical epistemological question: Given that the overall evidence in the field of cognitive training strongly suggests that the postulated far-transfer effects do not exist, and thus the probability of finding such effects in future research is very low, should one conclude that the reasonable course of action is to stop performing cognitive-training research on far transfer?
We believe that the answer to this question is “yes.” Given the clear-cut empirical evidence, the discussion about methodological concerns is irrelevant, and the issue becomes searching for other cognitive-enhancement methods. However, although the hope of finding far-transfer effects is tenuous, the available evidence clearly supports the presence of near-transfer effects. In many cases, near-transfer effects are useful (e.g., with respect to older adults’ memory), and developing effective methods for improving near transfer is a valuable—and importantly, realistic—avenue for further research.