No negative Flynn effect in France: Why variations of intelligence should not be assessed using tests based on cultural knowledge. Corentin Gonthier, Jacques Grégoire, Maud Besançon. Intelligence, Volume 84, January–February 2021, 101512. https://doi.org/10.1016/j.intell.2020.101512
Highlights
• We tested the claim that intelligence decreases in France (negative Flynn effect).
• We re-analyzed the original data (Dutton & Lynn, 2015) and collected a new sample.
• Performance only decreases on tests involving declarative knowledge, not reasoning.
• This is attributable to measurement bias for older items, due to cultural changes.
• There is fluctuation of knowledge, but no overall negative Flynn effect in France.
Abstract: In 2015, Dutton and Lynn published an account of a decrease of intelligence in France (negative Flynn effect), which had considerable societal impact. This decline was argued to be biological. However, there is good reason to be skeptical of these conclusions. The claim of intelligence decline was based on the finding of lower scores on the WAIS-III (normed in 1999) for a recent sample, but careful examination of the data suggests that this decline was in fact limited to subtests with a strong influence of culture-dependent declarative knowledge. In Study 1, we re-analyzed the data used by Dutton and Lynn (2015) and showed that only subtests of the WAIS primarily assessing cultural knowledge (Gc) demonstrated a significant decline. Study 2 replicated this finding and confirmed that performance was constant on other subtests. An analysis of differential item functioning in the five subtests with a decline showed that about one fourth of all items were significantly more difficult for subjects in a recent sample than in the original normative sample, for an equal level of ability. Decline on a subtest correlated at 0.95 with its cultural load. These results confirm that there is currently no evidence for a decrease of intelligence in France, with prior findings being attributable to a drift of item difficulty for older versions of the WAIS, due to cultural changes. This highlights the role of culture in Wechsler's intelligence tests and indicates that when interpreting (negative) Flynn effects, the past should really be treated as a different country.
Keywords: Flynn effect; Negative Flynn effect; Fluid intelligence; Crystallized intelligence; Differential item functioning (DIF)
5. General discussion
The results of both Study 1 and Study 2 unambiguously indicated that there was no negative Flynn effect in France, in the sense of a general decrease of intelligence or a decrease in the ability to perform logical reasoning: there were no reliable differences between the WAIS-III and WAIS-IV for any of the subtests reflecting visuo-spatial reasoning (Gf and Gv) or working memory and processing speed (Gsm and Gs), all of which are based on abstract materials. We did find lower total performance on the WAIS-III for a recent sample, but contrary to the classic Flynn effect, this difference between cohorts was exclusively driven by the five subtests involving Gc: acquired declarative knowledge tied to a specific cultural setting.
When examined at the level of item content, this decrease on subtests involving declarative knowledge largely reflected not an actual decrease of ability, but measurement bias due to differences in item difficulty for samples collected at different dates. All in all, in the five subtests demonstrating a decline, about one fourth of items were comparatively more difficult for the 2019 sample than for the 1999 sample, for an equal level of ability. These differences could be traced to a few specific skills. All but one of the Information items that were biased against the recent sample related to the names of famous people, and the biased Comprehension items all related to civic education; interestingly, the test publisher decided to practically eliminate both topics from the WAIS-IV. All but one of the biased Arithmetic items required mentally computing divisions or proportions. For Vocabulary, the negative net effect of bias was partly compensated by the fact that some words were easier in the recent sample, a pattern more consistent with a change in word frequency than with an absolute decrease in vocabulary skill. In all cases, these increases in item difficulty for a recent sample could be attributed to environmental changes in school programs, topics covered by the media, and other societal evolutions.
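To make this screening logic concrete, the following is a minimal sketch of one standard DIF-detection procedure, the logistic-regression approach of Swaminathan and Rogers (1990); it is not necessarily the procedure used in the present analyses, and the data and variable names (`responses`, `group`) are hypothetical placeholders rather than the study samples.

```python
# Minimal sketch of logistic-regression DIF screening (Swaminathan & Rogers,
# 1990). Hypothetical data: `responses` is an (n_subjects, n_items) 0/1 array,
# `group` codes the cohort (0 = 1999 normative sample, 1 = 2019 sample).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def dif_logistic(responses, group, item):
    y = responses[:, item]
    # Rest score (total minus the studied item) as the matching ability proxy.
    ability = responses.sum(axis=1) - y
    # Null model: the item response depends on ability only.
    m0 = sm.Logit(y, sm.add_constant(ability)).fit(disp=0)
    # DIF model: adds group (uniform DIF) and group x ability (non-uniform DIF).
    X1 = sm.add_constant(np.column_stack([ability, group, group * ability]))
    m1 = sm.Logit(y, X1).fit(disp=0)
    lr = 2 * (m1.llf - m0.llf)          # likelihood-ratio statistic, df = 2
    return lr, chi2.sf(lr, df=2)

# Screen every item; flag those where the cohorts differ at equal ability.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(400, 25))   # placeholder item scores
group = rng.integers(0, 2, size=400)             # placeholder cohort labels
for j in range(responses.shape[1]):
    lr, p = dif_logistic(responses, group, j)
    if p < 0.05:
        print(f"item {j}: LR = {lr:.2f}, p = {p:.3f}")
```

An item flagged by this test is harder (or easier) for one cohort than for the other at a matched level of ability, which is precisely the signature of measurement bias rather than of a change in the underlying trait.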
The fact that the performance decrease on a subtest correlated at 0.95 with its cultural load confirms this conclusion and runs counter to the interpretation that the observed decline is caused by biological factors (Woodley of Menie & Dunkel, 2015). This does not completely rule out biological factors, as cultural loads are not pure indicators of cultural influences: a possible alternative interpretation, suggested by Edward Dutton and Woodley of Menie, is that a genetic decrease in fluid reasoning could negatively affect the culture of a country, in turn reverberating on Gc subtests (see Dutton et al., 2017; this is a variant of investment theory and of explanations assuming genotype-environment covariance; e.g. Kan et al., 2013). However, this idea would be almost impossible to falsify, and it would be difficult to reconcile with two facts: the correlation of decline with heritability was non-significant, and there was no decline at all for the Gf and Gv subtests, which tend to have high heritability (e.g. Kan et al., 2013; Rijsdijk, Vernon, & Boomsma, 2002; van Leeuwen, van den Berg, & Boomsma, 2008) and which would be expected to decrease before effects on Gc could be observable. There is also a lack of plausible biological mechanisms that could create such a large decline in such a short timeframe. All this converges to suggest cultural change as the most parsimonious interpretation of the data.
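For readers unfamiliar with this vector-correlation logic, the sketch below shows the computation on invented numbers; all values are placeholders chosen only to illustrate the procedure, not the estimates reported here.

```python
# Correlating each subtest's score decline with its cultural load and with
# its heritability, as described above. All values are invented placeholders.
from scipy.stats import pearsonr

#                Inf  Com  Ari  Voc  Sim  BD   MR   (hypothetical subtests)
decline       = [0.8, 0.7, 0.5, 0.6, 0.4, 0.0, 0.1]  # score drop, 1999-2019
cultural_load = [0.9, 0.8, 0.5, 0.7, 0.5, 0.2, 0.1]  # rated cultural load
heritability  = [0.6, 0.5, 0.6, 0.7, 0.6, 0.7, 0.7]  # twin-study estimates

r_load, p_load = pearsonr(decline, cultural_load)
r_h2, p_h2 = pearsonr(decline, heritability)
print(f"decline x cultural load: r = {r_load:.2f}, p = {p_load:.3f}")
print(f"decline x heritability:  r = {r_h2:.2f}, p = {p_h2:.3f}")
```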
In short, the conclusion that can be drawn from a comparison of the WAIS-III and WAIS-IV is that over the last two decades, there has been no decline of reasoning abilities in the French population, but there has been an average decrease in a limited range of cultural knowledge (essentially related to using infrequent vocabulary words, knowing the names of famous people, discussing civic education, and performing mental division), which biases performance on older items. In other words, the data do indicate a lower average performance on the WAIS-III in the more recent sample, in line with Dutton and Lynn's (2015) results, but a more fine-grained analysis contradicts their interpretation of a general decrease of intelligence in France. In the terms of a hierarchical model of intelligence (Wicherts, 2007), there appears to be no decrease in latent ability at the first level of g; there is a decrease at the second level of broad abilities, but only for Gc; and this decrease seems essentially due to cultural changes creating measurement bias at the fourth level, composed of performance on specific items.
This pattern is entirely distinct from the Flynn effect, which represents an increase in general intelligence, and especially in Gf performance, accompanied by much smaller changes on Gc (Pietschnig & Voracek, 2015). Hence it is our conviction that this pattern reflects substantially different mechanisms and cannot reasonably be labeled a “negative Flynn effect” without extending the definition of the Flynn effect to the point where any difference between cohorts could be called a “Flynn effect” and where it would no longer be useful as a heuristic concept. This point is compounded by the fact that the difference reflected item-related measurement bias rather than an actual change of ability. To quote Flynn (2009a): “Are IQ gains ‘cultural bias’? We must distinguish between cultural trends that render neutral content more familiar and cultural trends that really raise the level of cognitive skills. If the spread of the scientific ethos has made people capable of using logic to attack a wider range of problems, that is a real gain in cognitive skills. If no one has taken the trouble to update the words on a vocabulary test to eliminate those that have gone out of everyday usage, then an apparent score loss is ersatz.” The current pattern is clearly ersatz: “ersatz effect” may be a better name than “negative Flynn effect”.
There are two possible interpretations of the ersatz difference observed here. On one hand, this decline could be restricted to areas covered by the WAIS-III and could be compensated by increases in other areas: in other words, the 2019 sample may possess different knowledge, but not less knowledge, than the 1999 sample. On the other hand, this might represent a real decline and a cause for concern: results of the large-scale PISA surveys (performed on about 7,000 pupils) routinely point to significant inequalities in the academic skills of French pupils, and their average level of mathematics performance has declined since the early 2000s (e.g. OECD, 2019). It is impossible to adjudicate between these two possibilities (which would require having the 1999 sample perform the WAIS-IV), but even if there were an actual decrease in average knowledge, this conclusion would be significantly less bleak than the picture of a biologically-driven intelligence decrease painted by Dutton and Lynn (2015), and would highlight possible shortcomings of the French educational system (see also Blair, Gamson, Thorne, & Baker, 2005) rather than the downward trajectory of a population becoming less and less intelligent.
This conclusion is in line with a tradition of studies attributing fluctuations of intelligence scores to methodological biases, especially as they relate to culture-dependent item content (e.g. Beaujean & Osterlind, 2008; Beaujean & Sheng, 2010; Kaufman, 2010; Nugent, 2006; Pietschnig et al., 2013; Rodgers, 1998; Weiss et al., 2016). As an example, Flieller (1988) reached the same conclusion in a French dataset over three decades ago; Brand et al. (1989) also found a similar result of decreasing scores due to changes of item difficulty, which they illustrated with an understandable decline in the proportion of correct answers for the item “What is a belfry?” between 1961 and 1984. This conclusion is also in line with studies arguing for the role of cultural environment and culture-based knowledge in Flynn-like fluctuations of intelligence over time (e.g. Bratsberg & Rogeberg, 2018). Note that drifts of item difficulty are only one aspect of such cultural changes; changes in test-taking behavior, such as increased guessing, are another example (e.g. Must & Must, 2013; Pietschnig & Voracek, 2013).
Beyond the specific case of average intelligence in France, the current results constitute a reminder that intelligence scores are not pure reflections of intelligence: they have multiple determinants, some of which can be affected by cultural factors that do not reflect intelligence itself. Put differently, this is an illustration of the principle that performance can differ between groups of subjects without representing a true difference of ability (Beaujean & Osterlind, 2008; Beaujean & Sheng, 2010). This is a well-known bias of cross-country comparisons, where test performance can be markedly lower in a culture for which the test was not designed (e.g. Cockcroft, Alloway, Copello, & Milligan, 2015; Greenfield, 1997; Van de Vijver, 2016). Importantly, the principle generalizes to all comparisons between samples, not just intelligence fluctuations over time: investigators should be skeptical of the origin of between-group differences whenever cultural content is involved. This also applies to clinical psychologists using intelligence tests to compare patients from specific cultural groups to a (culturally different) normative sample.
Seven major recommendations for cross-sample comparisons can be derived from the current results:
1) comparisons based on validity samples collected by the publishers of Wechsler scales have to be avoided due to uncertainties about sample composition (as already stressed by Zhu & Tulsky, 1999; the distribution of ages in Study 1 as represented in Fig. 1 constitutes a stark reminder of this fact);
2) comparisons involving multiple subtests should carefully consider exactly which subtests demonstrate differences, and especially which dimension of intelligence they measure (Gf or Gc?);
3) comparisons between different samples should never be performed using different tests with substantial differences of item content, if there is a possibility that the items will be differentially affected by cultural variables extraneous to ability itself (Kaufman, 2010; Weiss et al., 2016);
4) even when the same version of a test involving cultural content is used, differences between samples collected at different dates in the same country should be treated as if the past sample were from a different country, due to the possibility of differential item functioning emerging over time;
5) as a consequence, comparisons between samples should primarily rely on tests that involve as little contribution of culture-based declarative knowledge as possible, such as Raven's matrices (e.g. Flynn, 2009b);
6) when only tests requiring culture-based declarative knowledge are available, differences should necessarily be interpreted taking into account possible measurement bias. Measurement bias can be examined through the lens of IRT, as a way to separate item parameters from ability estimates and test for DIF (see the sketch after this list), and/or using multigroup confirmatory factor analyses, as a way to more accurately specify at which level of a hierarchical model of intelligence the samples actually differ (Wicherts et al., 2004);
7) lastly, and as exemplified by the pattern of correlations between performance decline and heritability, g-loadings, and cultural load, no conclusions about the biological origin of between-group differences in test scores can be drawn without also testing the role of cultural factors.
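As a companion to recommendation 6, the following is a bare-bones sketch of how IRT separates item difficulty from ability so that difficulty drift can be compared across cohorts. It uses a Rasch model fitted by joint maximum likelihood with generic scipy tools; a real analysis would rely on a dedicated IRT package, anchor items, and formal DIF tests, and the arrays `resp_1999` and `resp_2019` are hypothetical placeholders.

```python
# Sketch of recommendation 6: fit a Rasch model to each cohort, then compare
# the estimated difficulty of each item across cohorts. Illustration only.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def fit_rasch(responses):
    """Estimate Rasch item difficulties from an (n_subjects, n_items) 0/1 array."""
    n, k = responses.shape

    def neg_loglik(params):
        theta, b = params[:n], params[n:]       # person abilities, item difficulties
        p = np.clip(expit(theta[:, None] - b[None, :]), 1e-9, 1 - 1e-9)
        ll = responses * np.log(p) + (1 - responses) * np.log(1 - p)
        return -ll.sum() + b.sum() ** 2         # soft constraint: mean difficulty ~ 0

    res = minimize(neg_loglik, np.zeros(n + k), method="L-BFGS-B")
    return res.x[n:]                            # return item difficulties only

rng = np.random.default_rng(1)
resp_1999 = rng.integers(0, 2, size=(300, 20))  # placeholder cohort data
resp_2019 = rng.integers(0, 2, size=(300, 20))
drift = fit_rasch(resp_2019) - fit_rasch(resp_1999)  # positive = harder in 2019
for j, d in enumerate(drift):
    if abs(d) > 0.5:                            # arbitrary screening threshold
        print(f"item {j}: difficulty shift = {d:+.2f} logits")
```

The point of the separation is that an item whose estimated difficulty shifts between cohorts, once ability is modeled explicitly, signals item-level bias of the kind documented above rather than a change in the latent ability itself.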