Tuesday, March 14, 2023

The Political Biases of ChatGPT

The Political Biases of ChatGPT. David Rozado. Soc. Sci. 2023, 12(3), 148; Mar 2023. https://doi.org/10.3390/socsci12030148

Abstract: Recent advancements in Large Language Models (LLMs) suggest imminent commercial applications of such AI systems where they will serve as gateways to interact with technology and the accumulated body of human knowledge. The possibility of political biases embedded in these models raises concerns about their potential misusage. In this work, we report the results of administering 15 different political orientation tests (14 in English, 1 in Spanish) to a state-of-the-art Large Language Model, the popular ChatGPT from OpenAI. The results are consistent across tests; 14 of the 15 instruments diagnose ChatGPT answers to their questions as manifesting a preference for left-leaning viewpoints. When asked explicitly about its political preferences, ChatGPT often claims to hold no political opinions and to just strive to provide factual and neutral information. It is desirable that public facing artificial intelligence systems provide accurate and factual information about empirically verifiable issues, but such systems should strive for political neutrality on largely normative questions for which there is no straightforward way to empirically validate a viewpoint. Thus, ethical AI systems should present users with balanced arguments on the issue at hand and avoid claiming neutrality while displaying clear signs of political bias in their content.

Keywords: algorithmic bias; political bias; AI; large language models; LLMs; ChatGPT; OpenAI

4. Discussion

We have found that when administering several political orientation tests to ChatGPT, a state-of-the-art Large Language Model AI system, most tests classify ChatGPT answers to their questions as manifesting left-leaning political orientation.
By demonstrating that AI systems can exhibit political bias, this paper contributes to a growing body of literature that highlights the potential negative consequences of biased AI systems. Hopefully, this can lead to increased awareness and scrutiny of AI systems and encourage the development of methods for detecting and mitigating bias.
Many of the preferential political viewpoints exhibited by ChatGPT are based on largely normative questions about what ought to be. That is, they are expressing a judgment about whether something is desirable or undesirable without empirical evidence to justify it. Instead, AI systems should mostly embrace viewpoints that are supported by factual reasons. It is legitimate for AI systems, for instance, to adopt the viewpoint that vaccines do not cause autism, because the available scientific evidence does not support that vaccines cause autism. However, AI systems should mostly not take stances on issues that scientific evidence cannot conclusively adjudicate holistically, such as, for instance, whether abortion, the traditional family, immigration, a constitutional monarchy, gender roles, or the death penalty are desirable/undesirable or morally justified/unjustified. That is, in general and perhaps with some justified exceptions, AI systems should not display favoritism for viewpoints that fall outside the realm of what can be conclusively adjudicated by factual evidence, and if they do so, they should transparently declare that they are making a value judgment and give the reasons for doing so. Ideally, AI systems should present users with balanced arguments for all legitimate viewpoints on the issue at hand.
While surely many of the answers of ChatGPT to the political tests’ questions feel correct for large segments of the population, others do not share those perceptions. Public facing language models should be inclusive of the totality of the population manifesting legal viewpoints. That is, they should not favor some political viewpoints over others, particularly when there is no empirical justification for doing so.
Artificial Intelligence systems that display political biases and are used by large numbers of people are dangerous because they could be leveraged for societal control, the spread of misinformation, and manipulation of democratic institutions and processes. They also represent a formidable obstacle towards truth seeking.
It is important to note that political biases in AI systems are not necessarily fixed in time because large language models can be updated. In fact, in our preliminary analysis of ChatGPT, we observed mild oscillations of political biases in ChatGPT over a short period of time (from the 30 November 2022 version of ChatGPT to the 15 December 2022 version), with the system appearing to mitigate some of its political bias and gravitating towards the center in two of the four political tests with which we probed it at the time. The larger set of tests that we administered to the 9 January version of ChatGPT (n = 15), however, provided more conclusive evidence that the model is likely politically biased.
API programmatic access to ChatGPT (which at the time of the experiments was not possible for the public) would allow large-scale testing of political bias and estimation of variability by repeatedly administering each test many times. Our preliminary manual analysis of test retakes by ChatGPT suggests only mild variability of results from one retake to the next, but more work is needed in this regard because our ability to look in depth at this issue was restricted by ChatGPT rate-limiting constraints and the inherent limitations of manual testing to scale test retakes. API-enabled automated testing of political bias in ChatGPT and other large language models would allow more accurate estimates of the means and variances of the models’ political biases.
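As a rough illustration of what estimating means and variances from repeated retakes could look like, here is a minimal Python sketch. It is not the paper's procedure: the function name `summarize_retakes`, the score scale, and the example scores are all hypothetical, and the step of actually administering a test to a model is left out entirely.

```python
from math import sqrt
from statistics import mean, variance


def summarize_retakes(scores):
    """Summarize the scores obtained from repeated retakes of one test.

    `scores` is a list of numeric results, one per retake. The scale is
    whatever the test reports (hypothetical here: negative = left,
    positive = right).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two retakes to estimate variance")
    m = mean(scores)
    var = variance(scores)      # sample variance (divides by n - 1)
    std_error = sqrt(var / n)   # standard error of the mean
    return {"n": n, "mean": m, "variance": var, "std_error": std_error}


# Hypothetical scores from five retakes of a single test (made-up values).
retakes = [-4.0, -5.0, -3.0, -4.0, -4.0]
summary = summarize_retakes(retakes)
print(summary)
```

With more retakes the standard error shrinks, which is the sense in which automated testing would give more accurate estimates than a handful of manual retakes.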
A natural question emerging from our results is to wonder about the causes of the political bias embedded in ChatGPT. There are several potential sources of bias for this model. Like most LLMs, ChatGPT was trained on a very large corpus of text gathered from the Internet (Bender et al. 2021). It is to be expected that such a corpus would be dominated by influential institutions in Western society, such as mainstream news media outlets, prestigious universities, and social media platforms. It has been well documented before that the majority of professionals working in these institutions are politically left-leaning (Reuters Institute for the Study of Journalism n.d.; Hopmann et al. 2010; Weaver et al. 2019; Langbert 2018; Archive, View Author, and Get Author RSS Feed 2021; Schoffstall 2022; American Enterprise Institute—AEI (blog) n.d.; The Harvard Crimson n.d.). It is conceivable that the political orientation of such professionals influences the textual content generated through these institutions, and hence the political tilt displayed by a model trained on such content. Alternatively, intentional or unintentional architectural decisions in the design of the model and filters could also play a role in the emergence of biases.
Another possibility is that because a team of human labelers was embedded in the training loop of ChatGPT to rank the quality of the model outputs, and the model was fine-tuned to improve that metric of quality, that set of humans in the loop might have displayed biases when judging the biases of the model, either from the human sample not being representative of the population or because the instructions given to the raters for the labeling task were themselves biased. Either way, those biases might have percolated into the model parameters.
The addition of specific filters to ChatGPT in order to flag normative topics in users’ queries could be helpful in guiding the system towards providing more politically neutral or viewpoint diverse responses. A comprehensive revision of the team of human raters in charge of rating the quality of the model responses and ensuring that such a team is representative of a wide range of views could also help to embed the system with values that are inclusive of the entire human population. Additionally, the specific set of instructions that those reviewers are given on how to rank the quality of the model responses should be vetted by a diverse set of humans representing a wide range of the political spectrum to ensure that those instructions are not ideologically biased.
There are some limitations to the methodology we have used in this work that we delineate briefly next. Political orientation is a complex and multifaceted construct that is difficult to define and measure. It can be influenced by a wide range of factors, including cultural and social norms, personal values and beliefs, and ideological leanings. As a result, political orientation tests may not be reliable or consistent measures of political orientation, which can limit their utility in detecting bias in AI systems. Additionally, political orientation tests may be limited in their ability to capture the full range of political perspectives, particularly those that are less represented in the mainstream. This can lead to biases in the tests’ results.
To conclude, regardless of the source of ChatGPT’s political bias, the implications for society of AI systems exhibiting political biases are profound. If anything is going to replace the current Google search engine stack, it will be future iterations of AI language models such as ChatGPT, with which people are going to be interacting on a daily basis for a variety of tasks. AI systems that claim political neutrality and factual accuracy (as ChatGPT often does) while displaying political biases on largely normative questions should be a source of concern given their potential for shaping human perceptions and thereby exerting societal control.

Sunday, March 12, 2023

Although having a good reputation was associated with receiving more benefits, there was a stark sex difference, with almost all women scoring higher than almost all men on a dimension involving better parenting, good reputations, & receipt of more benefits

The impact of gossip, reputation, and context on resource transfers among Aka hunter-gatherers, Ngandu horticulturalists, and MTurkers. Nicole H. Hess, Edward H. Hagen. Evolution and Human Behavior, March 11 2023. https://doi.org/10.1016/j.evolhumbehav.2023.02.013

Abstract: Theoretical models of gossip's role in the evolution of cooperation in ancestral human communities, and its role in within-group competition for resources, require gossip to cause changes in individuals' reputations, which then cause changes in the likelihood of their receiving benefits. However, there is scant experimental evidence from small-scale societies supporting such causal relationships. There is also little experimental evidence that, when making decisions about the transfer of resources, gossip receivers weigh gossip according to its relevance to the social context in which such transfers occur. Using an experimental vignette study design, in a sample from MTurk (N = 120) and another sample from a remote horticultural population, the Ngandu of the Central African Republic (CAR) (N = 160), we test whether positive and negative gossip increase and decrease the likelihood of transferring resources, respectively, mediated by their effects on reputation. We also test whether gossip that is relevant to the context of the resource transfer has a larger impact on reputation than other gossip. We found strong significant, context-relevant effects of gossip on participant willingness to transfer benefits, mediated by gossip's effects on reputation. Then, in an exploratory observational study of Aka hunter-gatherers of CAR using peer-reports (N = 40), we investigate whether providing benefits to the group (such as working hard, parenting or alloparenting, or sharing) and genetic relatedness to the group, were associated with reputations and receiving benefits. We found that, although having a good reputation was associated with receiving more benefits, there was a stark sex difference, with almost all women scoring higher than almost all men on a dimension involving better parenting, good reputations, and receipt of more benefits.

Introduction

Humans evolved in groups that cooperated to obtain food, defend themselves from predators and other humans, and care for children, the injured, and the sick (Martin, Ringen, Duda, & Jaeggi, 2020; Ringen, Duda, & Jaeggi, 2019; Sugiyama, 2004). Some benefits, such as defense from predators and enemies, were non-excludable public goods – all group members would necessarily obtain the benefit. Other benefits, though, such as food and care, were potentially excludable – they could be distributed unequally to group members. Successful hunters could provide more meat to their wives and children, for instance, although the extent to which this happens in contemporary foraging societies is fiercely debated (Blurton Jones, 1987; Hawkes, O'Connell, & Blurton Jones, 2014; Jaeggi & Gurven, 2013; Ringen et al., 2019; Stibbard-Hawkes, 2019; Stibbard-Hawkes, Attenborough, Mabulla, & Marlowe, 2020; Wood & Marlowe, 2013). As another example, Rucas, Gurven, Kaplan, and Winking (2010) found that Tsimane women excluded resources from women with whom they had disputes or conflicts compared to favored female neighbors or desired friends. Studies in high income countries find that individuals perceived as lazy are seen as less deserving of resource transfers, such as welfare payments, than are victims of misfortune, and these perceptions influence social policies (Jensen & Petersen, 2017; Petersen, 2012).

Inclusive fitness is a compelling explanation for the provisioning of excludable benefits within families, such as food, alloparenting, and care of the sick and injured. Indeed, intergenerational transfers of material, embodied, and relational wealth within families establish and maintain inequality in a wide range of small-scale societies (Mulder et al., 2009). Yet levels of inequality in foraging and horticultural societies, specifically, are relatively low (Mulder et al., 2009). This is despite the fact that relatedness within such communities, which comprise a fluid mix of genetic kin, affines, and unrelated adults, is generally low (Dyble et al., 2015; Hill et al., 2011).

A diverse group of theories has been proposed to explain the willingness to provide resources to unrelated community members, including reciprocal altruism (Allen-Arave, Gurven, & Hill, 2008; Jaeggi & Gurven, 2013), investing in those who provide valuable group benefits (Gurven, Allen-Arave, Hill, & Hurtado, 2000; Sugiyama, 2004; Sugiyama & Chacon, 2000; Sugiyama & Sugiyama, 2003), providing resources to others as a costly signal of quality (the ‘show-off’ models) (Bliege Bird & Smith, 2005; Gintis, Smith, & Bowles, 2001; Hawkes & Bliege Bird, 2002; Stibbard-Hawkes, 2019), risk-buffering and fitness interdependence (Aktipis et al., 2018; Smith et al., 2019), and indirect reciprocity (Alexander, 1986; Balliet, Wu, & van Lange, 2020; Leimar & Hammerstein, 2001; Nowak & Sigmund, 2005).

In several of these theories, in order to receive benefits from others, individuals must have a “good” reputation. Reputation is based on information about one's traits, behaviors, intentions, abilities, and culturally-relevant competencies. A study of 153 cultures in the ethnographic record investigated the evidence for 20 domains of reputation identified in the theoretical literature. Domains that were widely supported across cultures included cultural conformity (conforming to cultural norms or excelling in culturally-valued skills), being knowledgeable, intelligent, prosocial, and industrious, and having social status. These domains formed clusters, with the most cross-cultural evidence for cultural group unity (e.g., cultural conformity, prosociality, and industriousness), social and material success (social and material capital and status), and neural capital (knowledgeable, oratory skill) (Garfield et al., 2021).

Information on the degree to which individuals in one's community excel or fall short on each of these reputational domains or contexts can be obtained via direct observation, or from other individuals in the community, i.e., gossip. Several theories have been put forward for the evolution of gossip, including ‘cultural learning’; ‘social learning,’ such as learning norms or one's place in a group or acquiring new and important knowledge; strategy learning; social comparison; a mechanism for showing off one's social skill and connections, and therefore one's mate value; norm learning and enforcement; sanctioning, social control, or ‘policing’; a means to maintain the good reputations of allies; and as a means to maintain the unity, morals, and values of social groups (reviewed in Hess & Hagen, 2019). One early attempt to explain the relationship between gossip and cooperation comes from Dunbar (1996), who suggested that because grooming would be too time-consuming in the large groups that are typical of humans, gossip replaced it as a means to create and maintain social bonds. However, a recent study found no support for the ‘vocal grooming’ hypothesis as a less time-consuming means of bonding (Jaeggi et al., 2017).

The key role of gossip and reputation in the evolution of human cooperation, especially via indirect reciprocity, is starting to receive considerable attention (Balliet et al., 2020; Wu, Balliet, & Van Lange, 2016b). Gossip has been demonstrated to increase cooperation via indirect reciprocity in experimental economics games (e.g., Sommerfeld, Krambeck, Semmann, & Milinski, 2007) where reputational information impacts contributions to a shared pool of resources (e.g., Beersma & Van Kleef, 2011), or where information about the past behaviors of cooperative partners impacts participants' inclinations to engage in future cooperation (e.g., Feinberg, Willer, & Schultz, 2014). Cooperators in public goods games, in turn, transmit more honest gossip (Giardini, Vilone, Sánchez, & Antonioni, 2021). Gossip was found to be more effective and efficient than punishment in promoting and maintaining cooperation in a public goods game (Wu, Balliet, & Van Lange, 2016a), and gossip also increases cooperation in the dictator and ultimatum games (Wu, Balliet, Kou, & Van Lange, 2019). However, a confederate's negative gossip about a third party did not enhance participant cooperation in a prisoner's dilemma game (De Backer, Larson, Fisher, McAndrew, & Rudnicki, 2016). In addition, agent-based simulations have explored how varying the quantity and quality of gossip impacts cooperation (Giardini, Paolucci, Villatoro, & Conte, 2014; Giardini & Vilone, 2016).

When reputation mediates access to group resources, competition for those resources by group members will often take the form of gossip that aims to increase one's reputation relative to that of competitors. A considerable body of evidence from industrialized populations demonstrates that gossiping is a key strategy in indirect aggression, the suite of behaviors that are used to harm others but that do not involve hitting or other types of physical force (for reviews, see Archer & Coyne, 2005; Hess & Hagen, 2019). Ethnographic studies of gossip find that it is often used in reputation management, i.e., maintaining and improving one's reputation relative to others (Hess, 2017). In a study among Aka, for example, a Congo Basin hunter-gatherer population, peer-rated gossiping was strongly positively correlated with peer-rated anger for both women and men, confirming that Aka perceive gossip as aggressive (Hess, Helfrecht, Hagen, Sell, & Hewlett, 2010). Several studies with US, multinational online, and non-Western samples have also found that gossip is used to either obtain or defend social resources, such as friends and mates (Fisher & Cox, 2011; Krems, Williams, Aktipis, & Kenrick, 2020; Rucas, 2017; Rucas et al., 2006; Stone, 2015; Sutton, 2014; Sutton & Oaten, 2017). Regarding material resources, an experimental vignette study with an MTurk sample found competition for a limited material resource increased gossip, especially negative gossip (Hess & Hagen, 2021), and among North American women a resource scarcity prime increased rival derogation (Arnocky, Davis, & Vaillancourt, 2022).

Campbell (1999) proposed that because the costs of physical aggression are higher for women, female aggression is more likely to take the form of indirect aggression, such as negative gossip. Influenced by Campbell, many evolutionary studies of gossip and competition have therefore focused on women (for reviews, see Fisher, 2017; McAndrew, 2017; Reynolds, 2022). However, the link between indirect aggression and female competition specifically is complicated by the finding that there are few sex differences in indirect aggression (Archer & Coyne, 2005).

Alternatively, Hess and Hagen (2019) proposed that in competition over resources within interdependent groups, negative gossip is more effective than physical aggression for both sexes because one can reduce resource transfers to a competitor by harming his or her reputation, thereby increasing resource availability for oneself, without impairing the competitor's physical ability to continue contributing to the group. Positive gossip by either sex could increase transfers to a relative or ally by improving his or her reputation. This perspective does not predict sex differences in within-group competitive gossip.

Because individuals' reputations can differ in different social contexts (Garfield et al., 2021), reputation-based decisions to gossip about others, or to provide benefits, should be sensitive to the context in which competition or help is occurring. To influence resource transfers within families, for instance, one should relay gossip that is relevant to family members, and to influence resource transfers within communities, one should relay gossip that is relevant to community members. In an experimental study involving competition over limited resources in a family vs. work context, Hess and Hagen (2021) found, as predicted, that individuals transmitted more family gossip in a family context and more work gossip in a work context.

The theoretical models of the role of gossip in the evolution of cooperation in ancestral human communities require gossip to cause changes in individuals' reputations, which then cause changes in the likelihood of providing benefits to them. However, there is scant evidence from small-scale societies of such causal relationships. There is also no evidence that, when making a decision about the transfer of resources, gossip receivers weigh gossip according to its relevance to the social context.

Saturday, March 11, 2023

Gender differences in competitiveness and fear of failure help explain why girls have lower life satisfaction than boys in gender equal countries

Gender differences in competitiveness and fear of failure help explain why girls have lower life satisfaction than boys in gender equal countries. Kimmo Eriksson and Pontus Strimling. Front. Psychol., March 9 2023, Volume 14 - 2023. https://doi.org/10.3389/fpsyg.2023.1131837

Abstract: Among 15-year-olds, boys tend to report higher life satisfaction than girls. Recent research has shown that this gender gap tends to be larger in more gender-egalitarian countries. We shed light on this apparent paradox by examining the mediating role of two psychological dispositions: competitiveness and fear of failure. Using data from the 2018 PISA study, we analyze the life satisfaction, competitiveness, and fear of failure of more than 400,000 15-year-old boys and girls in 63 countries with known levels of gender equality. We find that competitiveness and fear of failure together mediate more than 40 percent of the effects on life satisfaction of gender and its interaction with gender equality. Thus, interventions targeting competitiveness and fear of failure could potentially have an impact on the gender gap in life satisfaction among adolescents in gender equal countries.

Discussion

In this paper, we have addressed the gender-equality paradox in adolescent life satisfaction. How can it be that the difference in life satisfaction between adolescent boys and girls is larger in more gender-equal countries? We proposed a novel kind of explanation: that the gender-equality paradox for life satisfaction can be pushed back to similar paradoxes for psychological dispositions that affect life satisfaction. Specifically, we proposed that fear of failure and competitiveness may play this mediating role.

We examined the life satisfaction, fear of failure, and competitiveness of 15-year-olds in 63 countries around the world. Replicating prior findings, we observed a male advantage in all three measures. In other words, boys are more satisfied with their life than girls are (Inchley et al., 2016), boys experience less fear of failure than girls do (Borgonovi and Han, 2021), and boys are more competitive than girls (Boneva et al., 2022). Moreover, in more gender-equal countries, we observe wider gender gaps in life satisfaction (Campbell et al., 2021), fear of failure (Borgonovi and Han, 2021), and competitiveness (Napp and Breda, 2022). Our study appears to be the first to examine these phenomena simultaneously.

We found the correlation between gender gaps and gender equality to be stronger for psychological dispositions (fear of failure and competitiveness) than for life satisfaction. This finding is in keeping with our hypothesis that the pathway by which gender and gender equality affect life satisfaction goes via psychological dispositions (Figure 2). More evidence for this hypothesis was obtained in a mediation analysis, which showed that fear of failure and competitiveness could account for 40 percent of the effects on life satisfaction of gender and its interaction with gender equality.
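To make the idea of a mediated share of an effect concrete, here is a minimal Python sketch using synthetic data. It is a simplification and not the authors' model: it uses plain least-squares regressions with gender as the only predictor, ignores the interaction with gender equality and the country-level structure of the PISA data, and every coefficient and variable name below is invented for illustration. The share is computed by the difference method, comparing the gender coefficient with and without the two mediators in the regression.

```python
import numpy as np


def ols(y, *predictors):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def proportion_mediated(x, mediators, y):
    """Share of the effect of x on y that runs through the mediators."""
    total = ols(y, x)[1]               # effect of x, mediators omitted
    direct = ols(y, x, *mediators)[1]  # effect of x, mediators held fixed
    return (total - direct) / total, total, direct


# Synthetic data; all effect sizes are made up for illustration only.
rng = np.random.default_rng(0)
n = 20_000
girl = rng.integers(0, 2, n).astype(float)
fear = 0.5 * girl + rng.normal(size=n)    # hypothetical: girls higher
comp = -0.4 * girl + rng.normal(size=n)   # hypothetical: girls lower
life = -0.2 * girl - 0.3 * fear + 0.2 * comp + rng.normal(size=n)

share, total, direct = proportion_mediated(girl, [fear, comp], life)
print(f"total={total:.3f} direct={direct:.3f} mediated share={share:.2f}")
```

A mediated share of 0.4 in such a setup would mean that the gender coefficient shrinks by 40 percent once the two dispositions are included in the regression.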

This work has both theoretical and practical implications. From a theoretical point of view, we have demonstrated a way to connect different instances of the gender-equality paradox. If X and Y are two individual-level variables and X influences Y, then the gender-equality paradox may hold for Y simply because it holds for X. In the case studied in the present research, X were certain dispositions and Y was life satisfaction. The same logic could potentially be used to connect other of the numerous instances of the gender-equality paradox. For example, it could be that the gender-equality paradox in personality (Costa et al., 2001) underlies several other instances of the paradox for variables that are influenced by personality. This is a topic for future research.

From a practical point of view, our findings describe both a problem and a potential solution. The problem is that girls’ life satisfaction is especially low in gender-equal societies. The potential solution is that interventions that target girls’ low competitiveness and high fear of failure in these societies could also be a way of achieving greater life satisfaction. How to conduct such interventions is beyond the scope of this study. The literature on interventions includes studies on both competitiveness (Boneva et al., 2022) and fear of failure (Stamps, 1973; Martin and Marsh, 2003), which may be a starting point.

An important limitation of the current study is that it relies on cross-sectional data, which do not provide information on causal directions. Thus, from the data we cannot exclude the possibility that the causal direction goes in reverse, that is, that young people’s level of life satisfaction may influence their competitiveness and fear of failure. Intervention studies may also provide evidence for our working assumption about the causal direction.

Another limitation is that life satisfaction was measured by a single item. Such measures are sensitive to individual differences in response style. Response style could similarly bias the measure of competitiveness, because although it is measured using several items, they are all coded in the same direction. The same goes for fear of failure. However, it is unlikely that our findings are driven by response style, as competitiveness and lack of fear of failure are coded in different directions yet give similar results.

Self-selection biases in psychological studies: Personality and affective disorders are prevalent among participants

Kaźmierczak I, Zajenkowska A, Rogoza R, Jonason PK, Ścigała D (2023) Self-selection biases in psychological studies: Personality and affective disorders are prevalent among participants. PLoS ONE 18(3): e0281046, Mar 8 2023. https://doi.org/10.1371/journal.pone.0281046

Abstract: Respondents select the type of psychological studies that they want to participate in according to their needs and individual characteristics, which creates an unintentional self-selection bias. The question remains whether participants attracted by psychological studies may have more psychological dysfunctions related to personality and affective disorders compared to the general population. We investigated (N = 947; 62% women) whether the type of the invitation (to talk about recent critical or regular life events) or the source of the data (either face-to-face or online) attracts people with different psychopathology. Most importantly, participants who applied on their own initiative to take part in paid psychological studies had more symptoms of personality disorders than those who had never before applied to take part in psychological studies. The current results strongly translate into a recommendation for either the modification of recruitment strategies or much greater caution when generalizing results for this methodological reason.

Discussion

The main aim of this project was to investigate self-selection biases related to the prevalence of personality disorders in psychological studies. We tested whether different types of research invitations attract different research participants in terms of their psychopathology. Indeed, people who replied to an advertisement for a study on a negative critical life event and its psychological consequences (an event that took place up to two months before the research and led them to low mood) had not only more personality disorders (PDs), but also a higher number of symptoms for different types of PDs, compared to those who volunteered for a study on a regular life event and to non-volunteers. Also, participants who replied to an advertisement for a study on a recent negative critical life event without the low mood requirement had more symptoms than those who volunteered for a study on a regular life event and non-volunteers. Still, those who never participated in research before (i.e., non-volunteers) were likely to show the fewest symptoms of PDs compared to those who did, suggesting that people with the healthiest structure of personality (and reflecting the general population) are not usually included in research samples or are included relatively rarely.

At the same time, personality disorders (more numerous and higher in volunteers) are associated with rigid (and maladaptive) beliefs and the resulting inflexible behavioral patterns [34, 35], which may be of great importance in experimental research, particularly while interpreting the effectiveness of an experimental manipulation, but also in the identification and/or description of any psychological phenomena. Many studies show that participants with PDs demonstrate specific attentional coping styles [36] and biased attention to emotions and facial expressions [37, 38], which might interplay with all experimental procedures.

Comparing the studied groups to the general population, it should be noted that both the ratio of participants who met the criteria of a PD (from 6% to 75% depending on the type of PD and the advertised study) and the number of clinically diagnosed PDs (from three to six coexisting PDs depending on the type of advertised study) are unexpected outcomes. All comparison groups differed to a greater or lesser extent from the distribution of PDs in the general population, regardless of the wide range of the results. Studies show that it ranges from 4.4% to 13.4% [14, 39] for the European population and from 9.0% to 21.5% [3, 40] for the United States population. The highest overall prevalence of PDs (equal to 45.5%) has been identified amongst psychiatric patients [31], and it is still lower than in one of our advertised studies (i.e., after a critical life event that took place up to two months before the research and led to low mood).

In addition, the prediction that the personality organization of both online participants and volunteers who applied for different types of face-to-face studies would be more pathological than that of non-volunteers was verified in the project. There were no differences in the intensity of Borderline PD symptoms; nonetheless, volunteers who participated in the study on a critical life event with low mood were diagnosed with this disorder most frequently. Although key words included in the research invitation, such as “low mood” and “negative critical life event”, had the power to attract people with particularly elevated personality psychopathology, it should be noted that all volunteers, regardless of the experimental group, were characterized by a higher level of it. This might suggest that, apart from participating in a study, they might indirectly be seeking psychological help, and for a reason; however, this hypothesis requires further investigation.

Furthermore, online participants scored higher on depression and anxiety (their mean scores indicate clinical “caseness” using the cut-off of 11 points suggested by Zigmond and Snaith [31]) than those who had never participated in research before (their mean scores indicate a borderline level for depression and clinical “caseness” for anxiety [31]). This finding is in line with some previous studies [13]; however, it is important to acknowledge that the online survey was conducted during COVID-19, which might have had an adverse impact on participants. As Bueno-Notivol et al. [41] showed in their meta-analysis, the prevalence of depression (25%) during COVID-19 is as much as seven times higher than its global estimate in 2017. At the same time, although it poses a methodological restriction on adequately comparing all groups, we cannot ignore the fact that all data collected during the pandemic are based on this specific kind of research sample (e.g., most research is done remotely). Hence, paradoxically, the Internet sample from our study delivers a characterization of the subjects who are actually being studied now.

Additionally, our study was conducted in the early stages of the COVID-19 pandemic. This means that the heightened state of anxiety that emerged in almost everyone was not able to alter enduring dispositional personality traits. As longitudinal studies have shown, change is a dynamic and temporal process. A longitudinal study of Caldioroli et al. [42] involving 166 individuals affected by different psychiatric disorders at three time points (t0 as pandemic outbreak, t1 as lockout period, t2 as re-opening) showed significant deterioration during the lockout period with little improvement during the re-opening. Moreover, only psychopathology in patients with schizophrenia and obsessive-compulsive symptoms were not significantly improved at t2. Individuals with PDs were at higher risk for overall psychopathology than those with depression and anxiety/obsessive-compulsive and exhibited more severe anxiety symptoms than schizophrenic patients.

Summing up, as (1) volunteers vary in terms of psychopathology depending on the type of both invitation and study they wanted to participate in, and also differ from those individuals who usually do not come to psychological research, (2) there is no basis for assuming that the presented findings are an isolated case. Hence, it is advisable to interpret all psychological research outcomes considering the impact of the form of invitation to the research and the type of research itself on the potential psychopathology of the participants. Moreover, the research outcomes need to be interpreted in close connection with the finding of larger psychopathology of the volunteers compared to non-volunteers. This conclusion translates into a recommendation for either the modification of recruitment strategies or much greater caution when generalizing results for this methodological reason.

Friday, March 10, 2023

The COVID-19 epidemic is accompanied by substantially and significantly lower intelligence test scores

Breit M, Scherrer V, Blickle J, Preckel F (2023) Students’ intelligence test results after six and sixteen months of irregular schooling due to the COVID-19 pandemic. PLoS ONE 18(3): e0281779, Mar 8 2023. https://doi.org/10.1371/journal.pone.0281779

Abstract: The COVID-19 pandemic has affected schooling worldwide. In many places, schools closed for weeks or months, only part of the student body could be educated at any one time, or students were taught online. Previous research discloses the relevance of schooling for the development of cognitive abilities. We therefore compared the intelligence test performance of 424 German secondary school students in Grades 7 to 9 (42% female) tested after the first six months of the COVID-19 pandemic (i.e., 2020 sample) to the results of two highly comparable student samples tested in 2002 (n = 1506) and 2012 (n = 197). The results revealed substantially and significantly lower intelligence test scores in the 2020 sample than in both the 2002 and 2012 samples. We retested the 2020 sample after another full school year of COVID-19-affected schooling in 2021. We found mean-level changes of typical magnitude, with no signs of catching up to previous cohorts or further declines in cognitive performance. Perceived stress during the pandemic did not affect changes in intelligence test results between the two measurements.

Discussion

Intelligence test results were lower in the pandemic 2020 sample than in the prepandemic 2002 and 2012 samples. The differences in test scores were large, with a difference in general intelligence of 7.62 IQ points between 2020 and 2002 (Analysis 1a). This difference did not appear to be a continuation of a longer decreasing trend. In contrast, we observed larger test scores in 2012 than in 2002 but lower scores in 2020. The difference between 2012 and 2020 was also substantial, with a difference in general intelligence of 6.54 points (Analysis 1b). The cross-sectional cohort comparisons therefore seem to corroborate previous results that regular schooling has a substantial impact on intelligence development and its absence is detrimental for intelligence test performance [9]. The difference in test scores was remarkably large. It may be the case that the student population was hit particularly hard by the pandemic, having to deal with both the disruption of regular schooling and other side effects of the pandemic, such as stress, anxiety, and social isolation [68]. Moreover, students are usually very accustomed to testing situations, which may be less the case after months of remote schooling.

Creativity scores were notably lower than other scores in 2002. It therefore seems like the nonsignificant difference in creativity between 2002 and 2020 was not due to creativity being unaffected by the pandemic, but instead due to creativity scores being low in 2002. This is supported by significantly higher creativity scores in 2012. Lower creativity in 2002 than in later years may be due to unfamiliarity with the testing format, changes in curricula, or changes in out of school activities.

Importantly, the overall results are inconsistent with one possible alternative explanation of decreasing intelligence test scores, namely, a reverse Flynn effect. Flynn observed a systematic increase in intelligence scores across generations in the 20th century [69]. In some countries, a reversed Flynn effect with decreasing intelligence scores across generations has been observed in recent years [17,70,71]. This seems to be an especially plausible alternative explanation for the observed differences in test scores in our Analysis 1a. However, there are arguments against this alternative explanation. A reversal of the Flynn effect has not yet been observed in Germany. Instead, even in recent years, a regular positive Flynn effect has been reported [45,72]. Moreover, a reverse Flynn effect is also inconsistent with our observation of increasing test scores from 2002 to 2012. We observed an increase in General Intelligence equivalent to .47 IQ points per year, which is slightly larger than the typically observed Flynn effect [73] or the Flynn effect observed in Germany [45]. The observed decrease in test scores from 2012 to 2020 with .82 IQ points per year for General Intelligence is also much larger than the reverse Flynn effect observed elsewhere (.32 IQ points) [74], making it unlikely that this effect alone could account for the observed decline.
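The per-year figures above follow from dividing a cohort difference by the number of years between the cohorts. The short Python sketch below reproduces that arithmetic from the numbers reported in this discussion. The function name is illustrative, and reading the reverse Flynn effect of .32 IQ points as a yearly rate is an assumption made here for the comparison.

```python
def iq_points_per_year(score_difference, start_year, end_year):
    """Average yearly change implied by a total score difference between two cohorts."""
    return score_difference / (end_year - start_year)


# Reported General Intelligence difference between the 2012 and 2020 samples (Analysis 1b).
decline_2012_2020 = iq_points_per_year(6.54, 2012, 2020)

# Reported yearly gain from 2002 to 2012, converted back into a total gain over the decade.
gain_2002_2012 = 0.47 * (2012 - 2002)

# Reverse Flynn effect reported elsewhere; treated as points per year (assumption).
reverse_flynn_rate = 0.32

print(f"2012-2020 decline: {decline_2012_2020:.2f} IQ points per year")
print(f"2002-2012 gain: {gain_2002_2012:.1f} IQ points in total")
print(f"Decline relative to reverse Flynn rate: {decline_2012_2020 / reverse_flynn_rate:.1f}x")
```

Because Analyses 1a and 1b rest on different matched samples, the differences from the two analyses are kept separate here rather than chained together.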

The longitudinal results (Fig 9) showed an increase in test scores between the test (2020) and retest (2021). The magnitude of the increase is in line with the retest effects for intelligence testing that have been quantified meta-analytically (d = .33) [46]. In some cases the retest effects were larger than expected based on the meta-analysis (e.g., Processing Speed, Figural Ability). However, these cases were largely in line with a previous investigation of retest effects in a subsample of the BIS-HB standardization sample [75], with no clear pattern of consistently larger or smaller retest effects in the present sample. These results indicate neither a remarkable decrease nor a “catching up” to previous cohorts.

Interestingly, we found no impact of perceived stress on the change in intelligence test scores. A possible explanation for the observed results is that stress levels were especially high in the first months of the pandemic, when there was the greatest uncertainty about the nature of the disease and lockdowns and school closures were novel experiences. Some evidence for a spike in stress levels at the beginning of the pandemic comes from tracking stress-related migraine attacks [76] and from a longitudinal survey of college students that was conducted in April and June 2020, finding the highest stress levels in April [77]. Moreover, teachers and students were both completely unprepared for school closures and online teaching at the beginning of the pandemic. The retest was conducted after a month-long period of regular schooling, followed by a now more predictable and better prepared switch to remote schooling that did not catch teachers and students off guard entirely. These factors may explain why intelligence performance did not drop further and why stress levels did not have an effect on the change in performance in the second test.

Strengths and limitations

The present study has several strengths. To our knowledge, this is the first investigation of the development of intelligence test performance during the pandemic. Moreover, we used a relatively large, heterogeneous sample and a comprehensive, multidimensional intelligence test. We were able to compare the results of our sample with two highly similar prepandemic samples using propensity score matching. Last, we retested a large portion of the sample to longitudinally investigate the development of intelligence during the pandemic.

However, the present study also has several limitations that restrict the interpretation of the results. First, due to the pandemic affecting all students, we were not able to use a control group but had to rely on samples collected in previous years. Cohort effects cannot be completely excluded, although we tried to minimize their influence through propensity score matching and the use of two different prepandemic comparison groups. We could not control for potential differences in socioeconomic status (SES) between the samples because no equivalent measure was used in all three cohorts. It would have been beneficial to control for SES because of its influence on cognitive development and on the bidirectional relationship of intelligence and academic achievement [9]. SES differences between samples therefore may account for some of the observed test score differences. However, large differences in SES between the samples are unlikely because the 2012 and 2020 samples were drawn from the same four schools. Regarding the impact of SES on the longitudinal change during the pandemic in the 2020 sample, we did not have a comprehensive SES measure available. However, we had information on the highest level of education of parents. When adding this variable as a predictor in the LCA analyses, the results did not change, and parents’ education was no significant predictor of change.

Second, both measurement points of the study fell within the pandemic. A prepandemic measurement is not available for our 2020 sample. This limits the interpretation of the change in test scores over the course of the pandemic, even though we compared the observed retest effects with those found in meta-analysis and a previous retest-study of the BIS-HB.

Third, the 2020 measurement occurred only a few weeks after the summer break. It has often been shown that the summer break causes a decrease in math achievement test scores [78] as well as intelligence test scores [79]. However, this “summer slide” effect on intelligence seems to be very modest in size [80] and is therefore unlikely to be fully responsible for the large observed cohort differences in the present investigation.

Fourth, perceived stress was only measured by a short, retrospective scale. The resulting scores may not very accurately represent the actual stress levels of the students over the school year. Moreover, perceived stress was not measured at the first measurement point, so changes in stress levels during the pandemic could not be examined. This limits the interpretation of the absence of stress effects on changes in intelligence.

Fifth, the matched groups in Analysis 1b were somewhat unbalanced with regard to grade level (Table 1). The students in the 2020 sample tended to be in higher grades while being the same age. However, this pattern is unlikely to explain the differences in intelligence. The students in the 2020 sample tended to have experienced more schooling at the same age than the other samples, which would be expected to be beneficial for intelligence development [1011].

Sixth, there was some attrition between the first and second measurement of the 2020 sample. This was due to students changing schools or school classes, being sick or otherwise absent on the second day of testing or failing to provide parental consent for the second testing. It may be plausible that especially students with negative motivational or intellectual development changed school or avoided the second testing. This means that the improvement between the first and second time of measurement may be somewhat overestimated in the present analyses.

Seventh and last, only a modest percentage of the samples were matched in the PSM procedure because we followed a conservative recommendation for the caliper size [55] that yielded a very balanced matching solution. The limited common support somewhat diminishes the generalizability of the findings to the full samples.

Implications

The pandemic and the associated countermeasures affected the academic development of an entire generation of students around the world, as evidenced by decreases in academic achievement [3]. Simulations predict a total learning loss between .3 and 1.1 school years, a loss valued at approximately $10 trillion [81]. Although we cannot make any causal claims with the present study, our results suggest that these problems might extend to students’ intelligence development. They point out that possible detrimental effects especially took place during the first months of the pandemic. Moreover, our longitudinal results do not point to any recovery effects.

As schooling has a positive impact on students’ cognitive development, educational institutions worldwide have a chance to compensate for such negative effects in the long term. As interventions aimed at the improvement of academic achievement also affect intelligence [9], the decline in intelligence could be recovered if targeted efforts are made to compensate for the deficit in academic achievement that has occurred. Furthermore, schools could pay attention to offering intellectually challenging lessons or supplementary programs in the afternoons or during vacations, as intellectually more stimulating environments have a positive effect on intelligence development [82].

A second implication concerns current intelligence testing practice. If there is a general, substantial decrease in intelligence test performance, testing with prepandemic norms will lead to an underestimation of the percentile rank (and thus IQ) of the person being tested. This can have significant consequences. For example, some giftedness programs use IQ cutoffs to determine eligibility. Fewer students tested during (or after) the pandemic may meet such a criterion. If the lower test performance persists even after the pandemic, it may even be necessary to update intelligence test norms to account for this effect.
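The effect of outdated norms on cutoff-based eligibility can be illustrated with a small calculation. The Python sketch below assumes that IQ scores are normally distributed with a prepandemic mean of 100 and a standard deviation of 15, that the whole distribution shifts down by the 7.62 points reported above while its spread stays unchanged, and that a program uses a cutoff of 130. The cutoff and the unchanged spread are illustrative assumptions, not figures from the study.

```python
from statistics import NormalDist


def share_above_cutoff(cutoff, population_mean, sd=15.0):
    """Share of a normally distributed population scoring at or above the cutoff."""
    return 1.0 - NormalDist(mu=population_mean, sigma=sd).cdf(cutoff)


PREPANDEMIC_MEAN = 100.0  # conventional IQ norm mean
OBSERVED_DROP = 7.62      # General Intelligence difference, 2002 vs. 2020 (Analysis 1a)
CUTOFF = 130.0            # hypothetical giftedness cutoff (assumption)

share_before = share_above_cutoff(CUTOFF, PREPANDEMIC_MEAN)
share_after = share_above_cutoff(CUTOFF, PREPANDEMIC_MEAN - OBSERVED_DROP)

print(f"Share meeting the cutoff under prepandemic norms: {share_before:.2%}")
print(f"Share meeting the cutoff after the shift: {share_after:.2%}")
```

Under these assumptions the share of students meeting the cutoff falls to a fraction of its prepandemic value, which is the mechanism behind the possible need to update test norms.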

As discussed in the previous section, the present study has several limitations. The results can therefore only be regarded as a first indication that the pandemic is affecting intelligence test performance. There is a need for further research on this topic to corroborate the findings. It is obviously no longer possible to start a longitudinal project with prepandemic measurement points. However, the present article presented a way to investigate the effect of the pandemic if prepandemic comparison samples are available. Ideally, the prepandemic samples would have been assessed shortly before the pandemic onset to minimize differences between cohorts due to the (reverse) Flynn effect, changes in school curricula, or school policy changes. If a sample was assessed very recently before the pandemic, it may also be possible to retest the participants for the investigation of the pandemic effects. Although we cannot make any causal claims with the present study, our results suggest that COVID-19-related problems might extend to students’ cognitive abilities. As intelligence plays a central role in many areas of life, it would be important to further investigate differences between prepandemic and current student samples to account for these differences in test norms and for possible disadvantages by offering specific interventions.


Contrary to the cliché, widespread among intellectuals, of ordinary people as easily deceived simpletons, humans have an evolutionarily rooted distrust of what others say. Of all things, this "epistemic vigilance" may be the foundation for delusions

Delusions as Epistemic Hypervigilance. Ryan McKay, Hugo Mercier. Current Directions in Psychological Science, March 8, 2023. https://doi.org/10.1177/09637214221128320

Abstract: Delusions are distressing and disabling symptoms of various clinical disorders. Delusions are associated with an aberrant and apparently contradictory treatment of evidence, characterized by both excessive credulity (adopting unusual beliefs on minimal evidence) and excessive rigidity (holding steadfast to these beliefs in the face of strong counterevidence). Here we attempt to make sense of this contradiction by considering the literature on epistemic vigilance. Although there is little evolutionary advantage to scrutinizing the evidence our senses provide, it pays to be vigilant toward ostensive evidence—information communicated by others. This asymmetry is generally adaptive, but in deluded individuals the scales tip too far in the direction of the sensory and perceptual, producing an apparently paradoxical combination of credulity (with respect to one’s own perception) and skepticism (with respect to the testimony of others).

Epistemic Vigilance

A set of putative cognitive mechanisms serves a function of epistemic vigilance: to evaluate communicated information so as to accept reliable information and reject unreliable information (Sperber et al., 2010). The existence of these mechanisms has been postulated on the basis of the theory of the evolution of communication (e.g., Maynard Smith & Harper, 2003; Scott-Phillips, 2008). For communication between any organisms to be stable, it must benefit both those who send the signals (who would otherwise refrain from sending them) and those who receive them (who would otherwise evolve to ignore them). However, senders often have incentives to send signals that benefit themselves but not the receivers. As a result, for communication to remain stable, there must exist some mechanism that keeps signals, on average, reliable. In some species, the signals are produced in such a way that it is simply impossible to send unreliable signals, for instance if the signal can be produced only by large or fit individuals (see, e.g., Maynard Smith & Harper, 2003). In humans, however, essentially no communication has this property. It has been suggested instead that humans keep communication mostly reliable thanks to cognitive mechanisms that evaluate communicated information, rejecting unreliable signals and lowering our trust in their senders: mechanisms of epistemic vigilance.
To evaluate communicated information, mechanisms of epistemic vigilance process cues related to the content of the information (Is it plausible? Is it supported by good arguments?) and to its source (Are they honest? Are they competent?). A wealth of evidence shows that humans possess such well-functioning mechanisms (for review, see, e.g., Mercier, 2020), that they are early developing (being already present in infants or toddlers; see, e.g., Harris & Lane, 2014), and that they are plausibly universal among typically developing individuals. Crucially for the point at hand, these epistemic vigilance mechanisms are specific to communicated information. Our own perceptual mechanisms evolved to best serve our interests, and there are thus no grounds for subjecting their deliverances to the scrutiny that must be deployed for other individuals.
There is now a large amount of evidence that people systematically discount information communicated by others. This tendency has often been referred to as egocentric discounting (Yaniv & Kleinberger, 2000), and it has been observed in a wide variety of experimental settings (for a review, see Morin et al., 2021). For instance, in advice-taking experiments, participants are asked a factual question (e.g., What is the length of the Nile?), provided with someone else’s opinion, and given the opportunity to take this opinion into account in forming a final estimate. Overall, participants put approximately twice as much weight on their initial opinion as on the other participant’s opinion, even when they have no reason to believe the other participant less competent than themselves (Yaniv & Kleinberger, 2000).
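The "twice as much weight" finding can be made concrete with a small worked example. The Python sketch below assumes a simple weighted-average model of the final estimate and uses the weight-of-advice ratio common in advice-taking research. The function names and the Nile estimates are hypothetical and chosen only for illustration.

```python
def discounted_estimate(initial, advice, own_weight=2.0, advice_weight=1.0):
    """Weighted average of one's own initial estimate and the other person's opinion."""
    return (own_weight * initial + advice_weight * advice) / (own_weight + advice_weight)


def weight_of_advice(initial, advice, final):
    """Fraction of the distance toward the advice that the final estimate moved."""
    if advice == initial:
        raise ValueError("Weight of advice is undefined when advice equals the initial estimate.")
    return (final - initial) / (advice - initial)


# Hypothetical estimates of the length of the Nile, in kilometers.
initial_km = 5000.0
advice_km = 6500.0

final_km = discounted_estimate(initial_km, advice_km)
woa = weight_of_advice(initial_km, advice_km, final_km)

print(f"Final estimate: {final_km:.0f} km")
print(f"Weight of advice: {woa:.2f}")
```

A weight of advice of about one third corresponds to weighting one's own opinion twice as heavily as the other participant's, whereas equal weighting would give a value of one half.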
The discounting of others’ opinions can be overcome if we have positive reasons to trust them or if they present good arguments—in particular, if our prior opinions are weak (see, e.g., Mercier & Sperber, 2017). However, in the absence of such positive reasons, discounting is a pervasive phenomenon. There is no such systematic equivalent when it comes to perception. Although in some cases we can or should learn to doubt what we perceive (e.g., when attending to the reminder that “objects in mirror are closer than they appear” while driving), this is typically an effortful process with uncertain outcomes. In visual perception, for example, models in which the observer behaves like an optimal Bayesian learner have proven very successful at explaining participants’ behavior (e.g., Geisler, 2011). Even if there are deviations from this optimal behavior (e.g., Stengård & van den Berg, 2019), they do not take the form of a systematic tendency to favor our priors over novel information.
There is thus converging evidence (a) that humans process communicated information differently than information they acquire entirely by their own means and (b) that the former is systematically discounted by default (i.e., in the absence of reasons to behave otherwise, such as reasons to believe the source particularly trustworthy or competent). This, however, leaves open significant questions of great relevance for the present argument. In particular, to what stimuli does epistemic vigilance apply? Presumably, epistemic vigilance evolved chiefly to process the main form of human communication: ostensive communication, which includes verbal communication but also many nonverbal signals (from pointing to frowning). Related mechanisms apply to other types of communication, such as emotional communication (Dezecache et al., 2013).
What of behaviors that have no ostensive function (e.g., eating an apple) or even aspects of our environment that might have been modified by others (e.g., a book found on the coffee table)? Although such stimuli should not trigger epistemic vigilance by default, they may under some circumstances. One might interpret a friend eating an apple as an indication that the friend has followed health advice to eat more fruit, or one could interpret one’s spouse’s placement of a book on a table as an invitation to read it—whether it was so intended or not. The behavior might then be discounted: We might suspect our friend of eating the apple only for our benefit while privately gorging on junk food.
Other cognitive mechanisms, more akin to strategic reasoning, but bound to overlap with epistemic vigilance, must process noncommunicative yet manipulative information (on the definition of communication vs. manipulation or coercion, see Scott-Phillips, 2008). A detective should be aware that some clues might have been placed by the criminal to mislead her. In some circumstances, therefore, epistemic vigilance and related mechanisms might apply even to our material environments, instead of applying only to straightforward cases of testimony. Still, epistemic vigilance should always apply to testimony, whereas it should apply to perception only under specific circumstances, such that the distinction between these two domains (testimony vs. perception) remains a useful heuristic.
How might these considerations inform our understanding of delusions? Whereas in healthy individuals the scales are adaptively tipped in favor of trusting the perceptual over the ostensive, this imbalance may be maladaptively exacerbated in delusions (Fig. 1). This could be for at least two complementary reasons: Sensory or perceptual evidence may be overweighted, and testimonial evidence may be underweighted. We review each of these possibilities in turn.