Predicting Mental Health From Followed Accounts on Twitter. Cory Costello; Sanjay Srivastava; Reza Rejaie; Maureen Zalewski. Collabra: Psychology (2021) 7 (1): 18731. https://doi.org/10.1525/collabra.18731
Rolf Degen's take: "The accounts you follow on Twitter could shed some light on your mental health. https://t.co/iLPFnX3LCZ https://t.co/wVX3H6ngxm"
Abstract: The past decade has seen rapid growth in research linking stable psychological characteristics (i.e., traits) to digital records of online behavior in Online Social Networks (OSNs) like Facebook and Twitter, which has implications for basic and applied behavioral sciences. Findings indicate that a broad range of psychological characteristics can be predicted from various behavioral residue online, including language used in posts on Facebook (Park et al., 2015) and Twitter (Reece et al., 2017), and which pages a person ‘likes’ on Facebook (e.g., Kosinski, Stillwell, & Graepel, 2013). The present study examined the extent to which the accounts a user follows on Twitter can be used to predict individual differences in self-reported anxiety, depression, post-traumatic stress, and anger. Followed accounts on Twitter offer distinct theoretical and practical advantages for researchers; they are potentially less subject to overt impression management and may better capture passive users. Using an approach designed to minimize overfitting and provide unbiased estimates of predictive accuracy, our results indicate that each of the four constructs can be predicted with modest accuracy (out-of-sample R’s of approximately .2). Exploratory analyses revealed that anger, but not the other constructs, was distinctly reflected in followed accounts, and there was some indication of bias in predictions for women (vs. men) but not for racial/ethnic minorities (vs. majorities). We discuss our results in light of theories linking psychological traits to behavior online, applications seeking to infer psychological characteristics from records of online behavior, and ethical issues such as algorithmic bias and users’ privacy.
Keywords: emotions, social network analysis, online social networks, machine learning, data science, mental health
Discussion
Our central aim was to understand how mental health is reflected in network connections in social media. We did so by estimating how well individual differences in mental health can be predicted from the accounts that people follow on Twitter. The results showed that it is possible to do so with moderate accuracy. We selected models in training data using 10-fold cross-validation, and then we estimated the models’ performance in new data that was kept completely separate from training, where model Rs of approximately .2 were observed. Although these models were somewhat accurate, when we examined which features were weighted as important for prediction, we did not find them to be readily interpretable with respect to prior theories or broad themes of the mental health constructs we predicted.
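To make the modeling workflow concrete, the following is a minimal sketch of the train/validate/holdout logic described above, assuming a user-by-followed-account indicator matrix X and self-reported symptom scores y. The penalized regression model, hyperparameter grid, and variable names are illustrative stand-ins, not the exact pipeline used in the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 2000)).astype(float)  # toy 0/1 follow indicators
y = rng.normal(size=500)                                 # toy symptom scores

# Hold out data that never touches model fitting or selection.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.25, random_state=0)

# Choose the penalty with 10-fold cross-validation inside the training set only.
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-2, 4, 13)},
                      cv=10, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# Out-of-sample accuracy: correlation between predicted and observed scores.
r, _ = pearsonr(search.predict(X_hold), y_hold)
print(f"holdout R = {r:.2f}")
```

The key point is that the holdout correlation is computed on observations that play no role in either fitting or model selection, which is what licenses interpreting it as an out-of-sample estimate of accuracy.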
Mental Health and the Curation of Social Media Experiences
This study demonstrated that mental health is reflected in the accounts people follow to at least a small extent. The design and data alone cannot support strong causal inferences. One interpretation that we find plausible is that the results reflect selection processes. The list of accounts that a Twitter user follows is a product of decisions made by the user. Those decisions are the primary way that a user creates their personalized experience on the platform: when a user browses Twitter, a majority of what they see is content from the accounts they previously decided to follow. It is thus possible that different mental health symptoms affect the kind of experience people want to have on Twitter, thereby impacting their followed-account list. The straightforward ways this could play out that we discussed at the outset of this paper – e.g., face-valid information-seeking via mental health support or advocacy groups, homophily (following others who display similar mental health symptoms), or emotion regulation strategies – did not seem to be supported. Instead, the accounts with high importance scores were celebrities, sports figures, media outlets, and other people and entities from popular culture. In some rare instances, these hinted towards homophily or a similar mechanism: for example, one account with a high importance score for depression was emo-rapper Lil Peep, who was open about his struggles with depression before his untimely death. More often, however, the connections were even less obvious, and few patterns emerged across the variety of highly important predictors. Other approaches, such as qualitative interviews or experiments that manipulate different account features, may be more promising in the future for shedding light on this question.
Causality in the other direction is also plausible: perhaps following certain accounts affects users’ mental health. For example, accounts that frequently tweet depressing or angry content might elicit depression or anger in their followers in a way that endures past a single browsing session. The two causal directions are not mutually exclusive and could reflect person-situation transactional processes, whereby individual differences in mental health lead to online experiences that then reinforce the pre-existing individual differences, mirroring longitudinal findings of such reciprocal person-environment transactions in personality development (Le et al., 2014; Roberts et al., 2003). Future longitudinal studies could help elucidate whether similar processes occur with mental health and social media use.
In a set of exploratory analyses, we probed the extent to which the predicted scores were capturing specific versus general features of psychopathology. The followed-account scores that were constructed to predict anger captured variance that was unique to that construct; but for depression, anxiety, and post-traumatic stress, we did not see evidence of specificity. One possible explanation is that followed accounts primarily capture a more general psychopathology factor (Lahey et al., 2012; Tackett et al., 2013), but that anger has additional distinct features that are also reflected in followed accounts. Another possibility is that followed accounts can distinguish between internalizing and externalizing symptoms, and anger appeared to show specificity because it was the only externalizing symptom we examined. The present work cannot distinguish between these possibilities, but future work including more externalizing symptoms may help differentiate among these and other explanations.
Relevance for Applications
What does this degree of accuracy – a correlation between predicted and observed scores of approximately .2 – mean for potential applications? First, it is worth noting that our conclusions are limited to Twitter users who meet our minimal activity thresholds (25 tweets, 25 followers, 25 followed accounts), so they may not apply to Twitter users as a whole, including truly passive users who follow accounts but never tweet. Even among the users who do meet these thresholds, we do not believe these models are accurate enough for use in individual-level diagnostic applications, as they would provide a highly uncertain, error-prone estimate of any single individual’s mental health status. At best, a correlation of that size might be useful in applications that rely on aggregates of large amounts of data. For example, this approach could be applied to population mental health research to characterize trends in accounts from the same region or with other features in common.
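As a rough, hypothetical illustration of this point (not an analysis from the paper), the simulation below shows why a predicted-observed correlation of about .2 leaves individual estimates highly uncertain while still recovering the ordering of group means once many users are aggregated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 100_000, 0.2
true_scores = rng.normal(size=n)
pred = r * true_scores + np.sqrt(1 - r**2) * rng.normal(size=n)  # corr(pred, true) ~ .2

# Individual level: roughly 4% of variance explained, so any one person's estimate is noisy.
print("individual R^2 ~", round(np.corrcoef(pred, true_scores)[0, 1] ** 2, 3))

# Group level: two groups whose true means differ by 0.3 SD; with large groups,
# the difference in mean *predicted* scores reliably recovers the ordering.
group = rng.integers(0, 2, size=n)
shifted_true = true_scores + 0.3 * group
shifted_pred = r * shifted_true + np.sqrt(1 - r**2) * rng.normal(size=n)
print("mean predicted, group 0:", round(shifted_pred[group == 0].mean(), 3))
print("mean predicted, group 1:", round(shifted_pred[group == 1].mean(), 3))
```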
A caveat is that the goal of the present study was to focus on followed accounts – not to maximize predictive power by using all available information. It may be possible to achieve greater predictive accuracy by integrating analyses of followed accounts with complementary approaches that use tweet language and other data. In addition, more advanced approaches that would be tractable in larger datasets, such as training vector embeddings for followed accounts (analogous to word2vec embeddings; Mikolov et al., 2013), could help increase accuracy and should be investigated in the future. Likewise, it may be possible to leverage findings from recent work identifying clusters or communities of high in-degree accounts (Motamedi et al., 2018, 2020) to identify important accounts or calculate aggregate community scores, as opposed to the bottom-up approaches to filtering and aggregating accounts used in this study. Future work can examine the extent to which these different modifications to our procedure maximize predictive accuracy.
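As one hedged sketch of the embedding idea, a user’s followed-account list could be treated as an unordered “sentence” of account IDs and passed to gensim’s Word2Vec; users could then be represented by the average of their account vectors. The account handles, hyperparameters, and helper function below are hypothetical illustrations, not part of the present study.

```python
import numpy as np
from gensim.models import Word2Vec

# Each inner list stands in for the account IDs one user follows (hypothetical data).
followed_lists = [
    ["@nytimes", "@nba", "@lilpeep"],
    ["@espn", "@nba", "@sportscenter"],
    ["@nytimes", "@washingtonpost", "@npr"],
]

# Skip-gram embeddings; the window is set large because followed lists are unordered.
model = Word2Vec(sentences=followed_lists, vector_size=50, window=50,
                 min_count=1, sg=1, epochs=20, seed=0)

def user_vector(accounts, model):
    """Represent a user as the mean of their followed-account vectors."""
    vecs = [model.wv[a] for a in accounts if a in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

# These dense features could then be fed to the same penalized regression models.
print(user_vector(["@nytimes", "@nba"], model)[:5])
```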
Another important caveat to consider with respect to possible applications of this work is that this approach is more suited to studying more stable individual differences in mental health rather than dynamic, within-person fluctuations or responses to specific events. This was an aim that was reflected in the design of this study – for example, the wording of the mental health measures covered a broader time span than just the moment of data collection. Followed accounts are likely to be a less dynamic cue than other cues available on social media (e.g., language used in posts). This is not to say that network ties are unrelated to dynamic states entirely, and that possibility could be explored with different methods. For example, rather than focusing on whether accounts are followed or not, researchers could use engagements with accounts (such as liking or retweeting) to predict momentary reports of mental health symptomatology, or they could track users over time to measure new follows added after an event. The present work can only speak indirectly to these possibilities, but exploring approaches that dynamically link network ties to psychological states is a promising future direction for this work.
The present results, and the possibility of even higher predictive accuracy or greater temporal resolution with more sophisticated methods, raise important questions about privacy. The input to the prediction algorithm developed in this paper – a list of followed accounts – is publicly available for every Twitter user by default, and it is only hidden if a user sets their entire account to “private.” It is unlikely that users have considered how this information could be used to infer their mental health status or other sensitive characteristics. Indeed, even people who deliberately refrain from self-disclosing about their mental health online may be inadvertently providing information that could be the basis of algorithmic estimates, a possibility highlighted by the often less-than-straightforward accounts that the algorithms appeared to use in their predictions. With time, technological advancement, and further research, these predictions might become even more accurate while still relying on similarly non-obvious cues, though we cannot say how much more. In this way, the present findings are relevant for individuals making informed decisions about whether and how to use social media. Likewise, they speak to broader issues of ethics, policy, and technology regulation at a systemic level (e.g., Tufekci, 2020). The possibility of a business, government, or other organization putting its considerable resources into using public social media data to construct profiles of users’ mental health may have useful applications in public health research, but it simultaneously raises concerns about how such profiles could be misused. Our results suggest that accuracy is currently too low for such utopian or dystopian ends, but they highlight the possibilities, and the need for in-depth discussions about data, computation, privacy, and ethics.
Predictive Bias
Predictive algorithms can be biased with respect to gender, race, ethnicity, and other demographics, which can create and reinforce social inequality when those algorithms are used to conduct basic research or in applications (Mullainathan, 2019). When we probed for evidence of predictive bias for gender, we found somewhat inconclusive results. There was more of a pattern of bias in the smaller holdout dataset than in the combined data. In the holdout data, women showed higher observed levels of internalizing symptoms (depression, anxiety, and post-traumatic stress) than men with the same model-predicted scores. In the larger combined dataset, only post-traumatic stress showed this effect, and with a much smaller magnitude. Confidence bands in both datasets often ranged from no effect to moderately large effects in one or both directions. Altogether, we took this as suggestive but inconclusive evidence that the models may have been biased. If the pattern is not spurious, one possible reason may stem from the fact that the sample had more men than women. If men’s and women’s mental health status is associated with which accounts they follow, but the specific accounts vary systematically by gender, then overrepresentation of men in the training data could have resulted in overrepresentation of their followed accounts in the algorithm.
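For readers unfamiliar with the psychometric approach referenced here, the sketch below illustrates the standard regression test for predictive bias: regress observed scores on model predictions, a group indicator, and their interaction, where a group main effect indicates intercept bias (the pattern described above for gender) and an interaction indicates slope bias. The simulated data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "predicted": rng.normal(size=400),      # model-predicted symptom scores
    "woman": rng.integers(0, 2, size=400),  # 1 = woman, 0 = man (simulated)
})
df["observed"] = 0.2 * df["predicted"] + 0.1 * df["woman"] + rng.normal(size=400)

# 'woman' coefficient: intercept bias (same prediction, different observed level);
# 'predicted:woman' coefficient: slope bias (predictions track symptoms differently by group).
fit = smf.ols("observed ~ predicted * woman", data=df).fit()
print(fit.params)
print(fit.conf_int())  # wide intervals mirror the inconclusive pattern described above
```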
We found little to no evidence of bias with respect to race or ethnicity. The relative lack of bias is initially reassuring, but it should be considered alongside two caveats. First, it is possible that there is some amount of bias that we were unable to detect with the numbers of racial and ethnic minority participants in this dataset. This possibility is highlighted by the confidence bands, which (like gender) tended to range from no effect to moderately large effects. Second, it is possible that collapsing into White vs. non-White is obscuring algorithmic bias that is specific to various racial and ethnic identities. Our decision to combine minority racial and ethnic groups was based on the limitations of the available data, and it necessarily collapses across many substantively important differences.
In any future work to extend or apply the followed-accounts prediction method we present in this study, we strongly encourage researchers to attend carefully to the potential for algorithmic bias. We also hope that this work helps demonstrate how well-established psychometric methods for studying predictive bias can be integrated with modern machine learning methods.
Considering Generalizability At Two Levels of Abstraction
To what extent would the conclusions of this study apply in other settings? There are at least two ways to consider generalizability in this context. The first form of generalizability is a more abstract one, associated with the approach. Would it be possible to obtain similar predictive accuracy by applying this modeling approach to new data drawn from a different population, context, or time, developing a culturally-tuned algorithm for that new setting? We believe the results are likely to be generalizable in this sense. We used cross-validation and out-of-sample testing to safeguard against capitalizing on chance in estimates of accuracy. If the general principle holds that Twitter following decisions are associated with mental health, we expect that it would be possible to create predictive algorithms in a similar way in other settings.
A second, more specific way to think about generalizability is whether the particular prediction algorithms we trained in this study would generalize to entirely new samples from different settings. This is a much higher bar, and we are more skeptical that the models trained in this study would meet it. The fact that the models were not interpretable suggests that they may not have been picking up on theoretically central, universally relevant features of psychopathology. Instead, they might be picking up on real, but potentially fleeting, links between psychopathology and Twitter behavior. By analogy, consider differences between a self-report item like, “I frequently feel sad,” and an item like, “I frequently listen to Joy Division.” The first item would probably be endorsed by depressed people in a wide variety of contexts, populations, and historical eras. The second item, however, is deeply culturally embedded – if it is reflective of depression at all, that association would be highly specific to a particular group of people at a particular cultural moment. Even setting aside that Twitter itself is a product of a specific cultural and historical context, our inspection of the followed accounts suggests that they are not reflecting enduring features of psychopathology in a direct, face-valid sense. The associations with particular accounts were real in this data, but as cultural trends change, they may fade while new ones emerge.
Our results cannot speak to this form of generalizability directly, and it would require a new sample and different design to effectively speak to this. One possibility would be to collect several very different samples (e.g., sampled in different years), train models with each, and then evaluate cross-sample predictive accuracy. This would be a much stricter test of accuracy, but it would provide better justification for using model-derived scores in research or application. Such an approach might also be useful for distinguishing which accounts or features of accounts are predictive because of fleeting cultural factors, and which ones reflect stable and cross-contextually consistent associations with psychopathology.
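A minimal sketch of such a cross-sample test, assuming two hypothetical samples collected in different years that share the same followed-account feature space, might look like the following; the data objects, model class, and penalty value are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def cross_sample_r(X_train, y_train, X_test, y_test, alpha=10.0):
    """Train in one sample, report the predicted-observed correlation in the other."""
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    return pearsonr(model.predict(X_test), y_test)[0]

# Toy stand-ins for two samples with a shared followed-account feature space.
rng = np.random.default_rng(2)
X_2019, y_2019 = rng.normal(size=(300, 500)), rng.normal(size=300)
X_2022, y_2022 = rng.normal(size=(300, 500)), rng.normal(size=300)

print("2019 -> 2022:", round(cross_sample_r(X_2019, y_2019, X_2022, y_2022), 3))
print("2022 -> 2019:", round(cross_sample_r(X_2022, y_2022, X_2019, y_2019), 3))
```

Accounts whose weights replicate across both directions of this test would be better candidates for stable, cross-contextually consistent markers than accounts that predict well in only one sample.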