Genetic substructure and complex demographic history of South African Bantu speakers. Dhriti Sengupta, Ananyo Choudhury, Cesar Fortes-Lima, Shaun Aron, Gavin Whitelaw, Koen Bostoen, Hilde Gunnink, Natalia Chousou-Polydouri, Peter Delius, Stephen Tollman, F. Xavier Gómez-Olivé, Shane Norris, Felistas Mashinya, Marianne Alberts, AWI-Gen Study, H3Africa Consortium, Scott Hazelhurst, Carina M. Schlebusch & Michèle Ramsay. Nature Communications volume 12, Article number: 2080. Apr 7 2021. https://www.nature.com/articles/s41467-021-22207-y
Popular version: How the language you speak aligns to your genetic origins and may impact research on your health (phys.org)
Abstract: South Eastern Bantu-speaking (SEB) groups constitute more than 80% of the population in South Africa. Despite clear linguistic and geographic diversity, the genetic differences between these groups have not been systematically investigated. Based on genome-wide data of over 5000 individuals, representing eight major SEB groups, we provide strong evidence for fine-scale population structure that broadly aligns with geographic distribution and is also congruent with linguistic phylogeny (separation of Nguni, Sotho-Tswana and Tsonga speakers). Although differential Khoe-San admixture plays a key role, the structure persists after Khoe-San ancestry-masking. The timing of admixture, levels of sex-biased gene flow and population size dynamics also highlight differences in the demographic histories of individual groups. The comparisons with five Iron Age farmer genomes further support genetic continuity over ~400 years in certain regions of the country. Simulated trait genome-wide association studies further show that the observed population structure could have major implications for biomedical genomics research in South Africa.
Discussion
More than 40 million South Africans speak one of the nine major South-Eastern Bantu languages as their first language. Notwithstanding clear divisions in the South-Eastern Bantu language phylogeny and geographic stratification of the speakers, very few studies have investigated the genetic differentiation between SEB groups. Based on a large-scale study of over 5000 participants representing eight of the nine major SEB groups in South Africa, we have demonstrated the presence of a robust fine-scale population structure within the SEB groups, which broadly separates genomes of SEB groups into the three major linguistic divisions (Nguni, Sotho-Tswana, and Tsonga), and also reflects the geographic distribution of LMAs to a large extent. The resolution of this structure within the SEB groups was enhanced considerably by taking ethno-linguistic concordance of individuals and their geographic locations into account. However, self-identity is itself complex, with about one third of the participants having more than one parent or grandparent with a different ethnic self-identity. Moreover, while the PCA and PCA-UMAP analyses show clear population structure, there are exceptions highlighting the fluidity of cultural identity. Thus, self-selected group identity encompasses significant group-related genetic variability, and it is important to emphasise that cultural identity and genetic variation are not necessarily aligned. Studies on population structure in South Africa should not be seen as justifying the ethnic nationalism generated by the country’s colonial and apartheid past. Our aim was to explore the role of genetic diversity in explaining population history and in health research. We recognise, and our study shows, that self-identity can involve considerable fluidity and that biological reductionist approaches pose dangers for the interpretation of our findings.
In alignment with results from previous studies (refs. 10, 32), our data also show that differential Khoe-San gene flow plays a major role in the population structure of SEB groups. However, the persistence of the structure even after accounting for differential Khoe-San admixture suggests the contribution of other demographic factors in the genetic differentiation of these groups. The SEB groups start to show clear divergence in population size dynamics from about 40 generations ago. This timeframe converges with the earliest dates of Khoe-San admixture and probably points to the initiation of migration events that gradually separated these groups. On the other hand, a rather wide variation in Khoe-San admixture dates (spanning ~20 generations) among SEB groups possibly reflects the complexity of the settlement of different parts of the country by the ancestral BS populations. Comparison of present-day SEB groups with Iron-Age farmer genomes provided evidence for genetic continuity in a geographic region in Central-East South Africa for at least the last 300–500 years. Our results, while attesting to the well-known pattern of Khoe-San female-biased gene flow, showed notable differences in the extent of this bias among different SEB groups, demonstrating that the nature of interaction between Khoe-San and BS could have varied temporally and geographically.
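To make the idea of sex-biased gene flow concrete, here is a minimal sketch of one textbook way to quantify it. It is not taken from the paper and may differ from the authors' method. It assumes a simple single-pulse admixture model with equal numbers of males and females, in which the autosomal ancestry fraction is A = (f + m) / 2 and the X-chromosomal fraction is X = (2f + m) / 3, where f and m are the source population's contributions through females and males. The function name and the input values are hypothetical.

```python
def sex_specific_contributions(autosomal, x_linked):
    """Solve A = (f + m) / 2 and X = (2f + m) / 3 for f and m.

    autosomal: ancestry fraction from the source population on the autosomes (A)
    x_linked:  ancestry fraction from the same source on the X chromosome (X)
    Returns (f, m), the implied female-line and male-line contributions.
    """
    female = 3 * x_linked - 2 * autosomal
    male = 4 * autosomal - 3 * x_linked
    return female, male


if __name__ == "__main__":
    # Hypothetical ancestry fractions chosen for illustration only,
    # not values reported in the study.
    f, m = sex_specific_contributions(autosomal=0.20, x_linked=0.25)
    print(f"female-line contribution: {f:.2f}, male-line contribution: {m:.2f}")
```

With these illustrative inputs, an X-chromosomal fraction higher than the autosomal one implies f = 0.35 and m = 0.05, that is, a female-biased contribution. Equal fractions on the X chromosome and the autosomes would imply no bias under this model.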
The dataset we generated for this study has provided a much better contextualization for previously sequenced Iron-Age genomes from Southern Africa. The SEB groups stand out in Africa as being among the very few populations that carry considerable gene flow from the Khoe-San. These data are therefore of major importance for understanding the interaction between the Khoe-San and other Southern African populations. They will play an important role in providing insights through comparative analyses once more genetic data from hunter-gatherers and ancient genomes from this geographic region become available.
Our analyses, including allele frequency comparisons, genome-wide scans for selection and Khoe-San ancestry distribution, show the SEB groups to be highly diverged at certain genomic regions. Based on simulated-trait GWAS, we further illustrate that the fine-scale population structure within the SEB groups could impact a GWAS by introducing a large number of false positives. A combination of cautious study design to minimize geographic and ethno-linguistic biases and stringent measures for population structure correction is therefore recommended for GWASs involving SEB groups. Moreover, while a GWAS can address the false positives introduced by population structure using genomic control, principal component (PC) adjustment or other approaches, it is impossible to identify and control for population structure in candidate gene studies. Therefore, utmost care should be taken during study design to ethnically and geographically homogenise samples in order to control for false positives in association studies using limited markers.
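The effect described above can be illustrated with a toy simulation. This is a minimal sketch, not the authors' simulated-trait GWAS pipeline. It assumes two hypothetical groups whose trait means differ for non-genetic reasons and whose allele frequencies differ slightly at every SNP, with no SNP being causal. All parameter values, function names and the simple n * r^2 test statistic are assumptions made for illustration. The sketch compares a naive test with one that regresses out the group label, and reports the genomic control inflation factor (median test statistic divided by the median of a 1 d.f. chi-square distribution).

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 d.f.
CHI2_1DF_CRIT = 3.841     # 5% critical value of the same distribution


def corr_chi2(x, y):
    """Approximate 1 d.f. association statistic n * r^2 (r = Pearson correlation)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return n * sxy * sxy / (sxx * syy)


def centre_by_group(values, groups):
    """Subtract each group's mean, which regresses out the group label."""
    means = {}
    for g in set(groups):
        means[g] = statistics.fmean([v for v, h in zip(values, groups) if h == g])
    return [v - means[g] for v, g in zip(values, groups)]


def simulate(n_per_group=400, n_snps=400, trait_shift=0.5, freq_shift=0.1, seed=7):
    """Return (naive, adjusted) test statistics for SNPs with no causal effect."""
    rng = random.Random(seed)
    groups = [0] * n_per_group + [1] * n_per_group
    # The trait depends on group membership only (e.g. environment), not on any SNP.
    trait = [rng.gauss(trait_shift * g, 1.0) for g in groups]
    trait_adj = centre_by_group(trait, groups)
    naive, adjusted = [], []
    for _ in range(n_snps):
        p = rng.uniform(0.2, 0.8)
        d = rng.choice((-freq_shift, freq_shift))
        freqs = (p - d, p + d)  # allele frequency in group 0 and group 1
        # Genotype = number of alternate alleles out of two independent draws.
        geno = [(rng.random() < freqs[g]) + (rng.random() < freqs[g]) for g in groups]
        naive.append(corr_chi2(geno, trait))
        adjusted.append(corr_chi2(centre_by_group(geno, groups), trait_adj))
    return naive, adjusted


def summarise(chi2):
    """Return (inflation factor, nominal hits, hits after genomic control)."""
    lam = statistics.median(chi2) / CHI2_1DF_MEDIAN
    hits = sum(c > CHI2_1DF_CRIT for c in chi2)
    # Genomic control divides every statistic by lambda (never by less than 1).
    gc_hits = sum(c / max(lam, 1.0) > CHI2_1DF_CRIT for c in chi2)
    return lam, hits, gc_hits


if __name__ == "__main__":
    naive, adjusted = simulate()
    for label, stats in (("naive", naive), ("group-adjusted", adjusted)):
        lam, hits, gc_hits = summarise(stats)
        print(f"{label}: lambda={lam:.2f}, hits={hits}, hits after GC={gc_hits}")
```

Because every hit in this setup is a false positive by construction, the naive hit count is a direct measure of the damage done by uncorrected structure. A candidate gene study with a handful of markers has no genome-wide distribution from which to estimate the inflation factor, which is why homogeneous sampling matters so much more there.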
A major limitation of our study is that the sampling sites do not cover the full geographic spread of SEB groups in the country, possibly causing some of the groups to be suboptimally represented in our dataset. Nevertheless, our results suggest that we are at a critical point in history where the population structure is still observable with efficient sampling and in-depth ethno-linguistic characterization, even if it is gradually diminishing due to migration and intermingling between different SEB groups. We hope that our findings will motivate studies with larger sample sizes and wider geographic representation to help unravel the demographic events that contributed to the peopling of South Africa.