Poster #17 - Laura Schultz
- vitod24
- Oct 20
A robust machine learning approach for genotyping the 17q21.31 inversion polymorphism across diverse ancestral populations.
Schultz, L.M. (PhD), Quinto-Cortés, C.D. (PhD), Montserrat, D.M. (PhD), Ioannidis, A. (PhD), Lanzagorta, N. (MS), Bustamante, C.D. (PhD), Jacquémont, S. (MD), Nicolini, H. (MD, PhD), Glahn, D.C. (PhD), and Almasy, L. (PhD)
Inversions are relatively understudied structural variants that are increasingly recognized for their contributions to human phenotypes. A 17q21.31 inversion (inv17_007) is linked to brain morphology and neuropsychiatric disorders. Previous inv17_007 association studies have focused on European-ancestry (EUR) populations because existing SNP-based methods for inferring inversion status at scale (e.g., scoreInvHap) yield inaccurate calls for other populations. Hence, we devised a method that can be used to robustly assign inversion genotypes across all ancestral populations. First, we identified a stable set of 240 biallelic SNPs within the inv17_007 region of samples genotyped for HapMap3, 1000G, HGDP, UKBB, and EPIMex and harmonized all SNPs to build hg19. Then, we split the samples into 5 continental ancestry groups and ran PCA for each group. We trained and cross-validated ancestry-specific 2-PC linear support vector machine (SVM) models using 592 EUR, 486 AFR, 197 SAS, 521 EAS, and 202 AMR reference samples with known inv17_007 genotypes. All 5 single-ancestry models classified H1/H1 (homozygous non-inverted), H1/H2 (heterozygous), and H2/H2 (homozygous inverted) genotypes with 100% cross-validated accuracy. We used these ancestry-specific models to infer inv17_007 genotypes for the remaining reference samples and the UKBB and EPIMex samples. Analysis of 650 trios from 1000G and EPIMex yielded no Mendelian errors, and the inv17_007 genotypes we inferred for the EUR-ancestry UKBB individuals agreed with the results we obtained using scoreInvHap. A subsequent multi-ancestry SVM model inferred UKBB and EPIMex inv17_007 genotypes that agreed perfectly with those inferred by the ancestry-specific models. Our curated set of 240 SNPs and inferred inv17_007 genotypes for 3431 publicly available reference samples therefore enable fast, accurate inv17_007 genotyping of biobank-scale cohorts via PCA and SVM, with no need to classify individuals into discrete ancestry groups.
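
To make the PCA-plus-SVM step concrete, here is a minimal sketch in Python with scikit-learn. It is not the authors' pipeline. The genotype matrix is simulated, and the allele-frequency model, the sample sizes, the 5-fold split, and the helper name `simulate_genotypes` are all assumptions made for illustration. Only the overall shape (240 SNPs, projection onto 2 PCs, a linear SVM, three genotype classes) follows the abstract.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

N_SNPS = 240  # size of the curated SNP set; every genotype below is simulated
LABELS = np.array(["H1/H1", "H1/H2", "H2/H2"])

# Assumed toy model: H1 and H2 haplotypes have different alternate-allele
# frequencies at every SNP in the region.
FREQ_H1 = rng.uniform(0.05, 0.25, N_SNPS)
FREQ_H2 = rng.uniform(0.75, 0.95, N_SNPS)


def simulate_genotypes(n_per_class):
    """Return (dosages, class indices) for simulated samples.

    Dosages are 0/1/2 alternate-allele counts, built by summing two
    simulated haplotypes per sample. Hypothetical helper, not real data.
    """
    X, y = [], []
    pairs = [(FREQ_H1, FREQ_H1), (FREQ_H1, FREQ_H2), (FREQ_H2, FREQ_H2)]
    for class_idx, (freq_a, freq_b) in enumerate(pairs):
        hap_a = rng.random((n_per_class, N_SNPS)) < freq_a
        hap_b = rng.random((n_per_class, N_SNPS)) < freq_b
        X.append(hap_a.astype(int) + hap_b.astype(int))
        y.extend([class_idx] * n_per_class)
    return np.vstack(X), np.array(y)


# "Reference" samples with known inversion genotypes (simulated here).
X_ref, y_ref = simulate_genotypes(n_per_class=100)

# Project onto the first 2 PCs, then fit a linear SVM on those 2 coordinates.
model = make_pipeline(PCA(n_components=2), SVC(kernel="linear"))

# Stratified 5-fold cross-validation; the PCA is refit inside each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_ref, y_ref, cv=cv)
print("mean cross-validated accuracy:", scores.mean())

# Fit on all reference samples, then infer genotypes for new samples.
model.fit(X_ref, y_ref)
X_new, _ = simulate_genotypes(n_per_class=5)
predicted = LABELS[model.predict(X_new)]
print(predicted)
```

In this sketch the PCA sits inside the pipeline, so each cross-validation fold learns its own projection from the training samples only. The abstract does not say whether the PCs were computed before or within cross-validation, so treat that as a design choice of the example.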

