Poster #10 - Lannawill Caruth
- vitod24
- Oct 20
Training Breast Cancer Genetic Association Models Using Biobank-Scale Data for Effective Risk Prediction
Lannawill Caruth (1,2), Roukiatou Sore (3), Sarah Ehsan (3), Penn Medicine BioBank, Regeneron Genetics Center, Anne Marie McCarthy (3)*, Shefali Setia-Verma (1,2)*
1. Perelman Sch. of Med. Inst. of BioMed. Informatics, Philadelphia, PA
2. Department of Pathology and Laboratory Medicine, Perelman School of Medicine, Philadelphia, PA, United States
3. Perelman Sch. of Med. Dept. of Biostatistics Epidemiology & Informatics, Philadelphia, PA
Breast cancer is the most common cancer among women in the United States, aside from skin cancer. Early detection and treatment significantly reduce mortality, which makes effective risk prediction important. Traditional GWAS methods do not capture non-linear associations across the genome, which limits their usefulness for risk prediction. Deep learning methods offer a way to use non-linear relationships in risk prediction.

In this study, we attempt to train a neural network to predict breast cancer case-control status. We performed association testing using SAIGE within the All of Us Biobank in a cohort of women of all ancestries aged 18-65. This initially produced a cohort of more than 180,000 cases and controls, and the association analysis yielded more than 1,000 SNPs significantly associated with our defined breast cancer phenotypes. These SNPs were then used to pre-initialize weights in a neural network predicting breast cancer case-control status within the eMERGE Network. Based on the number of cases (n=4220), we subsampled the control population (n=28242) to better reflect the prevalence of breast cancer in the United States. The model was then trained for 50 epochs, using dropout layers and batch normalization for regularization.

The model yielded promising performance, with an AUC of 0.73 in the validation set and 0.69 in the test set.

We will next train and evaluate GWANN on ancestry-stratified cohorts, with a specific focus on enhancing predictive performance in populations historically excluded from genetic research. This will include targeted efforts in non-European ancestry groups, where existing risk models often perform poorly. By embedding biological plausibility into the model architecture and leveraging richly annotated variant data, our approach moves beyond traditional black-box models, offering a transparent, interpretable framework for genetic risk prediction.
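The control subsampling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `subsample_controls`, the ID lists, the size of the control pool, and the target prevalence of 0.13 are all assumptions made for the example. The abstract does not state the prevalence value it targeted.

```python
import random


def subsample_controls(case_ids, control_ids, target_prevalence, seed=0):
    """Pick a random subset of controls so that cases make up roughly
    `target_prevalence` of the combined cohort.

    Hypothetical helper for illustration; not taken from the poster.
    """
    if not 0 < target_prevalence < 1:
        raise ValueError("target_prevalence must be strictly between 0 and 1")
    n_cases = len(case_ids)
    # cases / (cases + controls) = p  =>  controls = cases * (1 - p) / p
    n_needed = round(n_cases * (1 - target_prevalence) / target_prevalence)
    # Never ask for more controls than the pool actually contains.
    n_needed = min(n_needed, len(control_ids))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(list(control_ids), n_needed)


# Illustrative usage with placeholder IDs (not real participant data).
cases = list(range(4220))                    # case count from the abstract
control_pool = list(range(10_000, 110_000))  # assumed pool of 100,000 controls
controls = subsample_controls(cases, control_pool, target_prevalence=0.13)
prevalence = len(cases) / (len(cases) + len(controls))
print(len(controls), round(prevalence, 3))
```

With 4,220 cases and an assumed target of 0.13, the formula asks for 28,242 controls, which coincides with the control count reported in the abstract, though the abstract itself does not say which target value was used.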
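The abstract does not say how the GWAS results are mapped onto network weights or how AUC was computed, so the sketch below shows only one possible reading. It assumes per-SNP effect sizes (betas) are copied into the first hidden unit of the input layer while the remaining units start from small random noise, and it computes AUC with the rank-based (Mann-Whitney) formula. All names (`init_first_layer`, `hidden_preactivations`, `roc_auc`) and all numeric values are hypothetical.

```python
import random


def init_first_layer(gwas_betas, n_hidden, noise_sd=0.01, seed=0):
    """Build an n_hidden x n_snps weight matrix (list of rows).

    Assumption for illustration: row 0 is set to the GWAS effect sizes,
    so that unit starts out as an additive genetic score. Other rows
    start from small Gaussian noise.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, noise_sd) for _ in gwas_betas]
               for _ in range(n_hidden)]
    weights[0] = list(gwas_betas)
    return weights


def hidden_preactivations(weights, genotype):
    """Linear part of the first layer for one genotype vector (0/1/2 dosages)."""
    return [sum(w * g for w, g in zip(row, genotype)) for row in weights]


def roc_auc(labels, scores):
    """Rank-based AUC; tied scores receive their average rank."""
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Illustrative usage with made-up effect sizes and dosages.
betas = [0.12, -0.05, 0.30]
layer = init_first_layer(betas, n_hidden=4)
print(hidden_preactivations(layer, [2, 0, 1])[0])  # additive score for this genotype
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In a real model the first layer would be followed by non-linear activations, dropout, and batch normalization as the abstract describes; this sketch only shows where prior association results could enter the network and how a case-control AUC can be checked by hand.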

