GenoDiffusion: Conditional Denoising Diffusion Model for Genomic Data Augmentation
Chen Song, PhD Student, Computer and Information Science Department, Temple University; Yuzhou Chen, PhD, Computer and Information Science Department, Temple University; Huanmei Wu, PhD, College of Public Health, Temple University; Xinghua Shi, PhD, Computer and Information Science Department, Temple University
Poster # 80
Although recent biotechnological advances have made large numbers of genomic sequences rapidly available, sharing and releasing such data for public access remains challenging. Genomic data are typically imbalanced or biased because of disease rarity and the limited affordability of testing, and they are difficult to share because of concerns about privacy, security, and consent. To address these challenges, we introduce a novel conditional denoising diffusion model, namely GenoDiffusion, that augments genomic data by generating realistic synthetic data that are balanced and free to share. Specifically, GenoDiffusion learns the underlying distribution of the input data and captures the complex dependencies among its features. Using the original genomic data as input, GenoDiffusion generates new synthetic data with similar population structures, variant frequency distributions, and linkage disequilibrium patterns. Extensive experimental results demonstrate that GenoDiffusion outperforms existing methods on multiple genomics datasets, including genotypes from the 1000 Genomes Project and genotypes of patients with prostate cancer from The Cancer Genome Atlas (TCGA).
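The abstract names variant frequency distributions and linkage disequilibrium (LD) patterns as properties the synthetic data should preserve. The following is a minimal sketch of how such a comparison could be computed, not the authors' evaluation code. It assumes genotypes are encoded as 0/1/2 alternate-allele dosages with one row per sample, and it approximates LD by the squared Pearson correlation of dosages, a common choice for unphased data. The function names and the toy matrices are hypothetical and chosen only for illustration.

```python
def allele_frequencies(genotypes):
    """Alternate-allele frequency per variant, assuming 0/1/2 dosages (rows = samples)."""
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    # Each diploid sample carries two alleles, hence the factor of 2.
    return [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]


def ld_r2(genotypes, i, j):
    """Approximate LD between variants i and j as the squared Pearson correlation of dosages."""
    x = [row[i] for row in genotypes]
    y = [row[j] for row in genotypes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A monomorphic variant has no defined correlation; report 0 by convention here.
        return 0.0
    return (cov * cov) / (var_x * var_y)


def mean_abs_frequency_gap(real, synthetic):
    """Mean absolute difference in allele frequency between two genotype matrices."""
    freq_real = allele_frequencies(real)
    freq_synth = allele_frequencies(synthetic)
    return sum(abs(a - b) for a, b in zip(freq_real, freq_synth)) / len(freq_real)


# Toy matrices (hypothetical): 4 samples x 3 variants.
real = [[0, 0, 2], [1, 1, 0], [2, 2, 1], [1, 1, 1]]
synthetic = [[0, 0, 1], [1, 1, 2], [2, 2, 0], [1, 0, 1]]

if __name__ == "__main__":
    print("real allele frequencies:", allele_frequencies(real))
    print("LD r^2 between variants 0 and 1 (real):", ld_r2(real, 0, 1))
    print("mean frequency gap:", mean_abs_frequency_gap(real, synthetic))
```

A smaller frequency gap and closer pairwise r^2 values would suggest that a synthetic dataset tracks the real one on these two properties. Population structure would need a separate check, for example a principal component comparison.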