Improving Genomic Data Diversity Using Few-shot Generative Domain Adaptation
Updated: Sep 29, 2022
Chen Song, Computer and Information Science Department, Temple University
Emily Thyrum, Computer and Information Science Department, Temple University
Dr. Xinghua Shi, Computer and Information Science Department, Temple University
Despite recent advances in generating large-scale genomic sequences, human genomic data still suffer from a lack of diversity due to various factors, including disease rarity and the affordability of testing. For example, most available genomic data come from populations with European ancestry, and data from other populations are scarce. Hence, this study develops a novel deep learning model based on Generative Adversarial Networks (GANs) to augment genomic data from underrepresented groups by transferring knowledge learned from genomes in populations with abundant data. The main challenge is to synthesize highly realistic and diverse data under limited supervision. To address it, we deploy a few-shot transfer learning strategy that adapts a model pretrained on a majority population to a minority population. In particular, we train a semi-supervised GAN built from a sequence of convolutional layers to capture the underlying patterns of genomes from the majority population. We then adapt the pretrained model to genomic data from the minority population by freezing the well-learned hidden layers and fine-tuning the output heads of the discriminator and the generator. Furthermore, we implement a truncation strategy that constrains the hidden space to mitigate the optimization problems caused by the lack of training samples. We experimentally evaluated the proposed approach by significantly increasing the diversity of prostate cancer genotype data from The Cancer Genome Atlas (TCGA) datasets. Our results show that the proposed method can generate synthetic data similar to the real data. We anticipate that the improved diversity of genomic data generated with our method will enhance the performance of machine learning models toward precision medicine for all.
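The abstract does not say how the truncation strategy is implemented. As an illustration only, the sketch below assumes it resembles the truncation trick commonly used with GANs: latent values are drawn from a standard normal distribution, and any value whose magnitude exceeds a threshold is redrawn, so the generator only sees inputs from a bounded region. The function name `sample_truncated_latent`, the threshold of 1.0, and the latent dimension of 8 are all hypothetical and are not taken from the study.

```python
import random


def sample_truncated_latent(dim, threshold=1.0, rng=None):
    """Draw a latent vector from N(0, 1), redrawing entries with |z| > threshold.

    Assumption: this mirrors the generic GAN truncation trick, not
    necessarily the strategy used in the study.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    rng = rng or random.Random()
    latent = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        # Rejection step: keep redrawing until the value is inside the bound.
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        latent.append(value)
    return latent


# Hypothetical usage: an 8-dimensional latent vector bounded to [-1, 1].
rng = random.Random(0)
latent = sample_truncated_latent(dim=8, threshold=1.0, rng=rng)
```

A smaller threshold keeps samples closer to the high-density region of the latent distribution, which typically trades some diversity for more stable, realistic outputs; the right balance for genotype data would have to be determined empirically.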