Mohammad Erfan Mowlaei [1], Chong Li [1], Oveis Jamialahmadi [2], Raquel Dias [3], Junjie Chen [4], Benyamin Jamialahmadi [5], Sudhir Kumar [6,7], Timothy Richard Rebbeck [8,9] and Xinghua Shi [1]
1 Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
2 Department of Molecular and Clinical Medicine, Institute of Medicine, Sahlgrenska Academy, Wallenberg Laboratory, University of Gothenburg, Gothenburg, Sweden
3 Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
4 Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen University Town, Shenzhen, Guangdong, China
5 David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, CA
6 Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
7 Department of Biology, Temple University, Philadelphia, PA, USA
8 Division of Population Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
9 Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Poster # 33
Next-generation sequencing has made whole-genome sequencing far more cost-effective, in both time and expense, than it was a decade ago. However, many research laboratories still rely on low-coverage sequencing and microarray chips to genotype samples because of cost. Furthermore, assay failures, genotype-calling errors, and differences among sequencing platforms and chips can lead to untyped and missing genotypes. Consequently, most genotype datasets contain at least some missing data. Genotype imputation methods are a suite of computational techniques designed to tackle this issue by predicting missing genotypes from patterns of sequence variation and linkage disequilibrium in reference genomes. We developed a new tool, termed Split Transformer Impute (STI), to impute missing bi-allelic and multi-allelic genotypes. Transformers, such as Generative Pre-trained Transformer 4 (GPT-4), have recently yielded state-of-the-art performance on a diverse set of problems across domains such as natural language processing. The idea that omics sequences can be treated like sentences in natural language processing was recently explored in Evolutionary Scale Modeling (ESMFold) for protein design. Along these lines, we developed STI to impute genotypes with high accuracy. In addition to using an embedding layer, STI forms local sliding windows of genotypes that pass through a series of convolutional and transformer blocks. Convolutions capture local correlations among SNPs but are weak at preserving information about correlations among distant SNPs. Attention complements convolutions by retaining long-range information, but it is memory-intensive when applied to long sequences, so we use sliding windows to keep memory consumption reasonable and allow the model to scale.
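As a rough illustration of why sliding windows limit the memory cost of attention, the following minimal sketch splits a sequence of embedded genotypes into overlapping windows and runs plain scaled dot-product self-attention within each one. It is not the STI implementation: the function names, the window size of 256, the overlap of 32, the embedding dimension, and the random stand-in data are all assumptions chosen for illustration, and the attention has no learned weights.

```python
import numpy as np

def window_bounds(n_sites, window, overlap):
    """Return (start, end) index pairs for overlapping windows covering n_sites."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + window, n_sites)
        bounds.append((start, end))
        if end == n_sites:
            break
        start += step
    return bounds

def self_attention(x):
    """Scaled dot-product self-attention over one window (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[1])        # (w, w) score matrix
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
n_sites, dim = 1000, 8                            # hypothetical sizes
embedded = rng.normal(size=(n_sites, dim))        # stand-in for embedded genotypes

bounds = window_bounds(n_sites, window=256, overlap=32)
outputs = [self_attention(embedded[s:e]) for s, e in bounds]

# Attention over the full sequence needs an n_sites x n_sites score matrix;
# windowed attention never needs more than window x window at a time.
full_cells = n_sites ** 2
largest_window_cells = max((e - s) ** 2 for s, e in bounds)
print(len(bounds), largest_window_cells, full_cells)
```

The score matrix grows quadratically with sequence length, so capping the window length caps the largest matrix that must be held in memory, at the cost of attention not reaching across window boundaries except through the overlap.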
To benchmark STI against existing methods, we designed two experiments. The first evaluates how well models impute sporadic missingness in the data, while the second evaluates how well they impute missing marker sites. We carried out these experiments on a yeast dataset, the human 1000 Genomes Project dataset, and microarray data. We found that STI performs competitively with or better than existing imputation models, including Minimac4, Beagle5.4, and other deep learning methods, in terms of accuracy, imputation quality score, r-squared, and F1-score. Furthermore, our model imputes multi-allelic SNVs much better than existing deep learning and classical methods. In conclusion, we have introduced a novel deep learning framework for accurate genotype imputation. This framework holds promise for predicting missing values and ungenotyped markers in disease association studies and meta-analyses.
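The sporadic-missingness experiment can be pictured with a small sketch: hide a random fraction of known genotypes, impute them, and score the imputed values against the hidden truth. Everything below is an assumption for illustration rather than the poster's protocol: the 0/1/2 allele-count coding, the -1 missing marker, the 10% masking rate, the random stand-in data, and the per-site most-frequent-genotype baseline, which merely takes the place of a real imputation model such as STI. Only accuracy and squared Pearson correlation are shown.

```python
import numpy as np

def mask_sporadic(genotypes, rate, rng):
    """Hide a random fraction of entries; return the masked copy and the mask."""
    mask = rng.random(genotypes.shape) < rate
    masked = genotypes.copy()
    masked[mask] = -1  # -1 marks a missing genotype (a convention assumed here)
    return masked, mask

def impute_by_site_mode(masked, n_classes=3):
    """Baseline: fill each missing entry with the most frequent observed genotype at that site."""
    imputed = masked.copy()
    for j in range(masked.shape[1]):
        col = masked[:, j]
        observed = col[col >= 0]
        fill = np.bincount(observed, minlength=n_classes).argmax() if observed.size else 0
        imputed[col < 0, j] = fill
    return imputed

def accuracy(truth, imputed, mask):
    """Fraction of masked entries imputed exactly right."""
    return float(np.mean(truth[mask] == imputed[mask]))

def r_squared(truth, imputed, mask):
    """Squared Pearson correlation between true and imputed values at masked entries."""
    t = truth[mask].astype(float)
    p = imputed[mask].astype(float)
    if t.std() == 0 or p.std() == 0:
        return float("nan")  # correlation is undefined for constant vectors
    return float(np.corrcoef(t, p)[0, 1] ** 2)

rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=(200, 50))   # samples x sites, random stand-in data
masked, mask = mask_sporadic(truth, rate=0.1, rng=rng)
imputed = impute_by_site_mode(masked)

acc = accuracy(truth, imputed, mask)
r2 = r_squared(truth, imputed, mask)
print(f"accuracy={acc:.3f}  r2={r2:.3f}")
```

Because the stand-in genotypes are random and carry no linkage disequilibrium, the baseline scores here say nothing about real performance; the sketch only shows the mask, impute, and score loop.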