Jason Lamanna 1, Mohammad Erfan Mowlaei 2, Paul English 3, Xinghua Shi 2, Vincenzo Carnevale 1,4

1 iGEM-Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
2 Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
3 ARUP Labs, University of Utah, Salt Lake City, UT, USA
4 Institute for Computational Molecular Science, Temple University, Philadelphia, PA, USA
Poster # 46
Evolutionary relationships between proteins from the same orthologous family constrain their sequence variability to intricate mutational patterns that can be learned by deep models. Understanding this "functional imprint" has wide-reaching implications spanning from molecular biology to drug discovery, as it bridges the gap between sequence and function. However, assessing the quality of these models is often challenging given the paucity and limited size of annotated datasets. A commonly used approach to overcoming this limitation is to evaluate generative capacity, i.e., a model's ability to hallucinate sequence ensembles that are statistically indistinguishable from naturally occurring ones. In this regard, two architectures have stood out in recent years: the Transformer and the Variational Autoencoder (VAE). While the former is particularly effective at capturing complex statistical dependencies between distinct positions along the sequence, the latter is built on a low-dimensional representation that embodies the sequence-function relationship in an easy-to-interpret way. Here we present a new model for protein sequence generation (ProGenAT) that combines the best of both worlds: a VAE endowed with adversarial training based on a Transformer discriminator. We show that ProGenAT outperforms state-of-the-art protein generative models and produces evolutionarily meaningful low-dimensional representations of protein sequences. Our work paves the way for the rational design of protein sequences and for phylogenetically informed protein sequence annotation.
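The abstract gives no implementation details, but the architecture it names (a VAE whose decoder is additionally trained against a Transformer discriminator) can be made concrete with a minimal sketch. The following PyTorch code is a hypothetical illustration only: the fixed aligned length, one-hot encoding, class names, layer sizes, and the 0.1 adversarial loss weight are all assumptions, not ProGenAT's actual implementation.

```python
# Minimal sketch (PyTorch assumed): a sequence VAE with an adversarial
# Transformer discriminator. All names, sizes, and loss weights are
# hypothetical illustrations, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

SEQ_LEN, N_AA, LATENT = 128, 21, 16  # aligned length, 20 amino acids + gap, toy latent dim

class SeqVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(SEQ_LEN * N_AA, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, LATENT)
        self.to_logvar = nn.Linear(256, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, SEQ_LEN * N_AA))

    def forward(self, x_onehot):                      # x_onehot: (B, SEQ_LEN, N_AA)
        h = self.enc(x_onehot)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z).view(-1, SEQ_LEN, N_AA), mu, logvar

class TransformerDiscriminator(nn.Module):
    """Scores a (soft) one-hot sequence as natural vs. generated."""
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Linear(N_AA, d_model)         # linear embedding accepts soft inputs
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x_soft):
        h = self.encoder(self.embed(x_soft))
        return self.head(h.mean(dim=1)).squeeze(-1)   # one real/fake logit per sequence

vae, disc = SeqVAE(), TransformerDiscriminator()
opt_g = torch.optim.Adam(vae.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(x_onehot):
    # Generator step: reconstruction + KL (standard VAE) + adversarial term.
    opt_g.zero_grad()
    logits, mu, logvar = vae(x_onehot)
    recon = F.cross_entropy(logits.transpose(1, 2), x_onehot.argmax(-1))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    fake = F.softmax(logits, dim=-1)                  # soft samples keep gradients flowing
    adv_g = F.binary_cross_entropy_with_logits(
        disc(fake), torch.ones(x_onehot.size(0)))
    (recon + kl + 0.1 * adv_g).backward()             # 0.1: arbitrary adversarial weight
    opt_g.step()

    # Discriminator step: distinguish natural sequences from generated ones.
    opt_d.zero_grad()
    d_real, d_fake = disc(x_onehot), disc(fake.detach())
    adv_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    adv_d.backward()
    opt_d.step()

# Example: one update on a random batch of 8 one-hot encoded sequences.
train_step(F.one_hot(torch.randint(N_AA, (8, SEQ_LEN)), N_AA).float())
```

One plausible design point, reflected in the sketch: letting the decoder emit soft amino-acid distributions allows the discriminator's real-vs-fake signal to backpropagate into the VAE without discrete sampling; whether ProGenAT handles this the same way is not stated in the abstract.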