Sensitive and Efficient Pangenome Construction through Alignment-Free Residue Pangenome Analysis
Updated: Sep 29, 2022
Arnav Lal 1, Andries Feder 2, Ahmed Moustafa 3,4, and Paul J. Planet 2,3,5
1 School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Division of Pediatric Infectious Diseases, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
3 Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
4 Division of Gastroenterology, Hepatology & Nutrition, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
5 Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
Most pangenome analyses rely on large-scale alignments of genes or proteins, which makes them computationally expensive. We found that protein sequences can be transformed into one-dimensional, twenty-value vectors (vector of residue counts; vRC), where each value is the count of one of the 20 possible amino acids in the protein. The vRC serves as a unique protein identifier that makes mathematical comparisons much more tractable and retains information about homology. We use these vectors to develop an alignment-free method for pangenome clustering named Alignment-Free Residue Pangenome Analysis (ARPA; https://github.com/Arnavlal/ARPA). We test this method on large datasets of whole-genome prokaryotic sequences and compare it to an alignment-based pangenome approach (Roary). Our alignment-free approach yields much faster homolog clustering than BLAST-based alignment. While speed depends on dataset composition, the first version of ARPA is nearly 100 times faster than Roary at clustering hundreds of intra-species genomes. A beta version with dramatically improved efficiency can cluster 5,385 Staphylococcus aureus genomes in 175 seconds on a personal computer. ARPA retains high clustering accuracy (Rand score > 0.9999, with similar scores on other clustering metrics). The specificity of ARPA is comparable to that of BLAST-based approaches to defining homologs, with approximately 88% sensitivity and 99.8% specificity. In addition to homolog clustering, vRCs can be used to separate paralogs (based on gene neighborhood) and thereby identify orthologs. The pangenome generated by ARPA enables quick calculation and visualization of allelic diversity at pangenomic scale. This allelic diversity can be used for efficient and accurate phylogenetic inference. ARPA is an efficient and accurate platform for large-scale pangenome alignment projects that can be used in comparative and phylogenetic analysis.
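To make the vRC transformation concrete, the following is a minimal Python sketch of how a protein sequence could be mapped to a twenty-value vector of residue counts, and how two such vectors could be compared. It is an illustration under stated assumptions, not ARPA's implementation: the function names `vrc` and `residue_difference`, the alphabetical residue ordering, and the use of a summed absolute difference as the comparison are all hypothetical choices made here, and the example sequences are invented.

```python
from collections import Counter
from typing import Tuple

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
# The ordering ARPA itself uses is not stated in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def vrc(protein: str) -> Tuple[int, ...]:
    """Return a 20-value vector of residue counts for a protein sequence.

    Characters outside the 20 standard residues (e.g. 'X' or '*') are
    ignored in this sketch.
    """
    counts = Counter(protein.upper())
    return tuple(counts.get(aa, 0) for aa in AMINO_ACIDS)


def residue_difference(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Sum of absolute per-residue count differences between two vRCs.

    Hypothetical comparison chosen for illustration; it is not claimed
    to be the metric ARPA uses for clustering.
    """
    return sum(abs(x - y) for x, y in zip(a, b))


# Two invented sequences differing by a single R -> K substitution.
seq_a = "MKTAYIAKQR"
seq_b = "MKTAYIAKQK"

vec_a = vrc(seq_a)
vec_b = vrc(seq_b)

print(vec_a)
print(vec_b)
print(residue_difference(vec_a, vec_b))
```

In this sketch a single substitution lowers one residue count by one and raises another by one, so the two vectors differ by a total of 2. Comparing fixed-length integer vectors in this way requires no alignment, which is the property the abstract attributes to the vRC representation.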