Cladebreaker: Using proteomic novelty to test clonality in outbreaks and epidemics
Updated: Sep 29, 2022
A Feder (1), AM Moustafa (2,3), PJ Planet (1,3)
1- Division of Pediatric Infectious Diseases, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
2- Division of Gastroenterology, Hepatology & Nutrition, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
3- Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
Genomic surveillance for emerging diseases, transmission events, epidemics, and outbreaks is becoming the gold standard for molecular epidemiology. Whole-genome phylogenetic analysis is the primary method for inferring the clonality (monophyly) of outbreaks and transmission patterns, but criteria for testing these inferences remain unclear. One approach is to include existing genomes from public databases to test the relationships inferred in the phylogeny. If genomes not associated with the outbreak or transmission event "break up" relationships in the tree by branching within putative outbreak clades, then the observed outbreak may not be clonal. With large genomic databases, however, it may not be clear which genomes to add to a phylogenetic analysis, and including all of them can become computationally prohibitive. We propose that the best way to test the hypothesis of clonality is to use the most similar genomes available in the database, since these are the genomes most likely to branch within the outbreak clade. If these genomes fail to break up the monophyly of the outbreak clade, this provides the strongest available evidence for clonality. Here we present Cladebreaker, an application that uses the topgenome function of WhatsGNU to quickly identify, using a measure of protein-level novelty, the genomes most similar to each genome under investigation. Cladebreaker is a Nextflow pipeline encompassing multiple tools. It takes in sequence reads, finds the most similar genomes in a database, and runs a full phylogenetic analysis from either a reference-based SNP matrix or concatenated amino acid sequences. The output is a phylogenetic tree containing both the best-hit genomes and the query genomes.
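The clade-breaking test described above can be illustrated with a minimal Python sketch. This is not Cladebreaker's implementation. It assumes a rooted tree written as nested tuples with genome names as leaves. The function names (`leaves`, `is_monophyletic`) and the genome labels (`Q1`, `DB1`, and so on) are hypothetical, chosen only for illustration.

```python
def leaves(node):
    """Return the set of leaf names under a node (a leaf is a str, a clade is a tuple)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the genomes in `group`."""
    group = set(group)
    stack = [tree]
    while stack:
        node = stack.pop()
        tips = leaves(node)
        if tips == group:
            return True
        # Only descend into clades that still contain every group member.
        if not isinstance(node, str) and group <= tips:
            stack.extend(node)
    return False


# Hypothetical query (outbreak) genomes and database best hits.
outbreak = {"Q1", "Q2", "Q3"}

# Best hits branch outside the outbreak clade: clonality is not rejected.
tree_clonal = ((("Q1", "Q2"), "Q3"), ("DB1", "DB2"))

# A database genome branches within the putative outbreak clade and breaks it.
tree_broken = ((("Q1", "DB1"), ("Q2", "Q3")), "DB2")

clonal_result = is_monophyletic(tree_clonal, outbreak)
broken_result = is_monophyletic(tree_broken, outbreak)
```

In this sketch, `clonal_result` is `True` and `broken_result` is `False`. In practice the result depends on how the tree is rooted, so a real analysis would need a defensible root (for example, an outgroup) before applying a check of this kind.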