Using combinatorial hash-joins to identify SARS-CoV-2 variants.
Updated: Sep 29
S Haag1, A Feder2, AM Moustafa3,4, PJ Planet2,4 1-Arcus, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 2- Division of Pediatric Infectious Diseases, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 3- Division of Gastroenterology, Hepatology & Nutrition, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA 4- Department of Pediatrics, Perelman College of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
The number of SARS-CoV-2 genomic sequences far outweighs all the genomes ever sequenced to date of all other organisms combined, with some 10 million known variants of SARS-CoV-2. Existing phylogenetic techniques to identify variants that have 3 or less allele differences compare each strain to all other strains, resulting in a computational complexity of Q(N2) where N is the number of variants. To solve this problem, we have developed a novel technique that uses combinatorial hash joins to identify variants with 3 or less allele differences. This new algorithm has a computational complexity of Q(N+E), where E is the number of variants pairs that have 3 or less allele differences. This new algorithm reduced processing times from > 10 days to < 1 hour for the SARS-CoV-2 problem. This work can be applied to other organisms with similar expected reductions in running times.