ContextSV: A novel computational method for calling structural variants and integrating information
Jonathan Elliot Perdomo, PhD Candidate
Affiliations: 1. School of Biomedical Engineering, Drexel University, Philadelphia, PA 19104, USA; 2. Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
Kai Wang, Professor of Pathology and Laboratory Medicine
Affiliations: 1. Perelman School of Medicine at the University of Pennsylvania; 2. Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA
Poster # 2
Structural variants (SVs) are genomic alterations larger than 50 bp and form the largest source of human genome variation. Identifying SVs associated with clinical phenotypes supports clinical diagnosis and allows researchers to investigate potential molecular mechanisms. Emerging long-read sequencing platforms provide the resolution required to resolve larger and more complex SVs. Nevertheless, variable error rates in these technologies can result in a high false positive rate and low robustness for SV detection. The rich repertoire of available technologies, including short-read sequencing and optical mapping, can be leveraged to address these limitations. Here we introduce ContextSV, a novel SV calling method that uses a hybrid approach to improve accuracy and robustness: long-read data are used to identify SV candidates, short reads yield high-accuracy sites for resolving breakpoints in complex SVs, and optical maps provide long-range scaffolds for high-quality read assembly prior to running SV detection algorithms. To further improve accuracy, we train a binary classification model to score candidate SVs based on coverage and genomic context, which are key SV validation features, and use the scores to filter out low-likelihood SVs. Finally, we plan to incorporate support for pangenome graph reference formats in ContextSV. A pangenome represents common haplotypes in the human population better than a single linear reference genome does, and would therefore form a more comprehensive reference for SV identification. Large collaborative efforts, including the Human Pangenome Reference Consortium (HPRC), aim to release a pangenome representing a large, diverse set of human genome sequences, so it is increasingly important for future SV callers to support graph references.
In summary, ContextSV captures large, complex SVs with high accuracy and robustness by leveraging information across multiple technologies and using a machine learning model to compute confidence scores, with planned support for pangenome graph references.
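The scoring-and-filtering step described above can be illustrated with a minimal Python sketch. It is not the ContextSV implementation: the `SVCandidate` fields, the `coverage_support` helper, the fixed `WEIGHTS`, and the 0.5 threshold are all hypothetical, and a logistic function with hand-picked weights stands in for the trained binary classifier.

```python
import math
from dataclasses import dataclass


@dataclass
class SVCandidate:
    chrom: str
    start: int
    end: int
    sv_type: str           # e.g. "DEL" or "DUP"
    coverage_ratio: float  # hypothetical feature: depth inside the SV / flanking depth
    in_repeat: bool        # hypothetical genomic-context feature


# Illustrative weights only (assumption); a real model would learn these
# from labeled true and false SV calls.
WEIGHTS = {"bias": -1.0, "coverage_support": 4.0, "in_repeat": -1.5}


def coverage_support(c: SVCandidate) -> float:
    """Map the coverage ratio to [0, 1] support for the called SV type."""
    if c.sv_type == "DEL":
        # A deletion is expected to show reduced depth.
        return min(max(1.0 - c.coverage_ratio, 0.0), 1.0)
    if c.sv_type == "DUP":
        # A duplication is expected to show increased depth.
        return min(max(c.coverage_ratio - 1.0, 0.0), 1.0)
    return 0.5  # neutral for types where depth is uninformative


def score(c: SVCandidate) -> float:
    """Logistic score in (0, 1); higher means more likely a true SV."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["coverage_support"] * coverage_support(c)
         + WEIGHTS["in_repeat"] * float(c.in_repeat))
    return 1.0 / (1.0 + math.exp(-z))


def filter_candidates(cands, threshold=0.5):
    """Keep candidates whose score meets the threshold."""
    return [c for c in cands if score(c) >= threshold]


candidates = [
    # Heterozygous-like deletion with a clear depth drop, outside repeats.
    SVCandidate("chr1", 10_000, 12_000, "DEL", coverage_ratio=0.5, in_repeat=False),
    # Deletion call with no depth drop, inside a repeat region.
    SVCandidate("chr1", 50_000, 50_400, "DEL", coverage_ratio=1.0, in_repeat=True),
]
kept = filter_candidates(candidates)
```

In this toy input, only the first candidate passes the filter, because its coverage drop supports the deletion call while the second candidate has neither coverage support nor a favorable context.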