ScatTR: Monte Carlo Sampling for Long Tandem Repeat Length Estimation
Al-Abri, R., B.S. (Department of Computer Science, Columbia University); Gürsoy, G., Ph.D. (Department of Computer Science, Columbia University; New York Genome Center)
Poster # 61
Accurately determining the length of tandem repeats (TRs), tracts of repeating sequence found throughout the genome, is a crucial task in genomic research. TRs are implicated in over 30 diseases, including Friedreich's ataxia, autism, and cancer. TR expansions disrupt genomic stability and can alter gene function, so precise estimation of TR lengths is essential for understanding disease pathogenesis. Current methods, such as ExpansionHunter and STRling, predict the lengths of short TRs well, but they struggle when a TR is much longer than the typical fragment length in whole-genome sequencing. To address this, we introduce ScatTR, which treats read alignments as samples from a probability distribution and explores them with Monte Carlo sampling. Our algorithm uses an efficient data structure to randomly place reads at plausible positions, and it accepts moves based on an energy function proportional to the distance between the resulting alignment and the expected sequencing signal distribution. We run the algorithm in parallel on decoy references that contain different lengths of the repeat, and we estimate the length of the expansion from the highest-scoring solution found. In contrast to deterministic approaches, this algorithm embraces uncertainty to discover likely true alignment solutions. Our preliminary data show that ScatTR outperforms existing methods on simulated data. Compared with the state-of-the-art tools STRling and ExpansionHunter, ScatTR accurately estimates the length of repeat expansions up to 15 Mbp long, while the other tools are limited to 300-600 bp. ScatTR offers a novel solution for estimating the length of long TRs, addressing a significant challenge in bioinformatics. Its application can enhance our understanding of the role of TRs in regulation and in various diseases, and may inform therapeutic strategies.
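The sampling idea described in the abstract can be illustrated with a toy Metropolis sampler. This is a minimal sketch under simplifying assumptions, not ScatTR's implementation: every read is assumed to lie entirely inside the repeat, the "expected sequencing signal" is reduced to a uniform expected depth, the energy is the L1 distance between observed and expected depth, and candidate lengths are scored sequentially rather than in parallel. All function and parameter names (`energy`, `sample_alignment`, `estimate_repeat_length`, `temperature`) are hypothetical.

```python
import math
import random


def energy(coverage, expected_depth):
    """L1 distance between per-position coverage and a flat expected depth
    (an assumed stand-in for the expected sequencing signal)."""
    return sum(abs(c - expected_depth) for c in coverage)


def sample_alignment(n_reads, read_len, repeat_len, expected_depth,
                     steps=2000, temperature=1.0, seed=0):
    """Metropolis search over read placements on one decoy repeat.
    Returns the lowest energy seen (lower energy = higher score)."""
    rng = random.Random(seed)
    max_start = repeat_len - read_len
    # Start from a random placement of every read inside the repeat.
    starts = [rng.randint(0, max_start) for _ in range(n_reads)]
    coverage = [0] * repeat_len
    for s in starts:
        for p in range(s, s + read_len):
            coverage[p] += 1
    current = energy(coverage, expected_depth)
    best = current
    for _ in range(steps):
        # Propose moving one randomly chosen read to a new position.
        i = rng.randrange(n_reads)
        old, new = starts[i], rng.randint(0, max_start)
        for p in range(old, old + read_len):
            coverage[p] -= 1
        for p in range(new, new + read_len):
            coverage[p] += 1
        proposed = energy(coverage, expected_depth)
        delta = proposed - current
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            starts[i] = new
            current = proposed
            best = min(best, current)
        else:
            # Rejected: restore the previous coverage.
            for p in range(new, new + read_len):
                coverage[p] -= 1
            for p in range(old, old + read_len):
                coverage[p] += 1
    return best


def estimate_repeat_length(n_reads, read_len, expected_depth, candidate_lengths):
    """Score each candidate decoy length and return the one with the lowest energy."""
    scores = {
        length: sample_alignment(n_reads, read_len, length, expected_depth)
        for length in candidate_lengths
    }
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # 80 reads of 50 bp at depth 10 correspond to about 400 bp of repeat.
    print(estimate_repeat_length(80, 50, 10, [200, 400, 800]))
```

In this toy setting, a decoy that is too short forces coverage above the expected depth and a decoy that is too long leaves it below, so the energy is lowest near the length that matches the number of in-repeat reads.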
LIGHTNING TALK - 2023 MidAtlantic Bioinformatics Conference