• mabc307

LongReadSum: A fast and flexible quality control tool for long-read sequencing data

Updated: Sep 29

[1] Jonathan Elliot Perdomo, BA. Children's Hospital of Philadelphia. [2] Mian Umair Ahsan, MS. Children's Hospital of Philadelphia. [3] Qian Liu, Ph.D. Children's Hospital of Philadelphia. [4] Li Fang, Ph.D. Children's Hospital of Philadelphia. [5] Kai Wang, Ph.D. Children's Hospital of Philadelphia.

Recent advances in long-read sequencing technologies have generated reads tens to thousands of kilobases long with high accuracy. These long reads have a broad range of applications in genomics, including but not limited to uncovering previously missed genetic causes of human diseases, detecting variants in difficult-to-map regions, assembling repetitive regions of the human genome, and identifying novel splicing isoforms. Prior to analyzing long-read data, quality control (QC) checks are needed to ensure that the raw data from sequencers does not contain substantial errors and biases, and to understand basic characteristics of the sequencing run. While short-read sequencing technologies have well-established QC tools such as FastQC, there have been several challenges in implementing QC frameworks for long reads. First, long-read data can be generated by different sequencing technologies (such as PacBio and Oxford Nanopore) with unique data formats, and currently available QC tools usually support only one sequencing platform or only some data formats. Second, long-read data can be orders of magnitude larger than short-read data (for example, a single flow cell on the Nanopore platform can generate 5 TB of signal data), and thus it remains imperative to develop a tool that computes comprehensive QC summary statistics quickly and at high throughput. LongReadSum is a tool that addresses these challenges: it supports the main data formats used across sequencing technologies (FASTA, FASTQ, FAST5, unaligned BAM, and aligned BAM) and can generate a comprehensive summary of different aspects of sequencing data in a timely manner by executing programs in a flexible multi-threaded C++ framework.
Outputs are compiled into both a static and a dynamic HTML report, customizable to the user's needs, which contains basic statistics such as the total number of reads and base pairs; the maximum, mean, and median read length; the percent guanine-cytosine (GC) content; and the N50. The report also includes histogram plots of read length, base quality across reads, and average base quality per read. In addition, for Nanopore reads, we include platform-specific QC measures such as sequencing throughput over time. These statistics provide a comprehensive overview of all major aspects of read quality, enabling the identification of significant errors that may preclude downstream analyses. In conclusion, LongReadSum is a computational tool for fast, comprehensive, and high-throughput long-read QC with support for all major sequencer data types, and it can be adapted to custom long-read sequencing pipelines.
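
To make the basic statistics named above concrete, the following is a minimal Python sketch of how read count, total bases, maximum, mean, and median read length, percent GC content, and N50 can be computed from a list of read sequences. It is an illustration only and is not LongReadSum's own C++ implementation; the function names (`n50`, `gc_percent`, `summarize`) and the in-memory list of reads are assumptions made for this example. N50 is taken here in its usual sense: the largest length L such that reads of length L or longer contain at least half of all bases.

```python
from statistics import mean, median


def n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def gc_percent(reads):
    """Return the percentage of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    if total == 0:
        return 0.0
    gc = sum(read.upper().count("G") + read.upper().count("C") for read in reads)
    return 100.0 * gc / total


def summarize(reads):
    """Collect basic length and composition statistics (hypothetical helper)."""
    lengths = [len(read) for read in reads]
    if not lengths:
        return {"reads": 0, "bases": 0, "max": 0, "mean": 0.0,
                "median": 0.0, "gc_percent": 0.0, "n50": 0}
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "gc_percent": gc_percent(reads),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy reads for illustration; real input would come from FASTA/FASTQ/BAM.
    example_reads = ["ACGT", "GGCC", "AT", "ACGTACGTAA"]
    for name, value in summarize(example_reads).items():
        print(f"{name}: {value}")
```

A production tool would stream reads from disk and accumulate these values incrementally rather than holding every sequence in memory, which matters at the data volumes described above.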
