
Poster #30 - Yizhou Tracy Wang

  • vitod24
  • Oct 20

Billion-level TCR clustering analysis with GIANAplus


Yizhou Tracy Wang (PhD Student, University of Pennsylvania Perelman School of Medicine), David Chen (High School Student, Tower Hill School), Bo Li (PhD, The Children's Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine)


T cells play a critical role in adaptive immunity, and T cell receptors (TCRs) determine their specificity to antigens presented as peptide-MHC (pMHC) complexes. TCRs are highly diverse (theoretically around 10^15 possible sequences across individuals), enabling them to recognize the diverse and constantly evolving antigens from cancer and pathogens. Despite this diversity, research has shown that TCRs sharing similar amino acid sequences tend to recognize the same antigen. Understanding TCR sequence-based clustering behavior across large populations can thus enable early disease detection and guide the development of immunotherapies for cancer and autoimmune diseases, with direct clinical and translational impact. Billions of TCR sequences have already been generated with multiple technologies, and this number keeps increasing rapidly as new TCR sequencing technologies are developed. However, the full potential of this large and growing volume of data has yet to be realized, owing to the lack of TCR clustering tools that are computationally tractable at the scale of billions of sequences: with Smith-Waterman (SW) alignment-based methods, it takes approximately 320 years to align 100 million sequences. GIANA is by far one of the fastest TCR clustering tools available, with performance comparable to state-of-the-art tools as benchmarked by a third party. GIANA achieves its high speed by isometrically embedding TCR sequences in a high-dimensional Euclidean space, followed by k-mer guided Smith-Waterman alignment, reducing the computational complexity from quadratic pairwise comparison to logarithmic time. TCR repertoire analysis with GIANA can group individuals together based on their cancer status and COVID-19 history. However, it is still intractable for GIANA to cluster sequences on the scale of billions.
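To make the scaling problem concrete, the 320-year figure can be turned into a rough back-of-the-envelope calculation. The sketch below assumes that the figure refers to exhaustive all-against-all alignment at a constant rate on fixed hardware; that reading, the 365.25-day year, and all function names are assumptions made for illustration, not details given in the abstract.

```python
# Back-of-the-envelope cost of all-against-all alignment (illustrative only).
# Assumption: the quoted 320 years covers every unordered pair among
# 100 million sequences, aligned at a constant rate.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def pair_count(n_sequences):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def implied_rate(n_sequences, years):
    """Alignments per second implied by finishing all pairs in `years`."""
    return pair_count(n_sequences) / (years * SECONDS_PER_YEAR)


def projected_years(n_sequences, rate):
    """Years needed to align all pairs at `rate` alignments per second."""
    return pair_count(n_sequences) / rate / SECONDS_PER_YEAR


rate = implied_rate(100_000_000, 320)
years_for_billion = projected_years(1_000_000_000, rate)

print(f"pairs among 1e8 sequences: {pair_count(100_000_000):.3e}")
print(f"implied rate: {rate:,.0f} alignments per second")
print(f"projected time for 1e9 sequences: {years_for_billion:,.0f} years")
```

Because the number of pairs grows quadratically, a tenfold increase in sequences costs roughly a hundredfold more time under these assumptions, which is why an approach that avoids pairwise alignment is needed at the billion scale.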
We therefore propose GIANAplus, which uses an improved isometric embedding that gives a near-exact estimate of the BLOSUM62 matrix, achieved by iterative application of multidimensional scaling. This eliminates the need for SW alignment and increases clustering speed by a factor of 30. In addition, we developed a novel gap-allowing mathematical transformation to handle TCRs of different lengths. With further computational and mathematical improvements, we expect GIANAplus to achieve accuracy comparable to other state-of-the-art methods, while making it possible to integrate TCR sequencing data on the scale of billions within days.
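The idea of replacing alignment with Euclidean geometry can be illustrated on a toy scale. The sketch below is not the GIANAplus embedding: it swaps the iterative multidimensional scaling described above for a plain Cholesky factorization, and it uses only the 4-by-4 block of BLOSUM62 for residues A, R, N and D, chosen because that block happens to be positive definite, so an exact factorization exists. All function names are hypothetical. It shows that once each residue is a vector, the ungapped substitution score between two equal-length sequences can be read off as a dot product of their embeddings, with no alignment step.

```python
import math

# BLOSUM62 scores for residues A, R, N, D only (illustrative sub-alphabet).
RESIDUES = "ARND"
SCORES = [
    [4, -1, -2, -2],
    [-1, 5, 0, -2],
    [-2, 0, 6, 1],
    [-2, -2, 1, 6],
]


def cholesky(matrix):
    """Return lower-triangular L such that L times its transpose equals matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


# Row i of the factor is the vector for residue i; dot products of these
# vectors reproduce the substitution scores.
RESIDUE_VECTORS = dict(zip(RESIDUES, cholesky(SCORES)))


def embed(sequence):
    """Concatenate per-residue vectors into one point in Euclidean space."""
    return [x for residue in sequence for x in RESIDUE_VECTORS[residue]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ungapped_score(s, t):
    """Sum of substitution scores position by position (equal lengths, no gaps)."""
    index = {residue: i for i, residue in enumerate(RESIDUES)}
    return sum(SCORES[index[a]][index[b]] for a, b in zip(s, t))


s, t = "ARND", "RNDA"
print(ungapped_score(s, t), round(dot(embed(s), embed(t)), 6))
```

This toy handles only equal-length sequences over four residues. An exact factorization like this is not assumed to exist for the full 20-residue matrix, and sequences of different lengths need a separate treatment such as the gap-allowing transformation mentioned above.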

 
 
 
