Understanding common and distinct information in paired multiomic data with Tilted-CCA
Updated: Sep 29, 2022
Dr. Kevin Lin*, Dr. Nancy R. Zhang*
*Wharton Statistics and Data Science, University of Pennsylvania
Paired multiomic single-cell datasets of two modalities (for example, RNA and protein) offer biologists insight into cell-type identification at single-cell resolution via existing dimension-reduction methods that aggregate information from both modalities, but methods are lacking that formalize which information is common to both modalities and which is distinct to each modality. We develop a new dimension-reduction method called Tilted-CCA to fill this gap. It extracts geometric information from the nearest-neighbor graphs and builds upon the statistical foundation of Canonical Correlation Analysis (CCA). Tilted-CCA decomposes each modality's expression matrix into a common embedding (representing expression patterns that are shared between the two modalities) and a distinct embedding (representing expression patterns that are unique to each modality). The common embedding encapsulates the "intersection of information": cells are separated in this embedding only if they are separable in both modalities. We demonstrate the downstream utility of Tilted-CCA for multiomic data of different modalities and biological systems. First, for RNA+Protein data such as CITE-seq, we show that Tilted-CCA offers insight into designing the smallest antibody panel whose selected surface antibody markers provide informative cell-type separation patterns not reflected in RNA. Second, for RNA+ATAC data such as the 10x multiome applied to developmental systems, we show that the common embedding reveals which cells are in a terminal cell state (based on how synchronized the RNA and ATAC modalities are) and clarifies the relation between a gene's transcription and the chromatin accessibility of its cis-regulatory regions.
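
For readers unfamiliar with the CCA foundation mentioned above, the following is a minimal sketch of classical CCA on two paired matrices (rows are the same cells, columns are features of each modality). It is not Tilted-CCA: the nearest-neighbor-graph step and the decomposition into common and distinct embeddings are not reproduced here. The function name `cca`, the tolerance `eps`, and the simulated data are assumptions made purely for illustration.

```python
import numpy as np

def cca(X, Y, k, eps=1e-8):
    """Classical CCA via SVD (illustrative sketch, not Tilted-CCA).

    X, Y: arrays of shape (n_cells, p) and (n_cells, q) for paired cells.
    Returns canonical score matrices Zx, Zy and canonical correlations rho.
    """
    # Center each feature so that inner products correspond to covariances.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Orthonormal bases for the column space of each modality.
    Ux, sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux = Ux[:, sx > eps * sx.max()]  # drop numerically null directions
    Uy = Uy[:, sy > eps * sy.max()]

    # Singular values of Ux^T Uy are the canonical correlations.
    A, rho, Bt = np.linalg.svd(Ux.T @ Uy, full_matrices=False)
    k = min(k, rho.size)

    Zx = Ux @ A[:, :k]     # canonical scores for modality 1
    Zy = Uy @ Bt.T[:, :k]  # canonical scores for modality 2
    return Zx, Zy, rho[:k]

# Simulated example (hypothetical data): one shared signal plus
# modality-specific noise features.
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=(n, 1))
X = np.hstack([shared + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Y = np.hstack([shared + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 3))])

Zx, Zy, rho = cca(X, Y, k=2)
print("canonical correlations:", rho)
```

In this simulation the leading canonical pair picks up the signal present in both matrices, while later pairs have much weaker correlation. Plain CCA only finds maximally correlated directions; deciding how much of each modality's structure counts as common versus distinct is the question the abstract says Tilted-CCA addresses.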