Sparse PCA resolves Drosophila intestinal stem cell genomics in scRNA-seq
Daniel Ringwalt, Department of Biology, Krieger School of Arts and Sciences, Johns Hopkins University, Baltimore, MD; Zachary Lubberts, Department of Statistics, University of Virginia
Poster # 74
The Drosophila midgut undergoes continual turnover, with tissue cells replenished from a reservoir of intestinal stem cells (ISCs). This makes it a key system for the study of somatic stem cells, and identifying ISCs and their enteroblast (EB) progeny in single-cell RNA-seq (scRNA-seq) would reveal a genomic profile of this system. Recovering ISC and EB identities, however, is a notably ambiguous problem in single-cell data. To study data preparation in single-cell genomics, we present a reanalysis of a batch of midgut cells profiled by scRNA-seq. The authors of the original study noted that cells expressing ISC and EB marker genes (in this particular technical replicate and technology) are generally not separable.

We applied a sparsity constraint (Sparse PCA, SPCA) to the feature loadings of each sparse principal component (SPC) of the log-normalized data. Computing the exact SPC that maximizes explained variance is an NP-hard problem, but we show that an approximate solution obtained with a heuristic still improves on traditional PCA. Our SPCs are nonnegative and can be presented as small gene sets, making them much more interpretable than traditional principal components. As an example of the utility of our approach, we show that the EB cell type is explained by an SPC that includes Notch, which is known to induce EB differentiation, at least 3 Notch-correlated genes, and a few others. The SPCA constraint therefore gives an enhanced quantification of Notch. The SPCA model, which retains only 237 genes for clustering, increases the magnitude of differential expression (DE) log-fold changes for other marker genes (e.g. Delta, peb) and diminishes the significance of unexpected genes (e.g. amylase) in ISCs and EBs, compared with every PCA model tested (each using 2,000 genes). These results show that highly sparse dimensionality reduction is a promising strategy for resolving ambiguous identities in single-cell data.
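The abstract does not say which heuristic was used to approximate each SPC. As an illustration only, the following Python sketch shows one common choice: a truncated power iteration with a nonnegativity clip, run on synthetic data. The function name `sparse_nonneg_pc`, the sparsity parameter `k`, and the planted five-gene block are hypothetical and are not the authors' method or data.

```python
import numpy as np

def sparse_nonneg_pc(X, k, n_iter=200, tol=1e-10):
    """Heuristically approximate one nonnegative sparse principal component.

    X is a cells-by-genes matrix (assumed already log-normalized).
    Returns a unit-norm loading vector with at most k nonzero, nonnegative entries.
    This is an illustrative heuristic, not an exact solver for the NP-hard problem.
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    cov = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance of genes
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))             # uniform nonnegative start
    for _ in range(n_iter):
        v = cov @ w                              # power-iteration step
        v = np.clip(v, 0.0, None)                # enforce nonnegative loadings
        if k < p:
            smallest = np.argpartition(v, -k)[:-k]
            v[smallest] = 0.0                    # keep only the k largest loadings
        norm = np.linalg.norm(v)
        if norm == 0.0:                          # degenerate case: stop early
            break
        w_new = v / norm
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w

# Synthetic example (hypothetical data): 300 cells, 50 genes,
# where genes 0..4 share one latent factor and the rest are noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal(300)
X = rng.standard_normal((300, 50))
X[:, :5] += 3.0 * latent[:, None]

w = sparse_nonneg_pc(X, k=5)
gene_set = np.flatnonzero(w)                     # the component read as a small gene set
print(gene_set, w[gene_set])
```

Because the loadings are nonnegative and mostly zero, the component can be read directly as a short list of genes, which is the interpretability property described above. In practice `k` would need to be tuned, and further components would require a deflation step that this sketch omits.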