Poster #56 - Ali Oku
- vitod24
- Oct 20
Enhancing Bulk RNA-Seq Deconvolution Using Atlas-Level Deep Learning Embeddings
Ali Oku, Rui Fu, Heather Geiger, Will Liao, Nicolas Robine
New York Genome Center
Cell type deconvolution estimates the relative abundances of cell types in bulk RNA-seq data, allowing researchers to infer tissue composition when single-cell RNA sequencing (scRNA-seq) is not feasible. However, conventional tools are often sensitive to batch effects and reference biases. To overcome these limitations, we evaluated deep generative and transformer-based models for more accurate estimation of cell type proportions. We focused on scVI, a probabilistic variational autoencoder (VAE) that learns low-dimensional latent representations while correcting batch effects, and on two large-scale transformer models, Geneformer and scGPT, trained on millions of single-cell profiles. Although not specifically designed for batch correction, transformer embeddings can reduce batch effects by capturing biologically meaningful patterns, and the models can be fine-tuned for diverse tasks.
To test whether such latent embeddings improve bulk deconvolution, we used single-cell data from the Human Endometrial Cell Atlas (HECA), spanning seven datasets and 87 donors. Pseudobulk profiles were generated by aggregating single-cell counts to simulate bulk data while preserving the true cell type proportions. We extracted latent embeddings from scVI (both a pre-trained model trained on ~75 million human cells and a model trained de novo) and from Geneformer (GF-12L-95M-i4096 model, with and without fine-tuning). These embeddings were applied in two strategies. First, we used them directly as input to non-negative least squares (NNLS) to estimate cell type proportions. Second, we trained random forest regressors on pseudobulk embeddings to predict cell type composition. We compared estimated and true proportions using mean squared error and benchmarked both strategies against conventional deconvolution methods. Latent embeddings consistently achieved accuracy competitive with or superior to that of conventional methods and were robust to batch effects and technical noise.
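The first strategy (pseudobulk aggregation, NNLS in an embedding space, and scoring by mean squared error) can be illustrated with a minimal sketch on simulated data. Everything in it is an assumption made for illustration, not the poster's actual pipeline: the toy Poisson counts, the three cell types, and in particular the `embed` function, which is a fixed random linear projection standing in for a learned encoder such as scVI or Geneformer. Whether mixtures stay approximately linear in a real model's latent space is not something this sketch addresses.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_types, n_latent = 50, 3, 10

# Hypothetical toy data: each cell type has its own mean expression profile.
type_means = rng.gamma(shape=2.0, scale=5.0, size=(n_types, n_genes))

def simulate_cells(labels):
    """Draw Poisson counts (cells x genes) for the given cell type labels."""
    return rng.poisson(type_means[labels])

# Stand-in for a learned encoder (assumption): a fixed linear projection.
projection = rng.normal(size=(n_genes, n_latent))

def embed(profile):
    """Map an expression profile (genes,) to a latent vector (n_latent,)."""
    return profile @ projection

# Reference: one mean embedding per cell type from annotated reference cells.
ref_labels = np.repeat(np.arange(n_types), 200)
ref_counts = simulate_cells(ref_labels)
reference = np.stack(
    [embed(ref_counts[ref_labels == k].mean(axis=0)) for k in range(n_types)]
)  # shape: (n_types, n_latent)

# Pseudobulk: aggregate held-out cells whose composition is known.
cells_per_type = np.array([300, 150, 50])
true_props = cells_per_type / cells_per_type.sum()
bulk_labels = np.repeat(np.arange(n_types), cells_per_type)
bulk_counts = simulate_cells(bulk_labels)
pseudobulk = bulk_counts.sum(axis=0) / len(bulk_labels)  # summed counts, scaled per cell

# NNLS solves min ||A x - b||_2 subject to x >= 0, with A of shape (n_latent, n_types).
coef, _ = nnls(reference.T, embed(pseudobulk))
est_props = coef / coef.sum()  # rescale so the estimates sum to 1

mse = float(np.mean((est_props - true_props) ** 2))
print("estimated:", est_props.round(3), "true:", true_props, "MSE:", mse)
```

Because the pseudobulk is built from cells with known labels, the true proportions are available for scoring, which is the property the abstract relies on when comparing estimated and true compositions.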
These results demonstrate that deep learning latent embeddings can yield promising performance in cell type deconvolution. Notably, embeddings from pre-trained models performed well without additional fine-tuning, highlighting their potential for robust and efficient bulk RNA-seq deconvolution.

