Ambient RNA in scRNA-seq and how to remove it

Jonathan Alles

EVOBYTE Digital Biology

Introduction

If you’ve ever found T cell markers bleeding into monocyte clusters or insulin transcripts mysteriously showing up outside beta cells, you’ve met ambient RNA. It’s the invisible fog that settles over droplet-based single‑cell RNA‑seq (scRNA‑seq) data and, left unchecked, it nudges clusters together, mutes marker specificity, and throws off downstream biological stories. Fortunately, we don’t have to accept that uncertainty. In this post, we unpack what ambient RNA is, why it matters, and how DecontX decontaminates your count matrix with a practical, model‑based approach. We’ll close with a short code example and a quick tour of alternatives so you can pick the right tool for your dataset.

Ambient RNA: a clear definition and a real problem

Ambient RNA refers to cell‑free messenger RNA floating in the capture solution during droplet encapsulation. These transcripts leak from stressed, lysed, or dying cells and then hitch a ride into droplets where they are barcoded and reverse‑transcribed along with the RNA from a genuine cell. Even droplets with no cell can acquire a measurable transcript profile purely from this background pool. In practice, the ambient pool often reflects highly expressed genes from abundant cell types, which is why you see hallmark markers faintly expressed in unrelated clusters. That background signal is not biology; it is contamination.

This matters because core analysis steps—clustering, differential expression, and cell type annotation—assume cell‑specific expression. Ambient RNA breaks that assumption. It can make distinct cell populations appear closer in latent space, blur marker boundaries, and seed false positives in differential tests. In complex tissues or necrotic samples, contamination fractions can vary widely among droplets, which means a simple global correction rarely suffices. Tools that explicitly model and remove ambient contributions are therefore essential quality steps rather than optional polish.

How contamination skews downstream analysis

The most visible effect is marker “leakage.” Suppose you profile a tumor microenvironment with abundant tumor cells expressing KRT genes and a minority immune compartment. Without decontamination, low‑level keratin reads may show up in T cells, confusing annotation and masking subtle immune heterogeneity. The bias doesn’t stop at labels. Ambient RNA can:

- Inflate apparent co‑expression of genes that are not co‑regulated in the same cell.
- Distort cluster boundaries, especially when highly expressed markers from dominant cell types bleed into neighbors.
- Mislead pathway and ligand–receptor analyses that depend on precise presence/absence calls.

Researchers have documented these artifacts across platforms and datasets, and removing ambient signal generally improves biological interpretability, provided the correction is not so aggressive that it strips genuine cell‑intrinsic expression.
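
To make marker leakage concrete, here is a toy simulation in base R. Everything in it is hypothetical: the three genes, the expression profiles, the 10% contamination fraction, and the sequencing depth are illustration values, not estimates from real data. It draws T cell counts with and without a tumor‑dominated ambient pool mixed in.

```r
# Toy simulation (all values hypothetical): two cell types, one marker each.
set.seed(1)
n_cells <- 200
genes <- c("KRT8", "CD3E", "ACTB")

# Native expression profiles (probabilities over genes) for each population
native <- rbind(
  tumor = c(0.60, 0.00, 0.40),
  tcell = c(0.00, 0.50, 0.50)
)
colnames(native) <- genes

# Tumor cells dominate the sample, so the ambient pool is mostly tumor-derived
ambient <- 0.9 * native["tumor", ] + 0.1 * native["tcell", ]

# Draw counts for n cells whose reads are a mixture of native and ambient RNA
simulate_cells <- function(profile, n, depth, contam) {
  mix <- (1 - contam) * profile + contam * ambient
  counts <- rmultinom(n, size = depth, prob = mix)  # genes x cells
  rownames(counts) <- genes
  counts
}

clean_t <- simulate_cells(native["tcell", ], n_cells, depth = 1000, contam = 0)
dirty_t <- simulate_cells(native["tcell", ], n_cells, depth = 1000, contam = 0.1)

# Fraction of T cells with at least one keratin read
frac_krt_clean <- mean(clean_t["KRT8", ] > 0)
frac_krt_dirty <- mean(dirty_t["KRT8", ] > 0)
```

In this setup the uncontaminated T cells never show keratin reads, while nearly every contaminated T cell does, even though no T cell expresses the gene. Real data are noisier, but the mechanism is the same.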

DecontX: Bayesian decontamination that respects cell identity

DecontX approaches the problem with a simple idea grounded in a rigorous model. It assumes that each cell’s observed counts are a mixture of two sources: native expression from the cell’s true population and contaminating counts drawn from the ambient pool, itself approximated by expression from other populations in the dataset. Rather than subtracting a fixed background, DecontX estimates, for every cell, the fraction of contamination and the gene‑level distribution of those contaminants. It uses cell cluster labels as context because population structure helps define what “other populations” look like; if you don’t supply them, DecontX can derive clusters internally. The inference is performed using a fast variational Bayesian procedure and yields a decontaminated count matrix plus a per‑cell contamination fraction, which you can feed into your usual downstream workflow.

A nice practical detail is that DecontX adapts to heterogeneous contamination. In mixed‑species or multiplexed datasets, it recovers contamination levels that agree with ground truth, and in PBMC data it removes spurious low‑level marker expression across immune subsets while preserving true signals. The result is crisper clusters and more faithful marker specificity, especially for highly expressed genes that tend to dominate the ambient pool.
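
The mixture idea can be sketched in a few lines of base R. This is a deliberately simplified illustration, not DecontX's implementation: it assumes the native profile (`phi`) and the contamination profile (`eta`) are already known and estimates only one cell's contamination fraction (`theta`) with a plain EM loop, whereas DecontX infers all of these jointly with variational Bayes. The function name, gene names, and profile values are hypothetical.

```r
# Simplified sketch: EM estimate of one cell's contamination fraction (theta),
# assuming the native (phi) and contamination (eta) gene profiles are known.
estimate_theta <- function(x, phi, eta, n_iter = 100, theta = 0.5) {
  for (i in seq_len(n_iter)) {
    # E-step: probability that a read of each gene came from contamination
    resp <- theta * eta / ((1 - theta) * phi + theta * eta)
    resp[is.nan(resp)] <- 0  # genes absent from both profiles
    # M-step: expected share of contaminating reads in this cell
    theta <- sum(x * resp) / sum(x)
  }
  list(theta = theta,
       native = x * (1 - resp),        # expected native counts per gene
       contamination = x * resp)       # expected contaminating counts per gene
}

set.seed(42)
phi <- c(geneA = 0.70, geneB = 0.00, geneC = 0.30)  # hypothetical native profile
eta <- c(geneA = 0.05, geneB = 0.65, geneC = 0.30)  # hypothetical profile of other populations
true_theta <- 0.15

# Simulate one cell with 5000 reads drawn from the mixture
x <- rmultinom(1, size = 5000,
               prob = (1 - true_theta) * phi + true_theta * eta)[, 1]
fit <- estimate_theta(x, phi, eta)
```

Here `fit$theta` should land close to the simulated 0.15, and `fit$native` is non‑integer because reads are split probabilistically between the two sources. Genes that are equally abundant in both profiles (like `geneC`) carry no information about the fraction, which is one reason informative population structure matters.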

Quickstart: running DecontX in R

You can drop DecontX into an existing SingleCellExperiment or Seurat‑to‑SCE pipeline. It works best after basic droplet QC—ideally with a cell‑calling step like EmptyDrops—so that the input contains genuine cells rather than empty droplets masquerading as low‑RNA cells.

Here’s a minimal example in R that assumes you have a counts matrix and optional cluster labels. If you don’t have labels yet, DecontX will cluster internally.

# install.packages("BiocManager"); BiocManager::install(c("celda","SingleCellExperiment"))
library(celda)
library(SingleCellExperiment)

# sce holds raw UMI counts (genes x cells)
# Optionally, add your own cluster labels (e.g., from Seurat/Scanpy):
# colData(sce)$cluster <- my_labels

set.seed(123)
sce_dc <- decontX(sce, z = colData(sce)$cluster)

# Access decontaminated counts and per-cell contamination fractions
clean_counts <- decontXcounts(sce_dc)
contam_frac  <- colData(sce_dc)$decontX_contamination
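
Once you have the per‑cell fractions, a quick summary helps you spot heavily contaminated cells before moving on. The helper below is a hypothetical convenience function in base R, and the 0.5 cutoff is an arbitrary illustration rather than a recommended threshold.

```r
# Summarize per-cell contamination fractions and flag heavily contaminated cells.
# The cutoff is an arbitrary illustration, not a recommended threshold.
summarize_contamination <- function(contam_frac, cutoff = 0.5) {
  stopifnot(is.numeric(contam_frac),
            all(contam_frac >= 0 & contam_frac <= 1))
  list(
    quantiles = quantile(contam_frac, probs = c(0.25, 0.5, 0.75)),
    n_flagged = sum(contam_frac > cutoff),
    flagged   = which(contam_frac > cutoff)  # indices of flagged cells
  )
}

# Hypothetical fractions for six cells
res <- summarize_contamination(c(0.02, 0.05, 0.10, 0.08, 0.65, 0.03))
```

In a real analysis you would pass the vector of fractions extracted above and then decide, based on the distribution, whether flagged cells deserve a closer look or removal.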

If you are starting from raw 10x output, consider calling cells with EmptyDrops before DecontX; it models whether a barcode’s profile deviates from the ambient pool and helps avoid passing empty droplets downstream. That separation of concerns—first decide which barcodes are cells, then decontaminate cell profiles—reduces over‑correction risks and speeds analysis.
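
A minimal sketch of that cell‑calling step, using `emptyDrops()` from the Bioconductor package DropletUtils, might look like the following. The wrapper name `call_cells`, the input `raw_counts` (an unfiltered genes x barcodes matrix), and the two thresholds are assumptions for illustration; tune them for your own data.

```r
# Sketch: call cells with DropletUtils::emptyDrops before decontamination.
# `raw_counts` is an assumed unfiltered genes x barcodes count matrix.
# emptyDrops uses Monte Carlo p-values, so call set.seed() beforehand
# if you need reproducible results.
call_cells <- function(raw_counts, lower = 100, fdr = 0.001) {
  out <- DropletUtils::emptyDrops(raw_counts, lower = lower)
  # Barcodes with total counts at or below `lower` get NA FDR; treat them as empty
  is_cell <- !is.na(out$FDR) & out$FDR <= fdr
  raw_counts[, is_cell, drop = FALSE]
}
```

The returned matrix contains only barcodes that differ significantly from the ambient profile and can be wrapped in a SingleCellExperiment for the DecontX call shown earlier.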

Choosing among alternatives: DecontX, SoupX, and CellBender

No single method dominates every dataset, so it helps to know the trade‑offs.

SoupX estimates the ambient expression profile from empty droplets, infers a contamination fraction (by default a single global value per channel rather than one per cell), and adjusts counts accordingly. It integrates easily with popular R workflows and performs well when empty droplets are plentiful and representative. Because it relies on that ambient profile, thoughtful parameterization can make a noticeable difference, especially in necrotic or unusually sparse datasets.
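
For comparison, a typical SoupX run on Cell Ranger output can be sketched as below. The wrapper name `run_soupx` and its argument are hypothetical; the sketch assumes a standard Cell Ranger output folder that includes both raw and filtered matrices plus Cell Ranger's clustering, which the automatic estimation relies on.

```r
# Sketch: automated SoupX correction on a Cell Ranger output directory.
# Assumes `cellranger_outs_dir` contains raw and filtered matrices and clustering.
run_soupx <- function(cellranger_outs_dir) {
  sc <- SoupX::load10X(cellranger_outs_dir)  # load droplets, cells, and clusters
  sc <- SoupX::autoEstCont(sc)               # estimate the contamination fraction
  SoupX::adjustCounts(sc)                    # return the corrected count matrix
}
```

If the automatic estimate looks implausible for your tissue, SoupX also lets you set the contamination fraction manually, which is where the parameterization mentioned above comes in.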

CellBender remove‑background takes a different route. It trains a deep generative model that jointly infers which barcodes are real cells and what portion of their counts comes from background ambient RNA (and, in some modes, barcode swapping). It often excels on very large 10x datasets, but it is typically run on the raw, unfiltered Cell Ranger HDF5 matrix and benefits from a GPU. The command‑line interface makes it straightforward to slot into a preprocessing stage before Seurat or Scanpy.

DecontX’s sweet spot is when you trust your clusters or have coarse labels you can supply. By modeling contamination as coming from other observed populations, it tends to preserve rare‑cell markers that might be over‑subtracted by purely global background models, while still shrinking the diffuse halo of high‑abundance markers around dominant cell types. In cross‑platform benchmarks, it reduces spurious cross‑population signal and tightens clusters without flattening meaningful within‑cluster heterogeneity.

In practice, a robust pipeline often uses these tools in sequence or combination. For example, call cells with EmptyDrops, run DecontX to remove population‑informed ambient signal, and—on especially large or messy datasets—validate with SoupX or CellBender to check that marker restoration and cluster structure behave as expected. The goal isn’t to force methods to agree; it’s to ensure that your biological conclusions no longer hinge on background artifacts.

Summary / Takeaways

Ambient RNA is not a niche annoyance; it’s a pervasive confounder in droplet‑based scRNA‑seq that can erode marker specificity, blur clusters, and mislead biological inferences. DecontX offers a principled, Bayesian fix that respects population structure and produces per‑cell contamination estimates you can inspect directly. SoupX and CellBender provide complementary angles (one leveraging empty droplets for a clean ambient profile, the other learning background with a deep generative model), so you can tailor decontamination to your data and compute budget. As a simple next step, run a small end‑to‑end test: call cells, apply DecontX, and compare cluster separations and marker distributions before and after. If the biology snaps into focus, you’ve likely removed an ambient fog you could see but couldn’t name.
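
One simple way to run that before/after comparison is to ask how often a marker shows up outside the cluster where it belongs. The helper below is a hypothetical base R sketch; the marker names, cluster labels, and tiny matrices are made up for illustration, and decontaminated values are rounded because corrected counts are generally not integers.

```r
# Fraction of cells outside the expected cluster with at least one
# (rounded) read for a given marker gene.
off_target_fraction <- function(counts, clusters, marker, expected_cluster) {
  off_target <- clusters != expected_cluster
  mean(round(counts[marker, off_target]) > 0)
}

# Hypothetical toy data: three T cells and two monocytes
clusters <- c("T", "T", "T", "Mono", "Mono")
cells <- paste0("cell", 1:5)
raw <- matrix(c(9, 8, 7, 1, 2,
                0, 1, 0, 6, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("CD3E", "LYZ"), cells))
cleaned <- matrix(c(9, 8, 7, 0.1, 0.3,
                    0, 0.2, 0, 6, 5),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("CD3E", "LYZ"), cells))

before <- off_target_fraction(raw, clusters, "CD3E", "T")
after  <- off_target_fraction(cleaned, clusters, "CD3E", "T")
```

On real data you would pass the raw and decontaminated count matrices from the same object along with your cluster labels, and check a handful of well‑established markers rather than a single gene.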