Multimodal Single-Cell Omics: Data Integration

By EVOBYTE Your partner in bioinformatics

Introduction

Every cell is a story told in multiple languages. RNA transcripts narrate state, chromatin accessibility whispers regulatory intent, proteins reveal phenotype, and spatial context supplies setting and cast. When you measure only one of these voices, you get a compelling chapter. When you integrate them, you get the book.

In the last few years, multimodal single‑cell omics has shifted from a niche capability to a practical path for discovery. Drug hunters use it to triage targets with fewer blind spots. Biomarker teams use it to move beyond differential expression and anchor signatures to cell states, protein phenotypes, and neighborhoods in tissue. And machine learning researchers are training “foundational cell models” that learn generalizable representations from millions of cells, promising reusable building blocks for downstream tasks.

This overview walks through the core modalities—flow and mass cytometry, single‑cell RNA‑seq, single‑cell ATAC‑seq, spatial single‑cell omics, and direct protein measurements—then steps into how data integration actually works, why it changes the pace of drug and biomarker discovery, and how these datasets feed the next generation of cell foundation models.

The modalities, in plain language

Let’s set a shared vocabulary first, keeping the emphasis on what each modality contributes and why integrating them matters.

Flow cytometry and mass cytometry (CyTOF). For decades, flow cytometry has been the workhorse for immune phenotyping, screening millions of cells rapidly by measuring a handful of fluorescence‑tagged proteins. CyTOF swaps fluorophores for metal tags and time‑of‑flight detection, pushing panel sizes into the dozens while largely avoiding the spectral overlap that constrains fluorescence panels. Both are fast, robust, and cost‑effective, especially when you need to quantify cell‑surface and intracellular proteins across large cohorts. Where they struggle is transcriptome‑wide discovery and the subtlety of transcriptional states; that’s where sequencing complements them.

Single‑cell RNA‑seq (scRNA‑seq). This is the microscope for cell state. By counting transcripts per cell, scRNA‑seq reveals differentiation trajectories, activation states, and rare subpopulations you’d likely miss in bulk data. The catch is that RNA levels don’t always mirror protein abundance, especially for surface markers used to define phenotypes in the clinic. That’s why coupling RNA to protein readouts has become so valuable.

Protein measurements in sequencing workflows. CITE‑seq—cellular indexing of transcriptomes and epitopes by sequencing—attaches DNA barcodes to antibodies, letting you sequence protein abundance alongside RNA in the same cell. It bridges the phenotype–state gap, anchoring gene expression clusters to protein‑defined identities that practitioners know and trust from cytometry. It’s become a staple for immune profiling and tumor microenvironment studies, and it sets the stage for truly joint models of RNA and protein.

Single‑cell ATAC‑seq (scATAC‑seq). If RNA is the current state, ATAC‑seq reads the pages the cell has opened: regions of accessible chromatin that hint at which genes can turn on next. Because accessibility often precedes transcription, scATAC‑seq adds temporal nuance. Joint assays now measure accessibility with gene expression (and sometimes proteins) in the same cell, offering a plausible mechanistic thread from regulatory potential to realized state.

Spatial single‑cell omics. Cells don’t act alone. Spatial transcriptomics and related imaging‑based platforms map molecules back to their coordinates in tissue. This restores the context lost in suspension workflows, letting you ask which cell types preferentially co‑locate, which ligands and receptors co‑express across a physical interface, and how gradients of signaling or hypoxia sculpt phenotypes. As resolution and throughput climb, spatial data is no longer garnish; it’s part of the main course for mechanism‑driven discovery.

Taken individually, each modality is powerful. Taken together, they reduce ambiguity. A T cell that looks exhausted by RNA, expresses a specific inhibitory receptor by protein, sits next to macrophages expressing the corresponding ligand, and shows chromatin priming for effector genes is a far more confident therapeutic target or pharmacodynamic readout than any single view could provide.

How multimodal integration actually works

It’s tempting to picture integration as a single button labeled “combine,” but good multimodal analysis is more like ensemble musicians tuning to the same pitch before playing in harmony. Two kinds of integration dominate practice today: co‑assay integration, where multiple modalities are measured in the same cells, and cross‑assay integration, where we align different cells and datasets into a common space.

Co‑assay integration. When RNA and protein (or RNA and chromatin) are measured together, you can learn how much each modality should “speak” for each cell. A widely adopted strategy learns per‑cell modality weights and builds a joint neighborhood graph that respects whichever data type is most informative for that cell. This weighted‑nearest‑neighbors (WNN) concept underpins modern workflows in tools like Seurat for CITE‑seq and multiome data. In practice, it sharpens boundaries between closely related states, reduces misclassification, and improves mapping to references.
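To give a feel for what “per‑cell modality weights” means, here is a deliberately simplified Python sketch. It is not Seurat’s implementation. It only captures the intuition that a modality earns weight for a cell when that cell’s profile is predicted much better by its own‑modality neighbors than by neighbors chosen in the other modality. The function names, the softmax over error ratios, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors of each row (self excluded)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def prediction_error(x, neighbors):
    """Distance between each cell and the mean profile of the given neighbors."""
    return np.linalg.norm(x - x[neighbors].mean(axis=1), axis=1)

def modality_weights(rna, protein, k=10, eps=1e-8):
    """Per-cell (RNA, protein) weights from within- vs cross-modality prediction."""
    nn_rna, nn_prot = knn_indices(rna, k), knn_indices(protein, k)
    # Ratio > 1 means the modality's own neighbors predict the cell better
    # than neighbors picked in the other modality.
    score_rna = prediction_error(rna, nn_prot) / (prediction_error(rna, nn_rna) + eps)
    score_prot = prediction_error(protein, nn_rna) / (prediction_error(protein, nn_prot) + eps)
    scores = np.stack([score_rna, score_prot], axis=1)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for the softmax
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

# Toy data (assumption): RNA embedding separates two populations, protein is pure noise.
rng = np.random.default_rng(0)
centers = np.repeat([[-5.0], [5.0]], 100, axis=0)
rna = centers + rng.normal(size=(200, 5))
protein = rng.normal(size=(200, 5))

weights = modality_weights(rna, protein)
print("mean RNA weight:", weights[:, 0].mean())
```

In this toy setup the RNA column should dominate, because only RNA carries population structure. A real WNN analysis then uses such weights to blend the two similarity measures into one neighbor graph.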

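Cross‑assay integration. Often you won’t have all modalities for every sample. Here, probabilistic latent variable models help. For RNA+protein, a popular approach is totalVI, which jointly models both data types in CITE‑seq data and disentangles biological signal from technical noise such as protein background from ambient antibody capture. It can also be trained on a mix of CITE‑seq and RNA‑only datasets, which is what makes it useful across assays: protein values missing from the RNA‑only cells are imputed from the shared model. The result is a shared latent space you can use for clustering, denoising, batch integration, and differential testing across RNA and proteins simultaneously.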

To make this concrete, here’s what minimal code might look like in R with Seurat’s WNN pipeline for a CITE‑seq experiment. It learns RNA and protein representations separately, then fuses them into a joint graph that respects each modality’s contribution per cell.

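library(Seurat)
# rna = gene x cell counts; adt = protein (antibody) x cell counts
# Both matrices must share the same cell barcodes as column names.
obj <- CreateSeuratObject(counts = rna)
obj[["ADT"]] <- CreateAssayObject(counts = adt)

# RNA: log-normalize, pick variable genes, scale, PCA
obj <- NormalizeData(obj) |> FindVariableFeatures() |> ScaleData() |> RunPCA()

# Protein: CLR-normalize across cells (margin = 2) and run PCA on the full panel.
# The ADT assay has no variable features, so features must be passed explicitly.
adt_features <- rownames(obj[["ADT"]])
n_adt_pcs <- min(18, length(adt_features) - 1)  # cap is a judgment call; small panels support few PCs
obj <- NormalizeData(obj, assay = "ADT", normalization.method = "CLR", margin = 2) |>
       ScaleData(assay = "ADT", features = adt_features) |>
       RunPCA(assay = "ADT", features = adt_features, npcs = n_adt_pcs,
              reduction.name = "apca", reduction.key = "APC_")

# Fuse both reductions into a weighted nearest-neighbor graph, then embed and cluster
obj <- FindMultiModalNeighbors(obj, reduction.list = list("pca", "apca"),
                               dims.list = list(1:30, 1:n_adt_pcs))
obj <- RunUMAP(obj, nn.name = "weighted.nn", reduction.name = "wnn.umap",
               reduction.key = "wnnUMAP_")
obj <- FindClusters(obj, graph.name = "wsnn")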

And here’s a compact Python sketch with scvi‑tools’ totalVI to build a joint latent space and return denoised RNA and protein values:

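import anndata as ad
import scvi

# Assumes raw RNA counts in adata.X and raw protein (ADT) counts in adata.obsm["protein"]
adata = ad.read_h5ad("cite_seq.h5ad")
scvi.model.TOTALVI.setup_anndata(adata, protein_expression_obsm_key="protein")
model = scvi.model.TOTALVI(adata)
model.train(max_epochs=200)

# Joint low-dimensional embedding, one row per cell
latent = model.get_latent_representation()
# Denoised expression averaged over 25 posterior samples; returns (RNA, protein)
rna_denoised, protein_denoised = model.get_normalized_expression(n_samples=25)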

These models don’t just make prettier UMAPs. They enable practical tasks: transfer‑learning cell labels across cohorts, anchoring scRNA‑seq profiles to protein phenotypes used clinically, imputing missing modalities in partially observed datasets, and building robust references that tolerate changes in chemistry or instrument.
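As a rough illustration of the label‑transfer task mentioned above, the following sketch assigns each query cell the majority label of its nearest reference cells in a shared latent space. The `transfer_labels` helper, the choice of k, and the simulated embeddings are assumptions for illustration; in practice the latent coordinates would come from an integration model, and dedicated tools add uncertainty estimates that this sketch omits.

```python
from collections import Counter

import numpy as np

def transfer_labels(ref_latent, ref_labels, query_latent, k=5):
    """Give each query cell the majority label of its k nearest reference cells."""
    predicted = []
    for q in query_latent:
        dist = np.linalg.norm(ref_latent - q, axis=1)
        nearest = np.argsort(dist)[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted

# Simulated 10-dimensional latent space with two well-separated populations (assumption)
rng = np.random.default_rng(1)
ref_latent = np.vstack([rng.normal(0, 0.5, (50, 10)), rng.normal(4, 0.5, (50, 10))])
ref_labels = ["T cell"] * 50 + ["Monocyte"] * 50
query_latent = np.vstack([rng.normal(0, 0.5, (5, 10)), rng.normal(4, 0.5, (5, 10))])

predicted = transfer_labels(ref_latent, ref_labels, query_latent)
print(predicted)
```

The vote only means something if reference and query were embedded by the same model, which is why integration comes before annotation.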

A note on spatial integration. Spatial data adds morphology and microenvironment. Many analyses project single‑cell references into spatial coordinates, deconvolving spots into cell‑type proportions or assigning single‑cell states to pixels. Newer methods also merge histology features and spatial expression to define tissue domains with higher fidelity—think of it as integrating the “what” with the “where” so neighborhood structure informs state calls.
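Spot deconvolution can be sketched as a non‑negative least squares problem: find non‑negative cell‑type weights whose mix of reference signatures best reproduces a spot’s expression. The version below is a minimal illustration using projected gradient descent in NumPy. The signature matrix, the spot, and the `deconvolve_spot` helper are invented for the example, and published deconvolution methods use richer count models than this.

```python
import numpy as np

def deconvolve_spot(signatures, spot, n_iter=500):
    """Estimate cell-type proportions for one spot by non-negative least squares.

    signatures: genes x cell_types mean expression from a single-cell reference.
    spot: expression vector (length = genes) for one spatial spot.
    """
    gram = signatures.T @ signatures
    step = 1.0 / np.linalg.eigvalsh(gram).max()  # 1 / Lipschitz constant of the gradient
    weights = np.zeros(signatures.shape[1])
    for _ in range(n_iter):
        grad = gram @ weights - signatures.T @ spot
        weights = np.maximum(weights - step * grad, 0.0)  # project onto weights >= 0
    return weights / weights.sum()

# Hypothetical marker-like signatures: 6 genes x 3 cell types
signatures = np.array([
    [10.0, 1.0, 1.0],
    [8.0, 0.0, 2.0],
    [1.0, 9.0, 0.0],
    [0.0, 7.0, 1.0],
    [1.0, 1.0, 12.0],
    [2.0, 0.0, 9.0],
])
true_props = np.array([0.6, 0.3, 0.1])
spot = signatures @ true_props  # noise-free mixture for the demo

props = deconvolve_spot(signatures, spot)
print(np.round(props, 3))
```

With a noise‑free mixture the estimate recovers the proportions used to build the spot. Real spots are noisy and the reference never matches perfectly, so treat the proportions as estimates.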

Why multimodal analysis accelerates drug and biomarker discovery

Consider a familiar scene: a translational team is evaluating two immuno‑oncology targets. Both are differentially expressed in tumor‑infiltrating lymphocytes at the RNA level. On paper, they look similar. Multimodal data breaks the tie.

In one dataset, CITE‑seq shows that Target A’s protein is abundant on a subset already marked by exhaustion proteins, and spatial maps place those cells at the tumor–stroma interface where they directly contact myeloid cells expressing a matching ligand. ATAC‑seq reveals accessible enhancers upstream of the target gene and key exhaustion‑program transcription factors, suggesting regulatory commitment rather than a transient spike. When the team overlays patient response data, the spatially enriched Target A signature correlates with non‑response to the current standard, strengthening the rationale that modulating that axis could matter.

Now look at Target B. Its RNA shifts are real, but protein is modest, the chromatin landscape doesn’t show clear priming, and cells expressing it sit in regions with scarce ligand. The prior looks weaker; you may still pursue it, but not as the lead.

Three themes show up repeatedly when teams make these calls with multimodal evidence.

First, specificity. Protein markers validate that your transcriptional cluster is the phenotype you think it is. This is crucial when biomarkers will later be measured by immunoassays or flow cytometry in clinical trials. Methods like CITE‑seq helped standardize that hand‑off and spurred joint RNA–protein models that are now routine.

Second, mechanism. Chromatin accessibility often precedes transcription and links state to upstream regulators. When RNA and ATAC are measured together or thoughtfully aligned, you can connect targets to candidate enhancers, transcription factors, and lineage biases. That candidate causal chain (open enhancer, TF binding motif, induced transcript, translated protein) turns a list of differentially expressed genes (DEGs) into a story about regulation and response.
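One simple way to nominate candidate enhancers from paired RNA and ATAC data is to correlate each nearby peak’s accessibility with the target gene’s expression across cells. The sketch below does this on simulated data. The `peak_gene_correlations` helper and the simulation are assumptions, and real pipelines typically restrict to peaks within a genomic window and test against a background distribution, which is omitted here.

```python
import numpy as np

def peak_gene_correlations(accessibility, expression):
    """Pearson correlation of each peak (column) with one gene's expression across cells."""
    acc = accessibility - accessibility.mean(axis=0)
    expr = expression - expression.mean()
    denom = np.linalg.norm(acc, axis=0) * np.linalg.norm(expr)
    return (acc.T @ expr) / denom

# Simulated cells (assumption): peak 0 drives expression, peaks 1 and 2 are unrelated
rng = np.random.default_rng(2)
n_cells = 300
enhancer = rng.binomial(1, 0.4, n_cells).astype(float)
unrelated = rng.binomial(1, 0.4, (n_cells, 2)).astype(float)
accessibility = np.column_stack([enhancer, unrelated])
expression = 3.0 * enhancer + rng.normal(0, 1, n_cells)

corr = peak_gene_correlations(accessibility, expression)
best_peak = int(np.argmax(corr))
print("correlations:", np.round(corr, 2), "best peak:", best_peak)
```

A high correlation is a lead, not proof of regulation; it ranks peaks for follow‑up with motif analysis or perturbation.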

Third, context. Spatial transcriptomics anchors everything in place. Biomarker panels derived from dissociated cells often perform better when refined by neighborhood information: which cell types co‑occur, which ligand–receptor pairs are face‑to‑face, and where gradients of hypoxia or fibrosis reshape the microenvironment. It’s not just “what is high” but “where and with whom,” which is exactly how therapies succeed or fail in tissue.
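The “where and with whom” question can be made concrete with a small neighborhood statistic: for each cell of one type, what fraction of its neighbors within a radius belong to another type? The sketch below uses made‑up coordinates and a hypothetical `neighbor_fraction` helper. A real analysis would compare the observed value against a permutation null to decide whether co‑location is enriched.

```python
import numpy as np

def neighbor_fraction(coords, cell_types, source, target, radius):
    """Mean fraction of each source cell's neighbors (within radius) that are target cells."""
    coords = np.asarray(coords, dtype=float)
    cell_types = np.asarray(cell_types)
    fractions = []
    for i in np.flatnonzero(cell_types == source):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        neighbors = (dist <= radius) & (np.arange(len(coords)) != i)
        if neighbors.any():
            fractions.append(np.mean(cell_types[neighbors] == target))
    return float(np.mean(fractions)) if fractions else float("nan")

# Made-up tissue layout: T cells sit next to macrophages, tumor cells are far away
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11)]
cell_types = ["T cell", "T cell", "Macrophage", "Macrophage", "Tumor", "Tumor", "Tumor"]

t_to_mac = neighbor_fraction(coords, cell_types, "T cell", "Macrophage", radius=1.5)
t_to_tumor = neighbor_fraction(coords, cell_types, "T cell", "Tumor", radius=1.5)
print(t_to_mac, t_to_tumor)
```

Here two of every three T cell neighbors are macrophages and none are tumor cells, the kind of pattern that would support a ligand–receptor hypothesis between those two types.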

A quick example from the lab. In a two‑week sprint to nominate pharmacodynamic markers, a team combined scRNA‑seq with CITE‑seq in peripheral blood, used totalVI to generate a denoised joint space across donors, and then mapped those states into tumor sections using a spatial workflow. The final panel included one RNA signature tracked by qPCR and two protein markers tracked by flow. When the first in‑vivo study read out, the protein markers shifted earlier and more robustly than the RNA alone—exactly what integration predicted.

The business impact is straightforward: fewer cycles spent chasing fragile signatures, clearer go/no‑go at portfolio gates, and more informative early clinical readouts.

From multimodal datasets to foundational cell models

As the field accumulates millions of single cells across tissues, perturbations, and species, a new pattern has emerged: pretrain once, adapt everywhere. Foundation models for cells aim to learn general representations—embeddings that capture biology—so downstream tasks require minimal supervision.

One influential example is Geneformer, pretrained on roughly 30 million single‑cell transcriptomes. By representing each cell as a rank‑ordered sequence of genes, much like a “sentence,” and training attention‑based architectures on those sequences, Geneformer learns to encode network relationships and hierarchical structure. Fine‑tuned on limited task‑specific data, it improves predictions in tasks from regulatory inference to disease modeling, and offers a practical path when sample sizes are tight.
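To show how an expression profile can become a “sentence,” here is a simplified sketch of rank‑based encoding: scale counts by sequencing depth, divide each gene by a corpus‑wide typical value so that ubiquitously high genes do not dominate, then list the expressed genes from highest to lowest. This illustrates the idea and is not Geneformer’s tokenizer. The `rank_encode` helper, the gene medians, and the example cell are hypothetical.

```python
import numpy as np

def rank_encode(counts, gene_names, gene_medians, max_len=2048):
    """Turn one cell's counts into a gene sequence ordered by normalized expression."""
    counts = np.asarray(counts, dtype=float)
    scaled = counts / counts.sum()               # adjust for sequencing depth
    scaled = scaled / np.asarray(gene_medians)   # down-weight ubiquitously high genes
    order = np.argsort(-scaled, kind="stable")   # highest normalized expression first
    expressed = [i for i in order if counts[i] > 0]
    return [gene_names[i] for i in expressed[:max_len]]

gene_names = ["ACTB", "CD3E", "GZMB", "MS4A1"]
gene_medians = [50.0, 2.0, 1.0, 2.0]  # hypothetical corpus-wide values; only their ratios matter
cell = [100, 8, 6, 0]                 # hypothetical raw counts for one cell

tokens = rank_encode(cell, gene_names, gene_medians)
print(tokens)  # ['GZMB', 'CD3E', 'ACTB']
```

Although ACTB has the most raw counts, it ranks last among the expressed genes because it is high in every cell; the encoding foregrounds genes that distinguish this cell’s state.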

What does this have to do with multimodality? Everything. The most useful representations are the ones that can bridge modalities and contexts. When you align RNA with proteins and chromatin at scale—either via co‑assays like CITE‑seq or by robust cross‑assay mapping—you give foundation models the supervision they need to “understand” how state, phenotype, and regulatory potential tie together. This opens doors to zero‑ or few‑shot annotation, cross‑species mapping, in silico perturbation prediction, and smarter deconvolution of spatial data.

A practical path many teams follow looks like this. First, build or adopt a high‑quality multimodal reference for your tissue of interest—ideally with joint RNA–protein and RNA–ATAC data and matched spatial sections. Second, pretrain or fine‑tune a foundation model on that corpus to capture the specific biology of your domain while retaining general knowledge. Third, use that model to accelerate routine tasks: label transfer to new cohorts, batch integration across sites and chemistries, prioritization of regulator–target links, and fast what‑if simulations of pathway inhibition or activation.

We should be clear‑eyed about limitations. Foundation models are powerful but not oracles; zero‑shot performance can lag in out‑of‑distribution settings, and interpretability requires deliberate design, not wishful thinking. That’s another reason multimodal training data matters: it grounds model abstractions in measured proteins, open chromatin, and tissue geography. When a model’s predicted state change also shifts a protein marker and appears in a specific niche, confidence rises.

Summary / Takeaways

Bringing modalities together is less about stacking more data and more about reducing ambiguity. RNA outlines state. Chromatin accessibility hints at where it’s going. Proteins validate phenotype and connect to clinical assays. Spatial maps place it all in the tissue where biology actually happens. Integration methods like WNN and totalVI make these views coherent at the single‑cell level, enabling sharper annotation, denoising, and transfer across cohorts.

For drug and biomarker teams, multimodal analysis turns differentially expressed lists into mechanisms anchored in phenotype and context. It de‑risks target selection and makes pharmacodynamic signatures more robust. And as these datasets scale, they feed foundational cell models such as Geneformer that promise reusable embeddings and faster iteration across tasks—so long as we keep them grounded in multi‑omic reality.

If you’re planning your next study, start by asking: which uncertainties could another modality resolve? Then design integration in, not as an afterthought. A modest CITE‑seq panel, a small ATAC cohort matched to your RNA, or a few spatial sections can turn a good dataset into a decisive one.
