Single-Cell Proteomics for Data Scientists: Structures & QC

Jonathan Alles

EVOBYTE Digital Biology

Introduction

If you already speak fluent scRNA‑seq, single‑cell proteomics can feel like switching from weather reports to live traffic. Transcripts forecast what a cell plans to do; proteins show the lanes actually moving, the bottlenecks, and the detours that regulation imposes along the way. That practical view is why single‑cell proteomics (SCP) by mass spectrometry is gaining momentum. With careful sample prep, modern instruments, and smarter acquisition, you can now quantify hundreds to thousands of proteins in individual cells, often with better reproducibility than many expect for picogram inputs. And once you see how the data are structured, how to think about missingness, and what normalization and batch effects look like on the protein side, the analysis becomes approachable for any data scientist comfortable with single‑cell matrices. To ground the discussion, we will focus on mass‑spectrometry‑based SCP rather than antibody‑based methods like CITE‑seq, which tag proteins but read them out by sequencing rather than MS. For orientation on the technology landscape and current best practices, recent reviews and recommendations are invaluable guides.

From spectra to matrices: what single‑cell proteomics actually measures

At heart, MS‑based proteomics quantifies peptides (short protein fragments produced by enzymatic digestion) and then infers proteins from those peptide signals. Two acquisition styles dominate. Data‑dependent acquisition (DDA) picks the most intense peptide precursors in each cycle for fragmentation, so lower‑abundance peptides are sampled inconsistently, which increases missingness across runs. Data‑independent acquisition (DIA) fragments every precursor inside predefined m/z windows and then deconvolves the mixed spectra, trading algorithmic complexity for higher completeness. Modern DIA variants like diaPASEF marry trapped ion mobility with DIA to push sensitivity at low input, a useful trait when each “sample” is one cell. In parallel, multiplexing strategies speed things up. Isobaric tags (e.g., TMT) enable pooled analysis with carrier channels (the idea behind early SCoPE‑MS), while newer non‑isobaric multiplexing such as plexDIA increases throughput with library‑free DIA, trimming per‑cell acquisition time and raising within‑plex completeness. Together these innovations make SCP a practical, increasingly standardized workflow.

A concrete example helps. In SCoPE2, single macrophages were processed with an isobaric carrier and analyzed across multiplexed runs. The team reported that, per gene, proteins contributed many more measured molecules than transcripts, improving count statistics at the single‑cell level and revealing macrophage state differences not obvious from RNA alone. That result doesn’t say “proteins are always better”; it does illustrate how the proteome can sharpen boundaries between similar states and capture regulation downstream of mRNA.

On the label‑free side, optimized diaPASEF on high‑sensitivity platforms reduces sampling stochasticity and improves quantitative precision for low‑amount samples, including single‑cell equivalents. In practical terms, that means fewer “Swiss‑cheese” matrices and more reliable cross‑cell comparisons when you avoid tags altogether.

The data structures you’ll meet

Think in layers. Raw MS data yield peptide‑spectrum matches (PSMs). PSMs roll up to peptides, and peptides roll up to protein groups. Each layer can form its own feature‑by‑cell matrix, and the lineage among layers is important for QC and aggregation choices. In TMT experiments, a single LC‑MS run contains multiple labeled samples (channels), so “cell” lives inside a run/channel pair. In label‑free DIA, “cell” often equals “run.” Either way, you’ll carry two tables through analysis: a wide matrix of intensities and a tidy metadata frame describing each cell’s acquisition, batch, preparation plate, tag/plex, and biological condition.
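
To make the run/channel bookkeeping concrete, here is a minimal sketch of the two tables. The column names (`run`, `channel`, `batch`, `condition`), the `run_channel` naming scheme for cell identifiers, and all values are illustrative assumptions, not a standard.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata: one row per cell. In a TMT design a cell is a
# (run, channel) pair; in label-free DIA the channel would be constant.
meta = pd.DataFrame({
    "run": ["run01", "run01", "run02", "run02"],
    "channel": ["126", "127N", "126", "127N"],
    "batch": ["plateA", "plateA", "plateB", "plateB"],
    "condition": ["ctrl", "treated", "ctrl", "treated"],
})
meta["cell_id"] = meta["run"] + "_" + meta["channel"]
meta = meta.set_index("cell_id")

# Hypothetical wide matrix: protein groups x cells, NaN = not quantified.
# The columns are deliberately out of order relative to the metadata.
X = pd.DataFrame(
    [[10.0, 12.0, np.nan, 11.0],
     [np.nan, 8.0, 9.0, np.nan],
     [15.0, 14.0, 16.0, 15.5]],
    index=["P1", "P2", "P3"],
    columns=["run02_126", "run01_126", "run01_127N", "run02_127N"],
)

# Fail early if the two tables disagree, then put cells in metadata order.
mismatched = set(meta.index) ^ set(X.columns)
if mismatched:
    raise ValueError(f"matrix/metadata mismatch: {sorted(mismatched)}")
X = X.loc[:, meta.index]
```

Checking alignment once, up front, is cheap insurance: every later step (centering, batch correction, plotting) silently assumes that column i of the matrix and row i of the metadata describe the same cell.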

Single‑cell proteomics pipelines commonly store these layers in specialized containers that preserve hierarchical links and metadata. In R/Bioconductor, QFeatures and scp formalize this structure and expose standard operations—QC at the feature or cell level, aggregation from PSMs to proteins, normalization, and batch correction—so analyses are reproducible and comparable across datasets. Even if you prefer Python, reading the scp workflow paper will clarify how most SCP datasets are organized and what operations are expected at each step.
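
If you work in Python without such a container, the roll‑up can be approximated with grouped summaries. The sketch below assumes a hypothetical long‑format PSM table and uses the median as the summary function; that is one common choice among several, not necessarily what a given pipeline does by default.

```python
import pandas as pd

# Hypothetical long-format PSM table with log2 intensities.
psms = pd.DataFrame({
    "cell_id": ["c1", "c1", "c1", "c2", "c2", "c2"],
    "peptide": ["PEPA", "PEPA", "PEPB", "PEPA", "PEPB", "PEPC"],
    "protein_group": ["P1", "P1", "P1", "P1", "P1", "P2"],
    "log2_intensity": [10.0, 12.0, 9.0, 11.0, 8.0, 7.0],
})

# PSM -> peptide: median over PSMs of the same peptide in the same cell.
peptides = (
    psms.groupby(["protein_group", "peptide", "cell_id"])["log2_intensity"]
    .median()
    .reset_index()
)

# Peptide -> protein group: median over peptides, pivoted to a wide
# protein-group x cell matrix (absent combinations become NaN).
proteins = (
    peptides.groupby(["protein_group", "cell_id"])["log2_intensity"]
    .median()
    .unstack("cell_id")
)

# Keep the lineage: which peptides support each protein group.
lineage = peptides.groupby("protein_group")["peptide"].unique()
```

Keeping `lineage` next to the protein matrix lets you trace a surprising protein value back to the peptides, and from there to the PSMs, that produced it.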

What about size and sparsity? Depending on chemistry and instrument settings, a typical single‑cell matrix today might contain hundreds to a few thousand proteins per cell. plexDIA, for instance, demonstrated fast, high‑completeness quantification within a plex, which is valuable when downstream methods assume consistent coverage across cells. Keep in mind that completeness across different plexes or batches is typically lower than within a single plex, because each run quantifies a somewhat different set of peptides, so design and normalization are inseparable here.
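
The within‑plex versus across‑plex gap is easy to measure. The sketch below uses a toy matrix and a hypothetical `completeness` helper that counts non‑missing entries among proteins seen at least once in the cells considered; other definitions of completeness exist.

```python
import numpy as np
import pandas as pd

# Hypothetical protein x cell matrix; NaN marks a missing value.
X = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [1.0, 2.0, np.nan, np.nan],
     [np.nan, np.nan, 3.0, 4.0]],
    index=["P1", "P2", "P3"],
    columns=["c1", "c2", "c3", "c4"],
)
# Hypothetical plex assignment for each cell.
plex = pd.Series(["plexA", "plexA", "plexB", "plexB"], index=X.columns)

def completeness(mat: pd.DataFrame) -> float:
    """Fraction of non-missing entries among proteins seen at least once."""
    seen = mat.dropna(how="all")
    return float(seen.notna().to_numpy().mean())

# Completeness inside each plex, then over the pooled matrix.
within = {
    name: completeness(X.loc[:, cells.index])
    for name, cells in plex.groupby(plex)
}
across = completeness(X)
```

In this toy case each plex is fully complete on its own, while the pooled matrix is only two‑thirds complete, because P2 and P3 were each quantified in a single plex.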

Missingness isn’t a nuisance variable

If you come from scRNA‑seq, you already expect sparse matrices. But proteomics missingness has different mechanics and meanings. In DDA, peptides can be missed simply because the instrument never selected them for fragmentation in a given run. In DIA, detection is more systematic, but deconvolution may fail when signals are weak or co‑fragmentation is complex. On top of that, biology matters: if a protein is absent or below the detection limit in a subset of cells, its peptides go missing because of their abundance, which makes those values missing not at random (MNAR).
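
A quick diagnostic for abundance‑dependent missingness is to compare each protein's detection rate with its mean observed intensity. The sketch below uses toy values and computes a Spearman correlation by ranking both series; a strongly positive value is consistent with MNAR behaviour, but it does not by itself separate detection limits from stochastic sampling.

```python
import numpy as np
import pandas as pd

# Hypothetical log2 protein x cell matrix; NaN marks a missing value.
X = pd.DataFrame(
    [[20.0, 21.0, 20.5, 19.5],
     [15.0, np.nan, 15.5, 14.5],
     [12.0, np.nan, np.nan, 11.0],
     [np.nan, 9.0, np.nan, np.nan]],
    index=["P1", "P2", "P3", "P4"],
    columns=["c1", "c2", "c3", "c4"],
)

# Per-protein detection rate and mean of the observed values (NaNs skipped).
detection_rate = X.notna().mean(axis=1)
mean_observed = X.mean(axis=1)

# Spearman correlation = Pearson correlation of the ranks.
spearman = detection_rate.rank().corr(mean_observed.rank())
```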

As pipelines work to raise completeness, they may propagate identifications across runs by matching chromatographic features without collecting MS/MS every time. In the label‑free community this is often called match‑between‑runs (MBR) or more generally peptide identity propagation (PIP). It can dramatically inflate coverage, especially in sparse single‑cell data—but it also carries a non‑trivial false‑match risk if you do not control FDR at the propagation step. Recent work has emphasized explicit FDR control for PIP/MBR as a requirement rather than an optional filter, particularly because SCP can derive a large fraction of identifications from propagation alone. Treat propagated IDs with the same statistical discipline you apply to direct PSMs.
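
A simple way to see how much a dataset leans on propagation is to tabulate, per cell, the share of identifications that came from it, before and after a confidence filter. The sketch assumes a hypothetical identification table in which the search tool reports an `id_type` and a propagation‑aware `q_value` for every entry; real tools differ in what they export, and the 1% threshold is only an example.

```python
import pandas as pd

# Hypothetical identification table: one row per peptide per cell.
ids = pd.DataFrame({
    "cell_id": ["c1", "c1", "c1", "c2", "c2", "c2", "c2"],
    "peptide": ["A", "B", "C", "A", "B", "C", "D"],
    "id_type": ["msms", "propagated", "propagated",
                "msms", "msms", "propagated", "propagated"],
    "q_value": [0.001, 0.005, 0.08, 0.002, 0.004, 0.009, 0.20],
})

Q_MAX = 0.01  # example threshold applied to direct and propagated IDs alike
kept = ids[ids["q_value"] <= Q_MAX]

# Per-cell share of identifications that were propagated.
share_before = ids["id_type"].eq("propagated").groupby(ids["cell_id"]).mean()
share_after = kept["id_type"].eq("propagated").groupby(kept["cell_id"]).mean()
```

If the share drops sharply after filtering, a large part of the apparent coverage was resting on low‑confidence matches.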

This leads to imputation. You will be tempted to “fix” the gaps; proceed cautiously. A missing value here is an absent measurement, not a measured zero, and filling it in asserts an abundance that was never observed. If values are MNAR due to limits of detection, left‑censored models or minimal‑value substitution per protein can be defensible for specific tasks like differential abundance, but global imputation can blur phenotypes and overstate certainty. Better first steps are design‑level strategies that minimize missingness (consistent sample prep, within‑plex comparisons, prioritization or targeted inclusion lists) and acquisition choices like DIA that reduce stochastic dropouts. Methodological work and community guidance published in 2023 lay out these priorities clearly.

Normalization and batch effects: familiar goals, proteomics‑specific tactics

Normalization in SCP aims to separate biological signal from the layered biases of sample prep, chromatography, ionization, and multiplex chemistry. There is no one recipe, but there is a practical playbook that maps well to habits you already have from RNA.

Start by working on the appropriate scale. Intensities are roughly log‑normal, so log2 transformation stabilizes variance. Next, center per cell to correct for global loading differences—median or robust location works well—then address batch with methods that respect experimental design. If you used isobaric tags, normalize within each TMT set first, often by referencing a pooled channel or by median‑centering the reporter ion intensities, and then consider cross‑set correction. For label‑free DIA, global scaling or variance‑stabilizing strategies are common before batch modeling. ComBat‑style empirical Bayes can remove plate‑ or run‑level shifts, but only after you confirm that biological groups are balanced across batches or you model them explicitly; otherwise, you risk erasing real biology.

A short Python sketch shows the flow on a protein‑by‑cell matrix; adapt it to peptides if you prefer to aggregate later.

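import numpy as np
import pandas as pd
# ComBat estimator; verify the class name against your installed version
from neurocombat_sklearn import CombatModel

# X: proteins x cells (raw intensities, NaN = missing)
# meta: one row per cell with cell_id, batch, condition
X = pd.read_csv("protein_abundance.csv", index_col=0)
meta = pd.read_csv("cell_metadata.csv").set_index("cell_id")
X = X.loc[:, meta.index]  # align matrix columns to metadata rows

# assumption: zeros in the file encode missing values, so mask them first
X = X.where(X > 0)
# keep proteins quantified in at least 50% of cells, then log-transform
X = X[X.notna().sum(axis=1) >= 0.5 * X.shape[1]]
X = np.log2(X)

# per-cell median centering (column medians, NaNs skipped)
X = X.sub(X.median(axis=0), axis=1)

# simple left-censor proxy: per-protein minimum minus an offset.
# Done before ComBat because sklearn-style estimators reject NaNs.
mins = X.min(axis=1)
X_imp = X.apply(lambda col: col.fillna(mins - 1.0))

# batch correction: cells as rows, batch as integer codes in a 2-D column.
# If conditions are unbalanced across batches, also supply them as
# covariates (see the library documentation for the exact argument).
batch_codes = pd.factorize(meta["batch"])[0].reshape(-1, 1)
combat = CombatModel()
X_bc = pd.DataFrame(
    combat.fit_transform(X_imp.T.to_numpy(), batch_codes).T,
    index=X.index, columns=X.columns,
)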

Two final notes. First, perform QC and filtering before heavy correction; poor‑quality cells drive artifacts. Second, normalization should match the chemistry. Ratio compression is a known feature of isobaric workflows in the presence of a strong carrier or co‑isolation; if effect sizes look smaller than expected, that may be physics, not biology. Community recommendations published in 2023 emphasize reporting these choices and their rationale alongside results.
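
For isobaric designs, the within‑set step described above can be sketched as follows. The example assumes log2 reporter intensities, one pooled reference channel per TMT set, and hypothetical column and metadata names; median‑centering without a reference channel would be an equally valid variant.

```python
import pandas as pd

# Hypothetical log2 reporter intensities: proteins x channels, two TMT sets.
X = pd.DataFrame(
    [[10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
     [8.0, 8.5, 9.0, 10.0, 10.5, 11.0]],
    index=["P1", "P2"],
    columns=["s1_ref", "s1_c1", "s1_c2", "s2_ref", "s2_c1", "s2_c2"],
)
meta = pd.DataFrame(
    {"tmt_set": ["s1"] * 3 + ["s2"] * 3,
     "is_reference": [True, False, False] * 2},
    index=X.columns,
)

normalized = []
for _, cells in meta.groupby("tmt_set"):
    ref_col = cells[cells["is_reference"]].index[0]
    sample_cols = cells[~cells["is_reference"]].index
    # On the log scale, dividing by the reference becomes a subtraction.
    normalized.append(X[sample_cols].sub(X[ref_col], axis=0))
X_ratio = pd.concat(normalized, axis=1)

# Per-cell median centering to remove remaining loading differences.
X_norm = X_ratio.sub(X_ratio.median(axis=0), axis=1)
```

In the toy data the second set sits three log2 units above the first for P1; after referencing, matching channels in both sets carry identical ratios, so the set‑level offset is gone before any cross‑set correction is attempted.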

QC that moves the needle

Good SCP datasets succeed or fail on sample prep and acquisition consistency, and your QC should reflect that. Begin with per‑cell depth: the number of confidently quantified proteins (after FDR control at the PSM/peptide level) is the protein‑side analogue of “genes per cell.” Watch identification rates and the fraction of signal assigned to known contaminants; unusually high keratin or trypsin peptides, for instance, often mark handling issues. Monitor channel‑level totals and isotopic impurities in TMT experiments, and track carryover and retention‑time stability regardless of chemistry.
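
Two of these per‑cell metrics, depth and contaminant fraction, fit in a few lines. The matrix, the contaminant list, and both thresholds below are illustrative assumptions; sensible cutoffs depend on the instrument, chemistry, and cell type.

```python
import numpy as np
import pandas as pd

# Hypothetical linear-scale intensity matrix (proteins x cells); NaN = not quantified.
X = pd.DataFrame(
    [[100.0, 120.0, np.nan, 90.0],
     [50.0, 60.0, np.nan, 55.0],
     [30.0, np.nan, np.nan, 35.0],
     [20.0, 25.0, 400.0, 15.0]],
    index=["P1", "P2", "P3", "KRT1"],
    columns=["c1", "c2", "c3", "c4"],
)
contaminants = ["KRT1"]  # hypothetical contaminant list (e.g. keratins)

qc = pd.DataFrame({
    # Depth: number of quantified proteins per cell.
    "n_proteins": X.notna().sum(axis=0),
    # Share of total signal assigned to known contaminants.
    "contaminant_fraction": X.loc[contaminants].sum(axis=0) / X.sum(axis=0),
})

MIN_PROTEINS = 3                 # example threshold
MAX_CONTAMINANT_FRACTION = 0.2   # example threshold
qc["keep"] = (
    (qc["n_proteins"] >= MIN_PROTEINS)
    & (qc["contaminant_fraction"] <= MAX_CONTAMINANT_FRACTION)
)
X_qc = X.loc[:, qc["keep"]]
```

Here cell `c3` fails both checks: a single quantified protein, and all of its signal from keratin.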

On the feature side, examine coefficient of variation across technical replicates or across a reference channel if you use one. In label‑free DIA, completeness across runs is the canary in the coal mine; if it craters for a subset of runs, look for blocked emitters or unstable low‑flow chromatography. The newer DIA methods evaluated on high‑sensitivity platforms show that, with tuned parameters and additives that stabilize peptides at low amounts, you can maintain precision even at single‑cell‑equivalent inputs. Use those benchmarks as reality checks when you see odd patterns.
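
A feature‑level CV check might look like the sketch below. It assumes linear‑scale intensities (CVs computed on log values are not comparable) from three hypothetical technical replicates, and it only reports a CV for proteins observed in all of them.

```python
import numpy as np
import pandas as pd

# Hypothetical linear-scale intensities across three technical replicates.
R = pd.DataFrame(
    [[100.0, 110.0, 90.0],
     [50.0, 100.0, 150.0],
     [10.0, np.nan, 12.0]],
    index=["P1", "P2", "P3"],
    columns=["rep1", "rep2", "rep3"],
)

n_obs = R.notna().sum(axis=1)
# Coefficient of variation: sample standard deviation over mean.
cv = R.std(axis=1, ddof=1) / R.mean(axis=1)
# Report a CV only when the protein was seen in every replicate.
cv = cv.where(n_obs == R.shape[1])
median_cv = cv.median()
```

The median CV is a convenient single number to track across runs or plates; a sudden jump is usually worth investigating before any biological interpretation.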

Finally, keep FDR front and center. Apply it at the PSM level, at the peptide level if you aggregate, and at the protein‑inference level. If your pipeline propagates identifications across runs, ensure FDR is controlled for the propagation step too. Single‑cell studies can derive a large share of IDs this way, so error control isn’t optional; it’s the difference between a crisp phenotype and a mirage.
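
For intuition about what "controlling FDR" means operationally, here is a minimal target‑decoy sketch in plain Python. It uses the simplest decoys‑over‑targets estimate and a hypothetical `target_decoy_qvalues` helper; production tools use more refined estimators, and propagated identifications need their own null model, which this example does not attempt.

```python
from typing import List

def target_decoy_qvalues(scores: List[float], is_decoy: List[bool]) -> List[float]:
    """Return one q-value per match from a simple decoys/targets FDR estimate."""
    # Walk matches from best to worst score, tracking the running FDR estimate.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdr_sorted = []
    n_targets = n_decoys = 0
    for i in order:
        if is_decoy[i]:
            n_decoys += 1
        else:
            n_targets += 1
        fdr_sorted.append(n_decoys / max(n_targets, 1))

    # q-value: the smallest FDR at which this match would still be accepted,
    # i.e. the running minimum of the FDR taken from the worst score upward.
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr_sorted[rank])
        qvals[order[rank]] = min(running_min, 1.0)
    return qvals

# Toy example: higher score = better match.
scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
is_decoy = [False, False, True, False, False, True]
qvals = target_decoy_qvalues(scores, is_decoy)

# Accept target matches at a 1% threshold.
accepted = [i for i, q in enumerate(qvals) if q <= 0.01 and not is_decoy[i]]
```

The same logic applies at each level (PSM, peptide, protein group), but the scores and decoys must be defined at that level; filtering PSMs at 1% does not by itself give 1% at the protein level.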

Where protein measurements fit next to scRNA‑seq

Proteins complement transcripts in three practical ways. First, they integrate regulation that RNA cannot see—translation rates, protein degradation, and post‑translational modifications. That means protein levels often lag or diverge from mRNA when cells change state quickly, which is exactly when you need a measurement that reflects functional execution rather than intention. Second, proteins can be more stable and abundant than their transcripts, which improves counting statistics for many genes at single‑cell scale. SCoPE2’s macrophage study provides a helpful reference point for how that improved counting reveals state differences and post‑transcriptional regulation.

Third, SCP and scRNA‑seq can be designed to meet in the middle. If you want per‑cell multi‑omics, antibody‑sequencing methods like CITE‑seq add targeted surface proteins on top of RNA. If you need unbiased proteome coverage, including intracellular proteins and PTMs, mass‑spectrometry SCP is the right lens, and integration becomes computational. Map protein groups to genes carefully, remembering that a protein group can span several isoforms or even several genes when its peptides cannot tell them apart, and that a single protein may report pathway activity more directly than any one transcript. For integration, anchor‑based methods, canonical correlation, or joint embeddings that tolerate missingness work well when you down‑select to robust, consistently observed proteins. When multiplex strategies like plexDIA keep within‑plex completeness near 100%, alignment across matched conditions becomes easier and more statistically honest than trying to impute your way out of design gaps, which is yet another reason to invest in acquisition planning up front.
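
The mapping and down‑selection steps that precede any integration method can be sketched as follows. The accession‑to‑gene dictionary, the semicolon‑separated group names, the 75% detection threshold, and the RNA gene list are all hypothetical; the sketch drops protein groups that map to more than one gene rather than guessing.

```python
import numpy as np
import pandas as pd

# Hypothetical log2 protein-group x cell matrix.
P = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [2.0, np.nan, 2.5, 3.0],
     [5.0, np.nan, np.nan, np.nan],
     [1.5, 1.0, 2.0, 2.5]],
    index=["Q1;Q2", "Q3", "Q4", "Q1;Q3"],
    columns=["c1", "c2", "c3", "c4"],
)
# Hypothetical accession -> gene mapping and scRNA-seq gene list.
acc_to_gene = {"Q1": "GENEA", "Q2": "GENEA", "Q3": "GENEB", "Q4": "GENEC"}
rna_genes = {"GENEA", "GENEB", "GENED"}

def group_to_gene(group: str):
    """Return the gene for a protein group, or None if it is ambiguous."""
    genes = {acc_to_gene[acc] for acc in group.split(";") if acc in acc_to_gene}
    return genes.pop() if len(genes) == 1 else None

gene_of = {group: group_to_gene(group) for group in P.index}
unambiguous = [group for group in P.index if gene_of[group] is not None]
# Note: if several groups mapped to one gene, they would need merging here.
P_gene = P.loc[unambiguous].rename(index=gene_of)

# Keep consistently observed proteins, then intersect with the RNA features.
MIN_DETECTION = 0.75  # example threshold
robust = P_gene[P_gene.notna().mean(axis=1) >= MIN_DETECTION]
shared_genes = [gene for gene in robust.index if gene in rna_genes]
```

The resulting `shared_genes` list is the feature set on which an anchor‑based or correlation‑based integration would operate; everything dropped along the way should be reported, since it defines what the integrated view cannot see.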

As throughput scales, the workflow is converging on patterns already familiar from single‑cell RNA: standardized containers, declared QC thresholds, explicit missingness handling, and published guidelines for reporting. If you model from those expectations and keep an eye on the proteomics‑specific pitfalls—propagation FDR, isobaric ratio compression, and chromatographic stability—you will find that SCP analysis feels much less exotic than it first appears. Community recommendations published in 2023 codify many of these practices and are worth bookmarking before your first analysis sprint.

Summary / Takeaways

For data scientists, single‑cell proteomics is not a completely new language so much as a different dialect of single‑cell analysis. The matrices still have features and cells; the metadata still define batches and conditions; the models still seek signal under structured noise. What’s new is the physics. Peptides, not transcripts, carry the information. Acquisition choice shapes missingness as much as biology does. Multiplexing and propagation can increase completeness but must be balanced by disciplined FDR control. And when you integrate with scRNA‑seq, proteins often sharpen or even revise the story RNA tells, especially for rapid or post‑transcriptionally regulated processes.

If you are planning your first SCP analysis, start with design—decide whether isobaric or DIA better fits your question—then set up your data structures so PSMs, peptides, and proteins stay linked. Normalize on log scale, correct batches with explicit models, and be conservative about imputation. Above all, carry the same rigor you use for scRNA‑seq to the proteome, and let the two modalities challenge each other. The most interesting biology usually appears where they disagree.