Single-Cell Genomics: Human Cell Atlas & CELLxGENE


Jonathan Alles

EVOBYTE Digital Biology

By EVOBYTE Your partner in bioinformatics

Introduction

If you’re building single‑cell pipelines or training models on gene expression, the best starting point is high‑quality, open data. Two free resources stand out for human single‑cell genomics: the Human Cell Atlas (HCA) Data Portal and CZ CELLxGENE Discover/Census. They’re complementary. HCA focuses on curated, consented datasets and emerging “draft atlases” across organs, while CELLxGENE Discover offers a fast, programmatic way to pull harmonized matrices across studies. In this guide, you’ll learn why the Human Cell Atlas exists, what kinds of data you can expect, how to access each portal, and when to favor one over the other.

Why the Human Cell Atlas exists: mapping healthy and diseased tissues

The HCA is an international consortium with a bold, practical goal: map all human cell types across the lifespan to transform how we diagnose and treat disease. Think of it as a reference atlas for the body, integrating transcriptomic, spatial, and other modalities as methods mature. The consortium is organized into Biological Networks, which publish initial atlases (for example lung, nervous system, eye, and organoids) that assemble community datasets into coherent maps. These early atlases are already informing questions from development to immunity, and they set the stage for a first draft spanning 18 organs and systems.

You don’t need to wait for the “final” atlas to get value. The HCA Data Portal already hosts tens of millions of cells across hundreds of projects, aggregated from nearly a thousand labs worldwide. The front page shows live counts and entry points into organs and systems (for example lung, heart, kidney, immune, and more), so you can immediately drill down to a biological domain of interest.

Hands‑on with the Human Cell Atlas Data Portal

The HCA Portal is built for interactive discovery and reproducible export. Start in the Data Explorer, filter by tissue, disease status, assay, donor characteristics, or project, and then export exactly what you selected. You can download a manifest, fetch files with a generated curl command, or push the cohort into a Terra workspace for analysis in notebooks and workflows. This route works well when you want the authoritative files and metadata for a specific cohort and you plan to run your own standard workflows.
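
Once a manifest is downloaded, a quick tally helps confirm that the export matches what you selected. The following is a minimal sketch, assuming the manifest is a tab-separated file with a header row. The column names `file_format` and `file_size` are assumptions for illustration, so check them against the header of your own manifest, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def summarize_manifest(tsv_text, format_col="file_format", size_col="file_size"):
    """Tally file count and total bytes per file format in a manifest TSV.

    Column names are assumptions; adjust them to the header of your manifest.
    """
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        entry = summary[row[format_col]]
        entry["files"] += 1
        entry["bytes"] += int(row[size_col] or 0)  # treat an empty size as 0
    return dict(summary)


# Tiny made-up manifest for illustration (not real HCA data)
example = (
    "file_name\tfile_format\tfile_size\n"
    "a.fastq.gz\tfastq.gz\t1000\n"
    "b.fastq.gz\tfastq.gz\t2500\n"
    "c.h5ad\th5ad\t400\n"
)
print(summarize_manifest(example))
```

For a real export, read the manifest file into a string and pass it to `summarize_manifest`; a large total is a useful cue to push the cohort to Terra instead of downloading it.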

Under the hood, today’s HCA data are indexed and stored in the Terra Data Repository (TDR). If you used the legacy HCA CLI or Matrix Service a few years ago, note the workflow has changed: the CLI and the old matrix service were retired, and exports now rely on Portal‑generated curl commands or Terra workspaces. The upshot is simpler, snapshot‑based releases with clearer provenance and managed access when needed. Downloaded and exported data are governed by the HCA Data Release Policy and use a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

A quick example helps. Suppose you’re studying healthy lung immune cells. In the Portal, filter tissue to “lung” and donor/specimen disease to “normal” or “no disease reported,” preview the selected sample count, and export the files to Terra. From there, you can run Scanpy, Seurat, or WDL workflows like Cumulus at cloud scale without babysitting storage or tooling.

CZ CELLxGENE Discover and the Census: fast, programmatic access at scale

Where HCA is a curated entry point to project‑level data and draft atlases, CZ CELLxGENE Discover is an aggregator and browser designed for speed. Its companion, the CELLxGENE Census, is a versioned, harmonized container of single‑cell RNA data that you can query directly from Python or R. The Census sits in an open S3 bucket and exposes a clean API built on TileDB‑SOMA, so you can slice by tissue, cell type, or disease and stream directly into AnnData or Seurat objects. For model training or exploratory analysis across many studies, this is a big time saver. Long‑term supported “LTS” releases give you stable snapshots, while regular releases keep the data fresh.

Here’s how that looks in practice. You can open the Census, pull a lung subset of healthy human cells, and hand it to Scanpy in a few lines. The AnnData you get back includes harmonized metadata fields such as tissue, cell_type, disease, and more, so downstream filtering feels natural.


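import cellxgene_census as cgc
import scanpy as sc

# Pass census_version (a release date string) to open_soma() for full
# reproducibility; with no argument it opens the default stable release.
with cgc.open_soma() as census:
    adata = cgc.get_anndata(
        census=census,
        organism="Homo sapiens",
        # is_primary_data == True skips cells duplicated across datasets
        obs_value_filter="tissue == 'lung' and disease == 'normal' and is_primary_data == True",
    )

# Standard preprocessing: depth-normalize, log-transform, then PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.pca(adata)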
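
Because a filter like the one above can match a very large number of cells, it can help to look at the cell metadata before pulling any expression values. This is a minimal sketch, assuming the `cellxgene_census` package is installed and the network is reachable. The helper names `build_obs_filter` and `count_cells_by_type` are hypothetical; the `obs.read(...)` call with `value_filter` and `column_names` follows the TileDB-SOMA DataFrame interface that the Census exposes.

```python
def build_obs_filter(tissue, disease="normal", primary_only=True):
    """Build a Census obs value filter string (hypothetical helper)."""
    clauses = [f"tissue == '{tissue}'", f"disease == '{disease}'"]
    if primary_only:
        # Skip cells that are duplicated across datasets
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


def count_cells_by_type(value_filter, census_version="stable"):
    """Read only obs metadata and count matching cells per cell type."""
    import cellxgene_census  # imported lazily; needs network access at call time

    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = (
            census["census_data"]["homo_sapiens"]
            .obs.read(value_filter=value_filter, column_names=["cell_type"])
            .concat()
            .to_pandas()
        )
    counts = obs["cell_type"].value_counts()
    return counts[counts > 0]  # drop unused categories if the column is categorical


lung_filter = build_obs_filter("lung")
print(lung_filter)
# counts = count_cells_by_type(lung_filter)  # uncomment to query the Census
```

Reading a single metadata column is much lighter than materializing an AnnData, so this is a cheap way to decide whether to narrow the filter before calling `get_anndata`.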

You can also work fully offline by syncing the H5AD files of a long‑term release to disk with the AWS CLI. The bucket is public, so the --no-sign-request flag lets you download without an AWS account. That’s handy for batch jobs or air‑gapped environments.

aws s3 sync --no-sign-request \
  s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/h5ads/ ./h5ads/
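
After a sync, the local files can be listed and opened without loading every matrix into memory. This is a minimal sketch, assuming the `anndata` package is installed and the files sit under `./h5ads/` as in the command above; `find_h5ads` and `open_backed` are hypothetical helper names.

```python
from pathlib import Path


def find_h5ads(root):
    """Return all .h5ad files under root, sorted by path."""
    return sorted(Path(root).rglob("*.h5ad"))


def open_backed(path):
    """Open one H5AD in read-only backed mode so X stays on disk."""
    import anndata  # imported lazily so the listing helper has no dependencies

    return anndata.read_h5ad(path, backed="r")


if __name__ == "__main__":
    # List what was synced, with file sizes, before opening anything
    for path in find_h5ads("./h5ads"):
        size_mb = path.stat().st_size / 1e6
        print(f"{path.name}\t{size_mb:.1f} MB")
        # adata = open_backed(path)  # inspect adata.obs without reading X into memory
```

Backed mode keeps the expression matrix on disk, which suits batch jobs that only need metadata or a slice of each dataset.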

Both approaches share a simple idea: keep the heavy lifting in the cloud until you need to compute, then bring down exactly what your analysis requires. The official docs cover additional features like normalized layers, precomputed embeddings, and the SOMA query language for efficient filtering.

Another powerhouse for discovery: EMBL‑EBI Single Cell Expression Atlas

If you prefer re‑analyzed, cross‑species data with interactive gene‑centric search, EMBL‑EBI’s Single Cell Expression Atlas (SCEA) is a great complement. It curates and reprocesses public datasets across many organisms, adds consistent metadata, and exposes intuitive visualizations so you can quickly check where a gene is expressed across cell types and conditions. Releases are documented with what changed, so you can track growth and features over time. For quick hypothesis checks or teaching, SCEA is an easy on‑ramp.

Summary / Takeaways

Open single‑cell resources have matured to the point where you can go from a biological question to a working dataset in minutes. Use the HCA Data Portal when you need authoritative project files, rich metadata, and a clean export path into Terra. Reach for CZ CELLxGENE Discover and the Census when you want harmonized matrices and fast, programmatic slicing across many studies. And keep SCEA in your toolkit for broad, curated exploration and gene‑centric views. With these three, your analysis can start with the right data and scale as your questions evolve. What question will you test first: a cell‑type signature in healthy tissue, or a disease‑specific perturbation across studies?
