
TileDB Arrays for large-scale omics


Jonathan Alles

EVOBYTE Digital Biology


Introduction

Omics research isn’t just “big data.” It’s wide, sparse, and relentlessly multimodal. A single project might mix genome‑wide variants for millions of individuals, single‑cell gene expression for tens of millions of cells, and terabytes of bioimaging—then ask to join all of it with clinical records to train an AI model. If you’ve ever tried to make that fly on a classic tabular data warehouse, you’ve felt the friction: wide tables that keep widening, exploding joins, painful interval queries, and storage bills that grow faster than insight.

This is where TileDB enters the picture. TileDB stores data as multi‑dimensional arrays—dense or sparse—rather than rows and columns. That one design choice changes what’s possible for omics at population and single‑cell scale. In this post, we’ll unpack the foundations behind TileDB’s array model, explain why omics strains traditional warehouses, show concrete wins for genomics and single‑cell workloads, and close with a timely update on the TileDB–Databricks collaboration that’s aiming to make multimodal health data truly AI‑ready.

Why classic tables struggle at omics scale and shape

Tabular warehouses shine when your data fits neatly into records, when most columns are populated, and when common queries scan partitions or use standard indices. Omics flips those assumptions.

First, the shapes are extreme. A landmark single‑cell RNA‑seq dataset with 1.3 million cells needs on the order of tens of gigabytes even in a compressed sparse format, and hundreds of gigabytes if coerced into dense matrices. Pushing matrices of this scale through systems built for relational rows quickly hits memory limits, shatters query plans, or forces heroic engineering to chunk and cache.
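A quick back-of-envelope check makes the claim concrete. The dimensions below match the public 10x Genomics 1.3M-cell mouse brain dataset (1,306,127 cells by 27,998 genes); the ~7% non-zero fraction is an assumption for illustration:

```python
# Back-of-envelope sizing for a 1.3M-cell scRNA-seq matrix
cells, genes = 1_306_127, 27_998           # 10x 1.3M mouse brain dimensions
dense_gb = cells * genes * 4 / 1e9         # float32, fully dense
nnz = int(cells * genes * 0.07)            # assumed ~7% non-zero entries
sparse_gb = nnz * (4 + 4) / 1e9            # CSR: 4-byte value + 4-byte column index
print(f"dense ~{dense_gb:.0f} GB, sparse ~{sparse_gb:.0f} GB")
```

The dense form lands in the hundreds of gigabytes, the compressed sparse form in the tens — exactly the gap described above.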

Second, the data is sparse and irregular. A typical single‑cell expression matrix is mostly zeros—often around 90%—and epigenomic matrices such as scATAC‑seq are even sparser. Storing such matrices in columnar tables wastes space and compute because the storage engine keeps hauling around empty values, while SQL queries struggle to express and optimize the interval and slice operations that scientists actually need.
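You can see the waste directly with NumPy and SciPy. The sketch below builds a small matrix with ~90% zeros (the matrix shape and sparsity threshold are illustrative) and compares dense storage against a CSR encoding:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((5000, 1000)).astype(np.float32)
dense[dense < 0.9] = 0.0                    # ~90% zeros, typical of scRNA-seq
csr = sparse.csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
csr_mb = (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6
print(f"dense {dense_mb:.1f} MB vs CSR {csr_mb:.1f} MB")
```

A columnar table storing every (row, column, value) cell — zeros included — pays the dense cost on disk and again at query time.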

Third, primary omics formats don’t map cleanly to relational schemas. Variant data in VCF or gVCF can balloon to millions of records per sample if you include non‑variant positions for clinical use cases, which wrecks both storage and query performance in row stores. When new samples arrive, you face the dreaded “N+1” problem: every incremental ingest forces costly reshuffles or materializations to keep downstream statistics current.

Finally, multimodality is the rule, not the exception. Biobanks and health systems have genomics, imaging, pathology, and EHR data—all with different access patterns and scales. Forcing everything into a single tabular abstraction creates either silos (because some data stays as files) or brittle ETL pipelines that try to flatten the world.

TileDB foundations: multi‑dimensional arrays for multimodal omics

TileDB takes a different path: everything is an array. An array has dimensions (how you slice data) and attributes (what you store at each coordinate). Arrays can be dense, like a regular grid of image pixels, or sparse, like “only store non‑zero gene counts” or “only store variant records that exist.” Under the hood, TileDB persists these arrays as chunked “tiles” and appends new data as immutable fragments, enabling time travel, incremental updates, and parallel writes without global coordination. It’s cloud‑native from the start, so arrays live comfortably on S3, Azure Blob, or GCS.

Why does this matter? Because arrays let you query the way omics scientists think. Want “all variants on chr7:55M–56M across these 200,000 samples”? That’s a slice on genomic position and sample dimensions. Want “counts for genes 1–10,000 across a subset of cells in cluster 12”? That’s a slice on gene and cell dimensions. The storage engine fetches only the tiles that intersect your slice, and sparse encoding avoids moving zeros you don’t need. The result is less I/O, fewer deserialization penalties, and much faster interactive analysis at scale.

TileDB layers a full platform on top—TileDB Cloud—for governance, sharing, serverless computation, and interoperability. Crucially, it exposes consistent APIs in Python and R and integrates with distributed compute, so you can use the right tool for the job without copying data into new systems. That keeps multimodal projects on a single substrate rather than a patchwork of file systems and ad‑hoc loaders.

Practical wins in genomics and single‑cell with TileDB‑VCF and TileDB‑SOMA

The array idea becomes concrete through domain‑specific libraries that meet researchers where they work.

For population genomics, TileDB‑VCF models variant data as a sparse three‑dimensional array: positions along the genome, samples, and attributes such as alleles and quality metrics. Ingest scales out by splitting parsing across workers with minimal coordination, and updates append fragments rather than rewriting giant monoliths. This lets you refresh cohort summary statistics or add new biobank samples without blowing up storage or waiting hours for compaction jobs. The “N+1” tax that hobbles many pipelines largely disappears.

For single‑cell omics, TileDB‑SOMA implements SOMA, an open, language‑agnostic specification for representing collections of annotated matrices. A SOMA experiment neatly organizes cell metadata, measurement matrices (e.g., RNA counts), and derived results into an array‑native layout. Because it’s sparse and chunked, you can analyze atlas‑scale data without loading everything into memory, and you can interoperate from both Python/Scanpy and R/Seurat on the exact same cloud dataset. That saves weeks of wrangling when consortia share data across teams and languages.

Let’s make that tangible. Imagine a precision‑oncology team that needs to cross‑reference exome variants for 500,000 patients with single‑cell profiles from matched tumors. In a warehouse, you’d normalize VCFs into wide variant tables, stage sparse gene counts into dense or semi‑sparse columns, and fight through joins and shuffles each time you drill into a locus or cell subset. With TileDB, you slice arrays by genomic interval and by cell subset, join on patient IDs at the metadata layer, and stream just the tiles you need. That enables interactive “what‑if” exploration and shortens the path from data to a ranked variant‑gene‑cell hypothesis set.

Here’s a short example showing how you might explore a cohort stored with TileDB‑VCF from Python. You specify a genomic region and a subset of samples, and the library streams back just those slices as Arrow tables or NumPy arrays you can hand to downstream code.

import tiledbvcf

ds = tiledbvcf.Dataset("s3://my-bucket/biobank/variants_tiledbvcf", mode="r")

# Pull variants on chr7 between 55-56 Mb for a subset of samples
region = "chr7:55000000-56000000"
samples = ["SAMPLE_0001", "SAMPLE_0420", "SAMPLE_2026"]

# Stream records as an Arrow table for efficient filtering/compute
records = ds.read_arrow(
    attrs=["sample_name", "pos_start", "alleles", "fmt_DP"],
    regions=[region],
    samples=samples,
)
print(records.slice(0, 5))

For single‑cell data, the SOMA API provides similarly ergonomic access. You can filter cells by metadata, then pull just the slice of the expression matrix you need for a clustering step in Scanpy—without materializing the full matrix in memory.

import tiledbsoma as soma

# Open a SOMA experiment; the obs/var column names below ("patient_id",
# "mt_frac", "var_id") and the "counts" X layer are dataset-specific
exp = soma.Experiment.open("s3://my-bucket/soma/luad_tumor.soma")

# Filter cells by metadata and restrict to a gene panel; both filters
# are pushed down to the storage layer, so only matching tiles move
query = exp.axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="patient_id == 'PT0091' and mt_frac < 0.05"
    ),
    var_query=soma.AxisQuery(
        value_filter="var_id in ['TP53', 'EGFR', 'KRAS', 'ALK']"
    ),
)

# Materialize only the queried slice as an AnnData for Scanpy
adata = query.to_anndata(X_name="counts")
print(adata.shape)

These examples are tiny on purpose, but they scale up—both implementations were built explicitly for biobank‑ and atlas‑size workloads, and they preserve compatibility with community tooling.

Working with Databricks: what the TileDB–Databricks collaboration means

In July 2025, TileDB announced a strategic partnership with Databricks focused on healthcare and life sciences. The goal is straightforward and timely: eliminate multimodal data silos so AI agents and analytic workflows can reason across genomics, imaging, and clinical records without expensive migration or lossy transformation. The integration is bi‑directional, designed to let workloads on the Databricks Data Intelligence Platform operate directly on data stored in TileDB’s high‑performance arrays—and vice versa—with unified governance. At announcement, access was in private preview with features rolling out in phases.

Practically, this matters in three ways. First, governance meets modality. Unity Catalog can continue to police access while TileDB keeps the scientific payload in array form, so you don’t have to “flatten to fit” to stay compliant. Second, compute meets shape. Spark and Photon can orchestrate and accelerate cross‑modal workflows while TileDB serves efficient slices for variant intervals or sparse cell subsets, avoiding wasteful scans. Third, AI meets provenance. Training and evaluating models on “all the data” becomes more than a slide; models can draw from arrays of variants, matrices of single‑cell counts, and semantically rich clinical tables without shuffling terabytes between systems or proliferating copies.

If you already run Databricks for your lakehouse, the partnership points to a more natural division of labor. Keep transactional or tabular analytics in Delta Lake and Databricks SQL, and keep high‑dimensional scientific data in TileDB arrays. Use shared governance and cross‑platform connectors to compose pipelines that respect each modality’s physics. That architecture also helps with total cost of ownership: fewer redundant copies, fewer long‑running ETL jobs that normalize sparse arrays into brittle tables, and a cleaner path from raw data to training‑ready features.

Summary / Takeaways

Omics data is hard on tabular warehouses because it is intrinsically high‑dimensional, sparse, and multimodal. Forcing it into rows and columns leads to wide schemas, heavy joins, poor interval access, and a compounding “N+1” problem as projects grow. Arrays address these pain points head‑on. With TileDB’s multi‑dimensional storage, sparse encoding, and fragment‑based updates, you can slice by genomic region or cell subset, interoperate across Python and R, and govern and share data without copying it across systems.

In genomics, TileDB‑VCF turns cohorts of VCFs into analysis‑ready arrays so you can scale ingest and query without rewriting the world every time you add samples. In single‑cell, TileDB‑SOMA brings a clean, cross‑language structure to annotated matrices, enabling atlas‑scale analyses that fit both memory and minds. And with the TileDB–Databricks collaboration, you no longer have to choose between governance on your lakehouse and performance on scientific arrays—the two are coming together so AI can see the full, multimodal picture.

If your team is wrestling with widening tables and slowing pipelines, the next step is simple: prototype a critical slice query—say, a disease locus across a large cohort, or a tumor‑specific cell subset—using TileDB’s Python or R APIs. Measure I/O and wall‑time against your current warehouse path. The results will tell their own story.
