FACS Clustering in Python

Why this matters

High-dimensional FACS data can contain subtle immune or cell-state populations that are hard to isolate with manual gating alone. Once a panel grows beyond a handful of markers, unsupervised clustering becomes a practical way to find structure in the data, compare samples, and quantify population shifts.

In Python, you can build a clean workflow that reads FCS files, applies basic preprocessing, transforms marker intensities, clusters events, and visualizes the result in a low-dimensional embedding.

Summary

In this tutorial, you will:

  • load multiple FCS files from disk
  • inspect channel names and define a marker panel
  • apply simple preprocessing and density-based gating
  • build an event-by-marker matrix
  • transform intensities with an arcsinh transform
  • cluster cells with a graph-based workflow
  • visualize clusters with UMAP
  • summarize clusters by marker expression and sample abundance

Expected outcome: by the end, you will have cluster labels for each event, a UMAP plot, a cluster-by-marker heatmap, and per-sample cluster proportions you can use for downstream analysis.

Requirements

  • Basic Python, NumPy, and pandas skills
  • Python 3.10+
  • A folder of FCS files with a consistent marker panel
  • 8–16 GB RAM is usually enough for moderate datasets
  • No GPU required

1. Set up the environment

We will use:

  • FlowCal for reading FCS files and simple gating
  • pandas and numpy for tabular work
  • anndata and scanpy for clustering and visualization
  • matplotlib and seaborn for plots
python -m venv .venv
source .venv/bin/activate
pip install flowcal scanpy anndata scikit-learn pandas numpy matplotlib seaborn igraph leidenalg
mkdir -p data results

Place your .fcs files inside the data/ folder.

2. Inspect one FCS file and list channels

Before clustering, inspect one file so you know the exact channel names stored in the FCS header. This is important because the detector names in the file may not match the marker labels you use in the lab.

from pathlib import Path
import pandas as pd
import FlowCal

data_dir = Path("data")
fcs_files = sorted(data_dir.glob("*.fcs"))

if not fcs_files:
    raise FileNotFoundError("No .fcs files found in ./data")

example = FlowCal.io.FCSData(str(fcs_files[0]))

channel_table = pd.DataFrame({
    "channel": example.channels,
    "label": example.channel_labels()
})

print(channel_table)
print(f"Events: {example.shape[0]}")
print(f"Channels: {example.shape[1]}")

What is happening here

  • example.channels shows the internal channel names used for indexing
  • example.channel_labels() often shows more human-readable marker labels
  • you should use this output to build a reliable mapping from marker names to detector channels

3. Define the marker panel

For clustering, use biological marker channels, not technical channels like Time. In most cases, scatter channels are useful for QC and gating, but not for the actual clustering matrix.

Update the detector names on the right-hand side to match your own FCS files.

marker_map = {
    "CD3": "FL1-A",
    "CD4": "FL2-A",
    "CD8": "FL3-A",
    "CD19": "FL4-A",
    "CD14": "FL5-A",
    "CD16": "FL6-A",
    "CD56": "FL7-A",
    "HLA_DR": "FL8-A",
    "CD45RA": "FL9-A",
    "CCR7": "FL10-A",
    "CD25": "FL11-A",
    "CD127": "FL12-A",
}

scatter_channels = ("FSC-A", "SSC-A")
marker_names = list(marker_map.keys())
detector_channels = list(marker_map.values())

Tip

If your files already use marker names directly as channel names, your map may look like this instead:

marker_map = {
    "CD3": "CD3",
    "CD4": "CD4",
    "CD8": "CD8",
}

4. Read FCS files and apply basic preprocessing

Now we will build a reusable function that:

  1. loads one FCS file
  2. converts signal values to instrument-scaled fluorescence units
  3. removes unstable start/end events
  4. removes extreme values
  5. performs a simple density gate on FSC-A and SSC-A
  6. extracts the marker matrix
  7. optionally downsamples very large files for faster iteration
import numpy as np

def preprocess_fcs(
    path,
    marker_map,
    scatter_channels=("FSC-A", "SSC-A"),
    gate_fraction=0.70,
    max_events=30000,
):
    fcs = FlowCal.io.FCSData(str(path))

    # Convert channels using metadata stored in the FCS file
    fcs = FlowCal.transform.to_rfi(fcs)

    # Remove early/late unstable acquisition events
    fcs = FlowCal.gate.start_end(fcs, num_start=300, num_end=100)

    # Remove saturated or extreme values
    fcs = FlowCal.gate.high_low(
        fcs,
        channels=list(marker_map.values()) + list(scatter_channels)
    )

    # Keep the densest cloud in FSC/SSC space
    gated = FlowCal.gate.density2d(
        fcs,
        channels=list(scatter_channels),
        gate_fraction=gate_fraction,
        full_output=True
    ).gated_data

    # Extract only marker channels for clustering
    X = np.asarray(gated[:, list(marker_map.values())], dtype=np.float32)
    df = pd.DataFrame(X, columns=list(marker_map.keys()))
    df["sample"] = path.stem

    # Optional downsampling for responsiveness
    if len(df) > max_events:
        df = df.sample(max_events, random_state=42)

    return df

Load all samples

all_events = pd.concat(
    [preprocess_fcs(path, marker_map) for path in fcs_files],
    ignore_index=True
)

print(all_events.shape)
print(all_events["sample"].value_counts())
all_events.head()

What is happening here

This gives you one combined table where:

  • each row is one event
  • each marker column is one measured feature
  • the sample column tells you which FCS file the event came from

5. Apply an arcsinh transform

Raw fluorescence values usually span a large dynamic range. An arcsinh transform compresses high values while keeping low-end structure interpretable.

A common starting point for fluorescence data is a cofactor around 150, but you should still inspect your marker distributions and adjust if needed.

X_raw = all_events[marker_names].to_numpy(dtype=np.float32)
X_asinh = np.arcsinh(X_raw / 150.0)

transformed = pd.DataFrame(X_asinh, columns=marker_names)
transformed["sample"] = all_events["sample"].values
transformed.head()

Quick sanity check

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 4))
sns.histplot(transformed["CD3"], bins=100)
plt.title("Arcsinh-transformed CD3")
plt.xlabel("asinh(signal / 150)")
plt.tight_layout()
plt.show()

6. Create an AnnData object

Using AnnData makes it easy to keep the event matrix together with sample metadata and clustering results.

import anndata as ad

adata = ad.AnnData(
    X=X_asinh,
    obs=all_events[["sample"]].copy(),
    var=pd.DataFrame(index=marker_names)
)

# Keep a copy of the transformed but unscaled data
adata.layers["asinh"] = adata.X.copy()

print(adata)

Why this is useful

This container lets you store:

  • the marker matrix in adata.X
  • per-event metadata in adata.obs
  • per-marker metadata in adata.var
  • embeddings and clustering results in the same object

7. Run PCA, build a neighbor graph, and cluster

A practical approach for high-dimensional FACS data is:

  1. standardize markers
  2. run PCA
  3. build a nearest-neighbor graph
  4. cluster the graph with Leiden
  5. compute a UMAP embedding for visualization
import scanpy as sc

# Scale each marker to comparable range
sc.pp.scale(adata, max_value=10)

# Number of PCs should usually be smaller than or equal to marker count
n_pcs = min(10, adata.n_vars)

sc.pp.pca(adata, n_comps=n_pcs)
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=n_pcs, metric="euclidean")
sc.tl.leiden(adata, resolution=0.6, key_added="cluster")
sc.tl.umap(adata, min_dist=0.3)

print(adata.obs["cluster"].value_counts().sort_index())

What the main parameters do

  • n_neighbors: larger values make clusters smoother and less local
  • resolution: larger values usually produce more clusters
  • n_pcs: controls how much compressed signal you pass into graph construction

8. Visualize the clusters

A UMAP plot gives a quick view of the discovered structure.

embedding = pd.DataFrame(
    adata.obsm["X_umap"],
    columns=["UMAP1", "UMAP2"],
    index=adata.obs_names
)
embedding["cluster"] = adata.obs["cluster"].astype(str).values
embedding["sample"] = adata.obs["sample"].values

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=embedding,
    x="UMAP1",
    y="UMAP2",
    hue="cluster",
    s=6,
    linewidth=0,
    alpha=0.8,
    palette="tab20"
)
plt.title("UMAP of clustered FACS events")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="cluster")
plt.tight_layout()
plt.show()

Optional: color by sample

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=embedding,
    x="UMAP1",
    y="UMAP2",
    hue="sample",
    s=6,
    linewidth=0,
    alpha=0.8
)
plt.title("UMAP colored by sample")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="sample")
plt.tight_layout()
plt.show()

9. Summarize clusters by marker expression

Clusters are only useful if you can interpret them biologically. A cluster-by-marker mean table is often the fastest way to annotate populations.

cluster_means = (
    pd.DataFrame(
        adata.layers["asinh"],
        columns=marker_names,
        index=adata.obs_names
    )
    .assign(cluster=adata.obs["cluster"].values)
    .groupby("cluster", observed=True)
    .mean()
    .sort_index()
)

cluster_means.head()

Plot a heatmap

plt.figure(figsize=(12, 5))
sns.heatmap(cluster_means, cmap="viridis")
plt.title("Mean marker expression per cluster")
plt.xlabel("Marker")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()

What to look for

  • high CD3 with high CD4 suggests CD4 T cells
  • high CD3 with high CD8 suggests CD8 T cells
  • high CD19 suggests B cells
  • high CD14 suggests monocytes
  • high CD56 with low CD3 suggests NK-like populations

Use your full panel and biology context when assigning names.

10. Compare cluster abundance across samples

This is where clustering becomes especially useful for experimental comparisons.

cluster_props = (
    adata.obs
    .assign(cluster=adata.obs["cluster"].astype(str).values)
    .groupby(["sample", "cluster"], observed=True)
    .size()
    .rename("n_events")
    .reset_index()
)

cluster_props["fraction"] = (
    cluster_props.groupby("sample")["n_events"]
    .transform(lambda s: s / s.sum())
)

prop_table = (
    cluster_props
    .pivot(index="sample", columns="cluster", values="fraction")
    .fillna(0)
    .sort_index()
)

print(prop_table.round(3))

Stacked bar chart

ax = prop_table.plot(
    kind="bar",
    stacked=True,
    figsize=(10, 4),
    colormap="tab20"
)
ax.set_ylabel("Fraction of events")
ax.set_title("Cluster composition by sample")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

11. Optionally annotate clusters with cell-type names

After reviewing the heatmap, you can attach human-readable names.

annotation_map = {
    "0": "CD4 T cells",
    "1": "CD8 T cells",
    "2": "B cells",
    "3": "Monocytes",
    "4": "NK cells",
}

adata.obs["cell_type"] = (
    adata.obs["cluster"]
    .astype(str)
    .map(annotation_map)
    .fillna("unassigned")
)

adata.obs[["sample", "cluster", "cell_type"]].head()

12. Export the results

Save both the full analysis object and simple tables for reporting.

from pathlib import Path

out_dir = Path("results")
out_dir.mkdir(exist_ok=True)

adata.write(out_dir / "facs_clusters.h5ad")
adata.obs.to_csv(out_dir / "event_metadata.csv", index=True)
cluster_means.to_csv(out_dir / "cluster_marker_means.csv")
prop_table.to_csv(out_dir / "cluster_proportions.csv")

print("Saved results to:", out_dir.resolve())

Recap

You now have a complete Python workflow for high-dimensional FACS clustering:

  • FCS files were loaded and lightly gated
  • marker intensities were transformed with arcsinh
  • events were clustered with a graph-based method
  • UMAP was used for visualization
  • cluster marker signatures and sample proportions were summarized
  • outputs were exported for downstream analysis

This pattern is a strong starting point for immunophenotyping, cohort comparisons, and exploratory discovery in multiparameter flow cytometry.

FAQ

1. My channel names do not match the tutorial. What should I change?

Change the values in marker_map so they match the detector channel names from your own FCS files. The easiest way is to print example.channels and example.channel_labels() first, then build your mapping from that output.

2. My clusters look sample-driven instead of biology-driven. What should I do?

This often means the data still contains batch effects or acquisition differences. Start by checking compensation, transform settings, gating consistency, and instrument drift. If you have bead-based calibration available, standardizing intensities before clustering can help a lot. Also inspect the UMAP colored by sample to see whether one run dominates the structure.

3. How do I choose the arcsinh cofactor and Leiden resolution?

Treat both as tunable analysis parameters. A cofactor of 150 is a reasonable starting point for fluorescence data, but inspect marker histograms and known populations. For Leiden, try a small grid such as 0.4, 0.6, 0.8, and 1.0, then choose the setting that gives stable, biologically interpretable groups rather than the largest possible number of clusters.

Further Reading

Leave a Comment