FACS Clustering in Python

Why this matters

High-dimensional FACS data can contain subtle immune or cell-state populations that are hard to isolate with manual gating alone. Once a panel grows beyond a handful of markers, unsupervised clustering becomes a practical way to find structure in the data, compare samples, and quantify population shifts.

In Python, you can build a clean workflow that reads FCS files, applies basic preprocessing, transforms marker intensities, clusters events, and visualizes the result in a low-dimensional embedding.

Summary

In this tutorial, you will:

load multiple FCS files from disk
inspect channel names and define a marker panel
apply simple preprocessing and density-based gating
build an event-by-marker matrix
transform intensities with an arcsinh transform
cluster cells with a graph-based workflow
visualize clusters with UMAP
summarize clusters by marker expression and sample abundance

Expected outcome: by the end, you will have cluster labels for each event, a UMAP plot, a cluster-by-marker heatmap, and per-sample cluster proportions you can use for downstream analysis.

Requirements

Basic Python, NumPy, and pandas skills
Python 3.10+
A folder of FCS files with a consistent marker panel
8–16 GB RAM is usually enough for moderate datasets
No GPU required

1. Set up the environment

We will use:

FlowCal for reading FCS files and simple gating
pandas and numpy for tabular work
anndata and scanpy for clustering and visualization
matplotlib and seaborn for plots

python -m venv .venv
source .venv/bin/activate
pip install flowcal scanpy anndata scikit-learn pandas numpy matplotlib seaborn igraph leidenalg
mkdir -p data results

Place your .fcs files inside the data/ folder.

2. Inspect one FCS file and list channels

Before clustering, inspect one file so you know the exact channel names stored in the FCS header. This is important because the detector names in the file may not match the marker labels you use in the lab.

from pathlib import Path
import pandas as pd
import FlowCal

data_dir = Path("data")
fcs_files = sorted(data_dir.glob("*.fcs"))

if not fcs_files:
    raise FileNotFoundError("No .fcs files found in ./data")

example = FlowCal.io.FCSData(str(fcs_files[0]))

channel_table = pd.DataFrame({
    "channel": example.channels,
    "label": example.channel_labels()
})

print(channel_table)
print(f"Events: {example.shape[0]}")
print(f"Channels: {example.shape[1]}")

What is happening here

example.channels shows the internal channel names used for indexing
example.channel_labels() often shows more human-readable marker labels
you should use this output to build a reliable mapping from marker names to detector channels

3. Define the marker panel

For clustering, use biological marker channels, not technical channels like Time. In most cases, scatter channels are useful for QC and gating, but not for the actual clustering matrix.

Update the detector names on the right-hand side to match your own FCS files.

marker_map = {
    "CD3": "FL1-A",
    "CD4": "FL2-A",
    "CD8": "FL3-A",
    "CD19": "FL4-A",
    "CD14": "FL5-A",
    "CD16": "FL6-A",
    "CD56": "FL7-A",
    "HLA_DR": "FL8-A",
    "CD45RA": "FL9-A",
    "CCR7": "FL10-A",
    "CD25": "FL11-A",
    "CD127": "FL12-A",
}

scatter_channels = ("FSC-A", "SSC-A")
marker_names = list(marker_map.keys())
detector_channels = list(marker_map.values())

Tip

If your files already use marker names directly as channel names, your map may look like this instead:

marker_map = {
    "CD3": "CD3",
    "CD4": "CD4",
    "CD8": "CD8",
}

4. Read FCS files and apply basic preprocessing

Now we will build a reusable function that:

loads one FCS file
converts signal values to instrument-scaled fluorescence units
removes unstable start/end events
removes extreme values
performs a simple density gate on FSC-A and SSC-A
extracts the marker matrix
optionally downsamples very large files for faster iteration

import numpy as np

def preprocess_fcs(
    path,
    marker_map,
    scatter_channels=("FSC-A", "SSC-A"),
    gate_fraction=0.70,
    max_events=30000,
):
    fcs = FlowCal.io.FCSData(str(path))

    # Convert channels using metadata stored in the FCS file
    fcs = FlowCal.transform.to_rfi(fcs)

    # Remove early/late unstable acquisition events
    fcs = FlowCal.gate.start_end(fcs, num_start=300, num_end=100)

    # Remove saturated or extreme values
    fcs = FlowCal.gate.high_low(
        fcs,
        channels=list(marker_map.values()) + list(scatter_channels)
    )

    # Keep the densest cloud in FSC/SSC space
    gated = FlowCal.gate.density2d(
        fcs,
        channels=list(scatter_channels),
        gate_fraction=gate_fraction,
        full_output=True
    ).gated_data

    # Extract only marker channels for clustering
    X = np.asarray(gated[:, list(marker_map.values())], dtype=np.float32)
    df = pd.DataFrame(X, columns=list(marker_map.keys()))
    df["sample"] = path.stem

    # Optional downsampling for responsiveness
    if len(df) > max_events:
        df = df.sample(max_events, random_state=42)

    return df

Load all samples

all_events = pd.concat(
    [preprocess_fcs(path, marker_map) for path in fcs_files],
    ignore_index=True
)

print(all_events.shape)
print(all_events["sample"].value_counts())
all_events.head()

What is happening here

This gives you one combined table where:

each row is one event
each marker column is one measured feature
the sample column tells you which FCS file the event came from

5. Apply an arcsinh transform

Raw fluorescence values usually span a large dynamic range. An arcsinh transform compresses high values while keeping low-end structure interpretable.

A common starting point for fluorescence data is a cofactor around 150, but you should still inspect your marker distributions and adjust if needed.

X_raw = all_events[marker_names].to_numpy(dtype=np.float32)
X_asinh = np.arcsinh(X_raw / 150.0)

transformed = pd.DataFrame(X_asinh, columns=marker_names)
transformed["sample"] = all_events["sample"].values
transformed.head()

Quick sanity check

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 4))
sns.histplot(transformed["CD3"], bins=100)
plt.title("Arcsinh-transformed CD3")
plt.xlabel("asinh(signal / 150)")
plt.tight_layout()
plt.show()

6. Create an AnnData object

Using AnnData makes it easy to keep the event matrix together with sample metadata and clustering results.

import anndata as ad

adata = ad.AnnData(
    X=X_asinh,
    obs=all_events[["sample"]].copy(),
    var=pd.DataFrame(index=marker_names)
)

# Keep a copy of the transformed but unscaled data
adata.layers["asinh"] = adata.X.copy()

print(adata)

Why this is useful

This container lets you store:

the marker matrix in adata.X
per-event metadata in adata.obs
per-marker metadata in adata.var
embeddings and clustering results in the same object

7. Run PCA, build a neighbor graph, and cluster

A practical approach for high-dimensional FACS data is:

standardize markers
run PCA
build a nearest-neighbor graph
cluster the graph with Leiden
compute a UMAP embedding for visualization

import scanpy as sc

# Scale each marker to comparable range
sc.pp.scale(adata, max_value=10)

# Number of PCs should usually be smaller than or equal to marker count
n_pcs = min(10, adata.n_vars)

sc.pp.pca(adata, n_comps=n_pcs)
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=n_pcs, metric="euclidean")
sc.tl.leiden(adata, resolution=0.6, key_added="cluster")
sc.tl.umap(adata, min_dist=0.3)

print(adata.obs["cluster"].value_counts().sort_index())

What the main parameters do

n_neighbors: larger values make clusters smoother and less local
resolution: larger values usually produce more clusters
n_pcs: controls how much compressed signal you pass into graph construction

8. Visualize the clusters

A UMAP plot gives a quick view of the discovered structure.

embedding = pd.DataFrame(
    adata.obsm["X_umap"],
    columns=["UMAP1", "UMAP2"],
    index=adata.obs_names
)
embedding["cluster"] = adata.obs["cluster"].astype(str).values
embedding["sample"] = adata.obs["sample"].values

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=embedding,
    x="UMAP1",
    y="UMAP2",
    hue="cluster",
    s=6,
    linewidth=0,
    alpha=0.8,
    palette="tab20"
)
plt.title("UMAP of clustered FACS events")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="cluster")
plt.tight_layout()
plt.show()

Optional: color by sample

plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=embedding,
    x="UMAP1",
    y="UMAP2",
    hue="sample",
    s=6,
    linewidth=0,
    alpha=0.8
)
plt.title("UMAP colored by sample")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="sample")
plt.tight_layout()
plt.show()

9. Summarize clusters by marker expression

Clusters are only useful if you can interpret them biologically. A cluster-by-marker mean table is often the fastest way to annotate populations.

cluster_means = (
    pd.DataFrame(
        adata.layers["asinh"],
        columns=marker_names,
        index=adata.obs_names
    )
    .assign(cluster=adata.obs["cluster"].values)
    .groupby("cluster", observed=True)
    .mean()
    .sort_index()
)

cluster_means.head()

Plot a heatmap

plt.figure(figsize=(12, 5))
sns.heatmap(cluster_means, cmap="viridis")
plt.title("Mean marker expression per cluster")
plt.xlabel("Marker")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()

What to look for

high CD3 with high CD4 suggests CD4 T cells
high CD3 with high CD8 suggests CD8 T cells
high CD19 suggests B cells
high CD14 suggests monocytes
high CD56 with low CD3 suggests NK-like populations

Use your full panel and biology context when assigning names.

10. Compare cluster abundance across samples

This is where clustering becomes especially useful for experimental comparisons.

cluster_props = (
    adata.obs
    .assign(cluster=adata.obs["cluster"].astype(str).values)
    .groupby(["sample", "cluster"], observed=True)
    .size()
    .rename("n_events")
    .reset_index()
)

cluster_props["fraction"] = (
    cluster_props.groupby("sample")["n_events"]
    .transform(lambda s: s / s.sum())
)

prop_table = (
    cluster_props
    .pivot(index="sample", columns="cluster", values="fraction")
    .fillna(0)
    .sort_index()
)

print(prop_table.round(3))

Stacked bar chart

ax = prop_table.plot(
    kind="bar",
    stacked=True,
    figsize=(10, 4),
    colormap="tab20"
)
ax.set_ylabel("Fraction of events")
ax.set_title("Cluster composition by sample")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

11. Optionally annotate clusters with cell-type names

After reviewing the heatmap, you can attach human-readable names.

annotation_map = {
    "0": "CD4 T cells",
    "1": "CD8 T cells",
    "2": "B cells",
    "3": "Monocytes",
    "4": "NK cells",
}

adata.obs["cell_type"] = (
    adata.obs["cluster"]
    .astype(str)
    .map(annotation_map)
    .fillna("unassigned")
)

adata.obs[["sample", "cluster", "cell_type"]].head()

12. Export the results

Save both the full analysis object and simple tables for reporting.

from pathlib import Path

out_dir = Path("results")
out_dir.mkdir(exist_ok=True)

adata.write(out_dir / "facs_clusters.h5ad")
adata.obs.to_csv(out_dir / "event_metadata.csv", index=True)
cluster_means.to_csv(out_dir / "cluster_marker_means.csv")
prop_table.to_csv(out_dir / "cluster_proportions.csv")

print("Saved results to:", out_dir.resolve())

Recap

You now have a complete Python workflow for high-dimensional FACS clustering:

FCS files were loaded and lightly gated
marker intensities were transformed with arcsinh
events were clustered with a graph-based method
UMAP was used for visualization
cluster marker signatures and sample proportions were summarized
outputs were exported for downstream analysis

This pattern is a strong starting point for immunophenotyping, cohort comparisons, and exploratory discovery in multiparameter flow cytometry.

FAQ

1. My channel names do not match the tutorial. What should I change?

Change the values in marker_map so they match the detector channel names from your own FCS files. The easiest way is to print example.channels and example.channel_labels() first, then build your mapping from that output.

2. My clusters look sample-driven instead of biology-driven. What should I do?

This often means the data still contains batch effects or acquisition differences. Start by checking compensation, transform settings, gating consistency, and instrument drift. If you have bead-based calibration available, standardizing intensities before clustering can help a lot. Also inspect the UMAP colored by sample to see whether one run dominates the structure.

3. How do I choose the arcsinh cofactor and Leiden resolution?

Treat both as tunable analysis parameters. A cofactor of 150 is a reasonable starting point for fluorescence data, but inspect marker histograms and known populations. For Leiden, try a small grid such as 0.4, 0.6, 0.8, and 1.0, then choose the setting that gives stable, biologically interpretable groups rather than the largest possible number of clusters.

Why this matters

Summary

Requirements

1. Set up the environment

2. Inspect one FCS file and list channels

What is happening here

3. Define the marker panel

Tip

4. Read FCS files and apply basic preprocessing

Load all samples

What is happening here

5. Apply an arcsinh transform

Quick sanity check

6. Create an AnnData object

Why this is useful

7. Run PCA, build a neighbor graph, and cluster

What the main parameters do

8. Visualize the clusters

Optional: color by sample

9. Summarize clusters by marker expression

Plot a heatmap

What to look for

10. Compare cluster abundance across samples

Stacked bar chart

11. Optionally annotate clusters with cell-type names

12. Export the results

Recap

FAQ

1. My channel names do not match the tutorial. What should I change?

2. My clusters look sample-driven instead of biology-driven. What should I do?

3. How do I choose the arcsinh cofactor and Leiden resolution?

Further Reading

Related Posts