Why this matters
High-dimensional FACS data can contain subtle immune or cell-state populations that are hard to isolate with manual gating alone. Once a panel grows beyond a handful of markers, unsupervised clustering becomes a practical way to find structure in the data, compare samples, and quantify population shifts.
In Python, you can build a clean workflow that reads FCS files, applies basic preprocessing, transforms marker intensities, clusters events, and visualizes the result in a low-dimensional embedding.
Summary
In this tutorial, you will:
- load multiple FCS files from disk
- inspect channel names and define a marker panel
- apply simple preprocessing and density-based gating
- build an event-by-marker matrix
- transform intensities with an arcsinh transform
- cluster cells with a graph-based workflow
- visualize clusters with UMAP
- summarize clusters by marker expression and sample abundance
Expected outcome: by the end, you will have cluster labels for each event, a UMAP plot, a cluster-by-marker heatmap, and per-sample cluster proportions you can use for downstream analysis.
Requirements
- Basic Python, NumPy, and pandas skills
- Python 3.10+
- A folder of FCS files with a consistent marker panel
- 8–16 GB RAM is usually enough for moderate datasets
- No GPU required
1. Set up the environment
We will use:
FlowCalfor reading FCS files and simple gatingpandasandnumpyfor tabular workanndataandscanpyfor clustering and visualizationmatplotlibandseabornfor plots
python -m venv .venv
source .venv/bin/activate
pip install flowcal scanpy anndata scikit-learn pandas numpy matplotlib seaborn igraph leidenalg
mkdir -p data results
Place your .fcs files inside the data/ folder.
2. Inspect one FCS file and list channels
Before clustering, inspect one file so you know the exact channel names stored in the FCS header. This is important because the detector names in the file may not match the marker labels you use in the lab.
from pathlib import Path
import pandas as pd
import FlowCal
data_dir = Path("data")
fcs_files = sorted(data_dir.glob("*.fcs"))
if not fcs_files:
raise FileNotFoundError("No .fcs files found in ./data")
example = FlowCal.io.FCSData(str(fcs_files[0]))
channel_table = pd.DataFrame({
"channel": example.channels,
"label": example.channel_labels()
})
print(channel_table)
print(f"Events: {example.shape[0]}")
print(f"Channels: {example.shape[1]}")
What is happening here
example.channelsshows the internal channel names used for indexingexample.channel_labels()often shows more human-readable marker labels- you should use this output to build a reliable mapping from marker names to detector channels
3. Define the marker panel
For clustering, use biological marker channels, not technical channels like Time. In most cases, scatter channels are useful for QC and gating, but not for the actual clustering matrix.
Update the detector names on the right-hand side to match your own FCS files.
marker_map = {
"CD3": "FL1-A",
"CD4": "FL2-A",
"CD8": "FL3-A",
"CD19": "FL4-A",
"CD14": "FL5-A",
"CD16": "FL6-A",
"CD56": "FL7-A",
"HLA_DR": "FL8-A",
"CD45RA": "FL9-A",
"CCR7": "FL10-A",
"CD25": "FL11-A",
"CD127": "FL12-A",
}
scatter_channels = ("FSC-A", "SSC-A")
marker_names = list(marker_map.keys())
detector_channels = list(marker_map.values())
Tip
If your files already use marker names directly as channel names, your map may look like this instead:
marker_map = {
"CD3": "CD3",
"CD4": "CD4",
"CD8": "CD8",
}
4. Read FCS files and apply basic preprocessing
Now we will build a reusable function that:
- loads one FCS file
- converts signal values to instrument-scaled fluorescence units
- removes unstable start/end events
- removes extreme values
- performs a simple density gate on
FSC-AandSSC-A - extracts the marker matrix
- optionally downsamples very large files for faster iteration
import numpy as np
def preprocess_fcs(
path,
marker_map,
scatter_channels=("FSC-A", "SSC-A"),
gate_fraction=0.70,
max_events=30000,
):
fcs = FlowCal.io.FCSData(str(path))
# Convert channels using metadata stored in the FCS file
fcs = FlowCal.transform.to_rfi(fcs)
# Remove early/late unstable acquisition events
fcs = FlowCal.gate.start_end(fcs, num_start=300, num_end=100)
# Remove saturated or extreme values
fcs = FlowCal.gate.high_low(
fcs,
channels=list(marker_map.values()) + list(scatter_channels)
)
# Keep the densest cloud in FSC/SSC space
gated = FlowCal.gate.density2d(
fcs,
channels=list(scatter_channels),
gate_fraction=gate_fraction,
full_output=True
).gated_data
# Extract only marker channels for clustering
X = np.asarray(gated[:, list(marker_map.values())], dtype=np.float32)
df = pd.DataFrame(X, columns=list(marker_map.keys()))
df["sample"] = path.stem
# Optional downsampling for responsiveness
if len(df) > max_events:
df = df.sample(max_events, random_state=42)
return df
Load all samples
all_events = pd.concat(
[preprocess_fcs(path, marker_map) for path in fcs_files],
ignore_index=True
)
print(all_events.shape)
print(all_events["sample"].value_counts())
all_events.head()
What is happening here
This gives you one combined table where:
- each row is one event
- each marker column is one measured feature
- the
samplecolumn tells you which FCS file the event came from
5. Apply an arcsinh transform
Raw fluorescence values usually span a large dynamic range. An arcsinh transform compresses high values while keeping low-end structure interpretable.
A common starting point for fluorescence data is a cofactor around 150, but you should still inspect your marker distributions and adjust if needed.
X_raw = all_events[marker_names].to_numpy(dtype=np.float32)
X_asinh = np.arcsinh(X_raw / 150.0)
transformed = pd.DataFrame(X_asinh, columns=marker_names)
transformed["sample"] = all_events["sample"].values
transformed.head()
Quick sanity check
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 4))
sns.histplot(transformed["CD3"], bins=100)
plt.title("Arcsinh-transformed CD3")
plt.xlabel("asinh(signal / 150)")
plt.tight_layout()
plt.show()
6. Create an AnnData object
Using AnnData makes it easy to keep the event matrix together with sample metadata and clustering results.
import anndata as ad
adata = ad.AnnData(
X=X_asinh,
obs=all_events[["sample"]].copy(),
var=pd.DataFrame(index=marker_names)
)
# Keep a copy of the transformed but unscaled data
adata.layers["asinh"] = adata.X.copy()
print(adata)
Why this is useful
This container lets you store:
- the marker matrix in
adata.X - per-event metadata in
adata.obs - per-marker metadata in
adata.var - embeddings and clustering results in the same object
7. Run PCA, build a neighbor graph, and cluster
A practical approach for high-dimensional FACS data is:
- standardize markers
- run PCA
- build a nearest-neighbor graph
- cluster the graph with Leiden
- compute a UMAP embedding for visualization
import scanpy as sc
# Scale each marker to comparable range
sc.pp.scale(adata, max_value=10)
# Number of PCs should usually be smaller than or equal to marker count
n_pcs = min(10, adata.n_vars)
sc.pp.pca(adata, n_comps=n_pcs)
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=n_pcs, metric="euclidean")
sc.tl.leiden(adata, resolution=0.6, key_added="cluster")
sc.tl.umap(adata, min_dist=0.3)
print(adata.obs["cluster"].value_counts().sort_index())
What the main parameters do
n_neighbors: larger values make clusters smoother and less localresolution: larger values usually produce more clustersn_pcs: controls how much compressed signal you pass into graph construction
8. Visualize the clusters
A UMAP plot gives a quick view of the discovered structure.
embedding = pd.DataFrame(
adata.obsm["X_umap"],
columns=["UMAP1", "UMAP2"],
index=adata.obs_names
)
embedding["cluster"] = adata.obs["cluster"].astype(str).values
embedding["sample"] = adata.obs["sample"].values
plt.figure(figsize=(8, 6))
sns.scatterplot(
data=embedding,
x="UMAP1",
y="UMAP2",
hue="cluster",
s=6,
linewidth=0,
alpha=0.8,
palette="tab20"
)
plt.title("UMAP of clustered FACS events")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="cluster")
plt.tight_layout()
plt.show()
Optional: color by sample
plt.figure(figsize=(8, 6))
sns.scatterplot(
data=embedding,
x="UMAP1",
y="UMAP2",
hue="sample",
s=6,
linewidth=0,
alpha=0.8
)
plt.title("UMAP colored by sample")
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="sample")
plt.tight_layout()
plt.show()
9. Summarize clusters by marker expression
Clusters are only useful if you can interpret them biologically. A cluster-by-marker mean table is often the fastest way to annotate populations.
cluster_means = (
pd.DataFrame(
adata.layers["asinh"],
columns=marker_names,
index=adata.obs_names
)
.assign(cluster=adata.obs["cluster"].values)
.groupby("cluster", observed=True)
.mean()
.sort_index()
)
cluster_means.head()
Plot a heatmap
plt.figure(figsize=(12, 5))
sns.heatmap(cluster_means, cmap="viridis")
plt.title("Mean marker expression per cluster")
plt.xlabel("Marker")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()
What to look for
- high
CD3with highCD4suggests CD4 T cells - high
CD3with highCD8suggests CD8 T cells - high
CD19suggests B cells - high
CD14suggests monocytes - high
CD56with lowCD3suggests NK-like populations
Use your full panel and biology context when assigning names.
10. Compare cluster abundance across samples
This is where clustering becomes especially useful for experimental comparisons.
cluster_props = (
adata.obs
.assign(cluster=adata.obs["cluster"].astype(str).values)
.groupby(["sample", "cluster"], observed=True)
.size()
.rename("n_events")
.reset_index()
)
cluster_props["fraction"] = (
cluster_props.groupby("sample")["n_events"]
.transform(lambda s: s / s.sum())
)
prop_table = (
cluster_props
.pivot(index="sample", columns="cluster", values="fraction")
.fillna(0)
.sort_index()
)
print(prop_table.round(3))
Stacked bar chart
ax = prop_table.plot(
kind="bar",
stacked=True,
figsize=(10, 4),
colormap="tab20"
)
ax.set_ylabel("Fraction of events")
ax.set_title("Cluster composition by sample")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
11. Optionally annotate clusters with cell-type names
After reviewing the heatmap, you can attach human-readable names.
annotation_map = {
"0": "CD4 T cells",
"1": "CD8 T cells",
"2": "B cells",
"3": "Monocytes",
"4": "NK cells",
}
adata.obs["cell_type"] = (
adata.obs["cluster"]
.astype(str)
.map(annotation_map)
.fillna("unassigned")
)
adata.obs[["sample", "cluster", "cell_type"]].head()
12. Export the results
Save both the full analysis object and simple tables for reporting.
from pathlib import Path
out_dir = Path("results")
out_dir.mkdir(exist_ok=True)
adata.write(out_dir / "facs_clusters.h5ad")
adata.obs.to_csv(out_dir / "event_metadata.csv", index=True)
cluster_means.to_csv(out_dir / "cluster_marker_means.csv")
prop_table.to_csv(out_dir / "cluster_proportions.csv")
print("Saved results to:", out_dir.resolve())
Recap
You now have a complete Python workflow for high-dimensional FACS clustering:
- FCS files were loaded and lightly gated
- marker intensities were transformed with arcsinh
- events were clustered with a graph-based method
- UMAP was used for visualization
- cluster marker signatures and sample proportions were summarized
- outputs were exported for downstream analysis
This pattern is a strong starting point for immunophenotyping, cohort comparisons, and exploratory discovery in multiparameter flow cytometry.
FAQ
1. My channel names do not match the tutorial. What should I change?
Change the values in marker_map so they match the detector channel names from your own FCS files. The easiest way is to print example.channels and example.channel_labels() first, then build your mapping from that output.
2. My clusters look sample-driven instead of biology-driven. What should I do?
This often means the data still contains batch effects or acquisition differences. Start by checking compensation, transform settings, gating consistency, and instrument drift. If you have bead-based calibration available, standardizing intensities before clustering can help a lot. Also inspect the UMAP colored by sample to see whether one run dominates the structure.
3. How do I choose the arcsinh cofactor and Leiden resolution?
Treat both as tunable analysis parameters. A cofactor of 150 is a reasonable starting point for fluorescence data, but inspect marker histograms and known populations. For Leiden, try a small grid such as 0.4, 0.6, 0.8, and 1.0, then choose the setting that gives stable, biologically interpretable groups rather than the largest possible number of clusters.
Further Reading
- FlowCal documentation — FCS reading, transforms, and gating.
- Scanpy documentation — PCA, neighbors, Leiden clustering, and UMAP.
- AnnData documentation — annotated matrix structure for event-level data.
- UMAP-learn documentation — background and parameter intuition for embeddings.
- scikit-learn mixture models — useful if you want to compare Leiden with probabilistic clustering.

