Using Python for Cell Clustering

Single-cell measurements can reveal the heterogeneity of cell populations within a single condition or across treatments. One way to quantify and analyze heterogeneous subpopulations is to group the cell measurements into discrete clusters and then characterize each cluster.

This tutorial describes how to use Python-based tools with AVITI24 cytoprofiling data to complete cell clustering using the Leiden algorithm. The end result of the tutorial is a Uniform Manifold Approximation and Projection (UMAP) graph that visualizes the heterogeneous cell populations and their assigned clusters in two dimensions.

Before You Begin

Make sure you have the following prerequisites for this tutorial:

Installation instructions for the cytoprofiling package are available at https://github.com/Elembio/cytoprofiling/tree/v1.0.0?tab=readme-ov-file#installation . To install the other required Python packages and enable all prerequisites, run the following commands in a CLI with Anaconda:

conda create --name cell_assignment python=3.10
conda activate cell_assignment
conda install -c conda-forge pandas scanpy leidenalg igraph anaconda ipykernel
python -m ipykernel install --user --name=cell_assignment
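
As an optional check, the following sketch reports whether the packages used in this tutorial can be found in the active environment. The import names in the list are assumptions based on the commands above (for example, that the igraph package is imported as igraph). It is not part of the official installation procedure.

```python
import importlib.util

# Import names assumed from the install commands in this tutorial.
required = ["pandas", "scanpy", "leidenalg", "igraph", "cytoprofiling"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If any package is reported as missing, confirm that the cell_assignment environment is active before repeating the installation commands.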

Load and Configure the Data

In the CLI terminal with Anaconda, run the following command to open Jupyter Notebook.

jupyter notebook

Select New, and then select Notebook.
In the drop-down menu, select the cell_assignment kernel.
To import required packages, copy the following code into the first cell.

import pandas as pd
import scanpy as sc
import cytoprofiling

Copy the following code into a second cell. Set the file_path variable to the path to your RawCellStats.parquet file.
The code reads the RawCellStats.parquet file, normalizes the data per batch, and filters the cells.

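# path to your AVITI24 raw cell table (RawCellStats.parquet)
# the r prefix keeps the backslashes in a Windows path from being read as escape sequences
file_path = r"C:\example\file\path"
# read data
df = pd.read_parquet(file_path)
# normalize per batch and filter cells
df = cytoprofiling.filter_cells(cytoprofiling.normalize_cytoprofiling(df))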
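
Windows paths contain backslashes, which Python treats as the start of escape sequences in ordinary string literals. The sketch below shows two common ways to write such a path safely. The file location is hypothetical and only illustrates the syntax.

```python
from pathlib import PureWindowsPath

# Hypothetical file location; replace it with your own path.
raw_path = r"C:\example\file\RawCellStats.parquet"

# Without the r prefix, "\f" is a single form-feed character,
# not a backslash followed by the letter f.
assert len("\f") == 1 and len(r"\f") == 2

# Forward slashes also work on Windows and need no escaping.
slash_path = "C:/example/file/RawCellStats.parquet"

# Both spellings refer to the same Windows path.
same = PureWindowsPath(raw_path) == PureWindowsPath(slash_path)
print(same)
```

Either spelling can be assigned to file_path.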

To make the output data compatible with Scanpy, add the following code to the second cell.

# convert to anndata object
adata = cytoprofiling.cytoprofiling_to_anndata(df)

For general clustering based on scanpy.tl.leiden, add the following command to the second cell.
The resolution parameter in the command controls the granularity of the clustering. A higher value results in more clusters.

# cluster cells into subpopulations using the leiden algorithm
adata = cytoprofiling.cell_cycle_assignment.cluster_cells(adata, do_log=False, do_norm=False, resolution=1, targets="all")

To compute the UMAP embedding and plot it colored by Leiden cluster, add the following example commands.

# run umap embedding
sc.tl.umap(adata)

# plot umap colored by leiden clusters
sc.pl.umap(
    adata,
    color=["leiden"]
)

Select the Run icon to test the workflow.

Interpreting the UMAP

The UMAP embedding projects all cell measurements from the RawCellStats.parquet file into two dimensions to visualize cell heterogeneity. Each color indicates a cluster assigned by the Leiden algorithm, and each cluster represents a subpopulation of similar cells. Further analysis can quantitatively compare the characteristics underlying these clusters.
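
As one possible starting point for that comparison, the following sketch summarizes cluster sizes and per-cluster feature means with pandas. The table and labels are synthetic, and the column names target_a and target_b are hypothetical. With an AnnData object, adata.to_df() and adata.obs["leiden"] would supply the equivalent table and labels.

```python
import pandas as pd

# Synthetic stand-in for per-cell measurements (hypothetical column names).
features = pd.DataFrame(
    {
        "target_a": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],
        "target_b": [3.0, 2.8, 3.1, 0.5, 0.7, 0.4],
    }
)
# Synthetic cluster labels, one per cell.
labels = pd.Series(["0", "0", "0", "1", "1", "1"], name="leiden")

# Number of cells in each cluster.
cluster_sizes = labels.value_counts().sort_index()

# Mean of every feature within each cluster.
cluster_means = features.groupby(labels).mean()

print(cluster_sizes)
print(cluster_means)
```

Features whose means differ strongly between clusters are candidates for a closer look.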

Example UMAP graph

You can use other Scanpy plotting methods to generate different graphs and analyses. For more information, see the Scanpy API documentation.
