Using Python for Cell Clustering
Single-cell measurements make it possible to examine the heterogeneity of cell populations within a single condition or across treatments. One way to quantify and analyze heterogeneous subpopulations is to group the cell measurements into discrete clusters and then characterize each cluster.
This tutorial describes how to use Python-based tools with AVITI24 cytoprofiling data to cluster cells with the Leiden algorithm. The end result of the tutorial is a Uniform Manifold Approximation and Projection (UMAP) plot that visualizes the heterogeneous cell populations and their assigned clusters in two dimensions.
Before You Begin
Make sure you have the following prerequisites for this tutorial:
- Python v3 or later
- Anaconda Distribution software
- Jupyter
- The following required Python packages: cytoprofiling, pandas, scanpy, leidenalg, igraph, and ipykernel
Installation instructions for the cytoprofiling package are available at https://github.com/Elembio/cytoprofiling/tree/v1.0.0?tab=readme-ov-file#installation . To install the other required Python packages and enable all prerequisites, run the following commands in a CLI with Anaconda:
conda create --name cell_assignment python=3.10
conda activate cell_assignment
conda install -c conda-forge pandas scanpy leidenalg igraph anaconda ipykernel
python -m ipykernel install --user --name=cell_assignment
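Before opening Jupyter, it can help to confirm that the new environment exposes every package the tutorial relies on. The following is a minimal sketch and is not part of the cytoprofiling package: the helper name find_missing_packages is hypothetical, and the list assumes that the import names match the package names pandas, scanpy, leidenalg, igraph, and cytoprofiling.

```python
from importlib.util import find_spec

# Import names the tutorial relies on (assumed to match the installed packages).
REQUIRED_MODULES = ["pandas", "scanpy", "leidenalg", "igraph", "cytoprofiling"]


def find_missing_packages(module_names):
    """Return the top-level module names that cannot be found in the active environment."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_packages(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Run the check from the activated cell_assignment environment. Any name it reports as missing needs to be installed before you continue.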
Load and Configure the Data
- In the CLI terminal with Anaconda, run the following command to open Jupyter Notebook.
jupyter notebook
- Select New, and then select Notebook.
- In the drop-down menu, select the cell_assignment kernel.
- To import required packages, copy the following code into the first cell.
import pandas as pd
import scanpy as sc
import cytoprofiling
- Copy the following code into a second cell. Replace the value of the file_path variable with the path to your RawCellStats.parquet file. The code reads, normalizes, and filters the data from the RawCellStats.parquet file.
# placeholder path to your AVITI24 raw cell table (RawCellStats.parquet)
# the raw string (r"...") keeps the Windows backslashes literal
file_path = r"C:\example\file\path"
# read data
df = pd.read_parquet(file_path)
# normalize per batch and filter cells
df = cytoprofiling.filter_cells(cytoprofiling.normalize_cytoprofiling(df))
- To make the output data compatible with Scanpy, add the following code into the second cell.
# convert to anndata object
adata = cytoprofiling.cytoprofiling_to_anndata(df)
- For general clustering using scanpy.tl.leiden, add the following command into the second cell. The resolution parameter in the command controls the granularity of the clustering. A higher value results in more clusters.
# cluster cells into subpopulations using the leiden algorithm
adata = cytoprofiling.cell_cycle_assignment.cluster_cells(adata, do_log=False, do_norm=False, resolution=1, targets="all")
- Add the following example commands to compute the UMAP embedding and plot it colored by Leiden cluster.
# run umap embedding
sc.tl.umap(adata)
# plot umap colored by leiden clusters
sc.pl.umap(
    adata,
    color=["leiden"],
)
- Select the Run icon to test the workflow.
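To build intuition for why a higher resolution value yields more clusters, the sketch below scores three candidate partitions of a toy graph with a resolution-weighted modularity: for each cluster, the fraction of edges inside the cluster minus the resolution times the squared fraction of total degree in the cluster. Leiden implementations commonly optimize a quality function of this kind, but whether the cluster_cells call uses exactly this one is an assumption. The graph, the candidate partitions, and the helper names are illustrative only. The code does not run the Leiden algorithm; it only compares fixed partitions.

```python
from collections import defaultdict

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by a single edge (2-3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]


def modularity(edges, membership, resolution=1.0):
    """Resolution-weighted modularity of a partition of an undirected graph.

    membership maps each node to its cluster label.
    """
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both ends in the cluster
    degree_sum = defaultdict(int)      # total degree of the nodes in the cluster
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal_edges[membership[u]] += 1
    return sum(
        internal_edges[c] / m - resolution * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical candidate partitions, from coarsest to finest.
PARTITIONS = {
    "one cluster": {n: 0 for n in range(6)},
    "two triangles": {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    "six singletons": {n: n for n in range(6)},
}


def best_partition(resolution):
    """Name of the candidate partition with the highest score at this resolution."""
    return max(
        PARTITIONS,
        key=lambda name: modularity(EDGES, PARTITIONS[name], resolution),
    )


for gamma in (0.1, 1.0, 3.0):
    print(f"resolution={gamma}: best candidate is '{best_partition(gamma)}'")
```

With these candidates, a low resolution favors the single cluster, and larger values favor progressively finer partitions. The same trend is why raising the resolution argument in the clustering step tends to split the data into more clusters.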
Interpreting the UMAP
The UMAP embedding projects the cell measurements from the RawCellStats.parquet file into two dimensions to visualize cell heterogeneity. Each color marks a cluster assigned by the Leiden algorithm and represents a subpopulation of similar cells. Further analysis can quantitatively compare the characteristics underlying these clusters.
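As one possible starting point for that quantitative comparison, the sketch below tabulates cluster sizes and per-condition cluster fractions with pandas. It uses a small made-up table rather than real data. The well column and its values are hypothetical, and the leiden column assumes the cluster labels are stored under Scanpy's default key, which in a real analysis would be read from adata.obs.

```python
import pandas as pd

# Made-up stand-in for adata.obs: one row per cell.
# "leiden" mirrors Scanpy's default cluster-label column; "well" is hypothetical.
obs = pd.DataFrame(
    {
        "well": ["A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"],
        "leiden": ["0", "0", "0", "1", "0", "1", "1", "1"],
    }
)

# Number of cells assigned to each cluster.
cluster_sizes = obs["leiden"].value_counts().sort_index()

# Fraction of each well's cells that falls into each cluster (each row sums to 1).
composition = pd.crosstab(obs["well"], obs["leiden"], normalize="index")

print(cluster_sizes)
print(composition)
```

Normalizing by row makes wells with different cell counts comparable, so a shift in a cluster's fraction between conditions stands out regardless of how many cells each well contains.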
You can use other Scanpy plotting methods to generate different graphs and analyses. For more information, see the Scanpy API documentation.