Clustering 3K PBMCs

In May 2017, this started out as a demonstration that Scanpy would allow to reproduce most of Seurat’s (Satija et al., 2015) guided clustering tutorial. We gratefully acknowledge the authors of Seurat for the tutorial. In the meanwhile, we have added and removed a few pieces.

The data consists in 3K PBMCs from a Healthy Donor and is freely available from 10x Genomics (here from this webpage). On a unix system, you can uncomment and run the following to download and unpack the data

In [1]:
# !mkdir data
# !wget -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz


Download the notebook by clicking on the Edit on GitHub button. On GitHub, you can download using the Raw button via right-click and Save Link As. Alternatively, download the whole scanpy-tutorial repository.


In Jupyter notebooks and lab, you can see the documentation for a python function by hitting SHIFT + TAB. Hit it twice to expand the view.

In [2]:
import numpy as np
import pandas as pd
import scanpy as sc

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
results_file = './write/pbmc3k.h5ad'  # the file that will store the analysis results
scanpy==1.3.7+63.g16c160f anndata==0.6.17+1.ga0cd0c6 numpy==1.14.6 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.1 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
In [3]:
In [4]:
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols for the variable names (variables-axis index)
    cache=True)                                # write a cache file for faster subsequent reading
... reading from cache file ./cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
In [5]:
adata.var_names_make_unique()  # this is unnecessary if using 'gene_ids'
In [6]:
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'


Show those genes that yield the highest fraction of counts in each single cells, across all cells.

In [7]:, n_top=20)

Basic filtering.

In [8]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells

Let us assemple some information about mitochondrial genes, which are important for quality control.

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017):

High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

Note you can also use the function pp.calculate_qc_metrics to compute the fraction of mitochondrial genes and additional measures.

In [9]:
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

A violin plot of the computed quality measures.

In [10]:, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)

Remove cells that have too many mitochondrial genes expressed or too many total counts.

In [11]:, x='n_counts', y='percent_mito'), x='n_counts', y='n_genes')
In [12]:
AnnData object with n_obs × n_vars = 2700 × 13714
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells'

Actually do the filtering.

In [13]:
adata = adata[adata.obs['n_genes'] < 2500, :]
adata = adata[adata.obs['percent_mito'] < 0.05, :]

Total-count normalize (library-size correct) the data matrix \(\mathbf{X}\) to 10,000 reads per cell, so that counts become comparable among cells.

In [14]:
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)

Logarithmize the data.

In [15]:

Set the .raw attribute of AnnData object to the logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object. While many people consider the normalized data matrix as the “relevant data” for visualization and differential testing, some would prefer to store the unnormalized data.

In [16]:
adata.raw = adata

Identify highly-variable genes.

In [17]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
In [18]:

Actually do the filtering.

In [19]:
adata = adata[:, adata.var['highly_variable']]

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.

In [20]:
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
regressing out ['n_counts', 'percent_mito']
    sparse input is densified and may lead to high memory use
    finished (0:00:04.33)

Scale each gene to unit variance. Clip values exceeding standard deviation 10.

In [21]:
sc.pp.scale(adata, max_value=10)

Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.

In [22]:, svd_solver='arpack')

We can make a scatter plot in the PCA coordinates, but we will not use that later on.

In [23]:, color='CST3')

Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function or tSNE In our experience, often, a rough estimate of the number of PCs does fine.

In [24]:, log=True)

Save the result.

In [25]:
In [26]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

Computing the neighborhood graph

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. For the sake of reproducing Seurat’s results, let’s take the following values.

In [27]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished (0:00:02.69) --> added to `.uns['neighbors']`
    'distances', weighted adjacency matrix
    'connectivities', weighted adjacency matrix

Embedding the neighborhood graph

We advertise embedding the graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preservers trajectories. In some ocassions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:

pl.paga(adata, plot=False)  # remove `plot=False` if you want to see the coarse-grained graph
tl.umap(adata, init_pos='paga')
In [28]:
computing UMAP
    finished (0:00:03.54) --> added
    'X_umap', UMAP coordinates (adata.obsm)
In [29]:, color=['CST3', 'NKG7', 'PPBP'])

As we set the .raw attribute of adata, the previous plots showed the “raw” (normalized, logarithmized, but uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you don’t want to use .raw.

In [30]:, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)

Clustering the neighborhood graph

As Seurat and many others, we recommend the Louvain graph-clustering method (community detection based on optimizing modularity). It has been proposed for single-cell data by Levine et al. (2015). Note that Louvain clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

In [31]:
running Louvain clustering
    using the "louvain" package of Traag (2017)
    finished (0:00:00.18) --> found 8 clusters and added
    'louvain', the cluster labels (adata.obs, categorical)

Plot the clusters, which agree quite well with the result of Seurat.

In [32]:, color=['louvain', 'CST3', 'NKG7'])

Save the result.

In [33]:

Finding marker genes

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

In [34]:, 'louvain', method='t-test'), n_genes=25, sharey=False)
ranking genes
    finished (0:00:00.52) --> added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids
In [35]:
sc.settings.verbosity = 2  # reduce the verbosity

The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar. We recommend using the latter in publications, see e.g., Sonison & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for python, the recent diffxpy.

In [36]:, 'louvain', method='wilcoxon'), n_genes=25, sharey=False)
ranking genes
    finished (0:00:03.05)

Save the result.

In [37]:

As an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al. (2018). The essential difference is that here, we use a multi-variate appraoch whereas conventional differential tests are uni-variate. Clark et al. (2014) has more details.

In [38]:, 'louvain', method='logreg'), n_genes=25, sharey=False)
ranking genes
    finished (0:00:02.18)

With the exceptions of IL7R, which is only found by the t-test and FCER1A, which is only found by the other two appraoches, all marker genes are recovered in all approaches.

Louvain Group Markers Cell Type
0 IL7R CD4 T cells
1 CD14, LYZ CD14+ Monocytes
2 MS4A1 B cells
3 CD8A CD8 T cells
4 GNLY, NKG7 NK cells
5 FCGR3A, MS4A7 FCGR3A+ Monocytes
6 FCER1A, CST3 Dendritic Cells
7 PPBP Megakaryocytes

Let us also define a list of marker genes for later reference.

In [39]:
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

Reload the object that has been save with the Wilcoxon Rank-Sum test result.

In [40]:
adata =

Show the 10 top ranked genes per cluster 0, 1, …, 7 in a dataframe.

In [41]:
0 1 2 3 4 5 6 7

Get a table with the scores and groups.

In [42]:
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(5)
0_n 0_p 1_n 1_p 2_n 2_p 3_n 3_p 4_n 4_p 5_n 5_p 6_n 6_p 7_n 7_p
0 RPS12 2.558687e-224 LYZ 9.570840e-250 CD74 2.487145e-183 CCL5 2.432069e-124 NKG7 7.238180e-93 LST1 1.229764e-88 HLA-DPA1 5.422417e-21 PF4 2.264925e-11
1 LDHB 2.944386e-218 S100A9 7.389959e-249 CD79A 1.679730e-170 NKG7 2.002395e-107 GNLY 3.627599e-87 FCER1G 3.430812e-84 HLA-DPB1 7.591860e-21 PPBP 2.264925e-11
2 RPS25 2.118767e-200 S100A8 7.549245e-241 HLA-DRA 6.949695e-167 CST7 2.524340e-82 GZMB 5.161534e-85 AIF1 5.144661e-82 HLA-DRA 1.306768e-19 CALM3 3.591608e-11
3 RPS27 8.086262e-185 TYROBP 1.131467e-222 CD79B 2.569135e-154 B2M 6.714479e-81 CTSW 2.012053e-83 COTL1 4.761217e-81 HLA-DRB1 1.865104e-19 GPX1 6.352631e-11
4 RPS6 1.547193e-183 FTL 2.689949e-215 HLA-DPB1 3.580735e-148 IL32 6.623479e-74 PRF1 1.353268e-82 FCGR3A 8.812368e-79 CD74 5.853161e-19 MYL6 2.432291e-10

Compare to a single cluster.

In [43]:, 'louvain', groups=['0'], reference='1', method='wilcoxon'), groups=['0'], n_genes=20)
ranking genes
    finished (0:00:01.44)

If we want a more detailed view for a certain group, use

In [44]:, groups='0', n_genes=8)

Reload the object that computed differential expression by comparing to the rest of the groups.

In [45]:
adata =
In [46]:, groups='0', n_genes=8)

If you want to compare a certain gene across groups, use the following.

In [47]:, ['CST3', 'NKG7', 'PPBP'], groupby='louvain')

Actually mark the cell types.

In [48]:
new_cluster_names = [
    'CD4 T', 'CD14+ Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A+ Monocytes',
    'Dendritic', 'Megakaryocytes']
adata.rename_categories('louvain', new_cluster_names)
In [49]:, color='louvain', legend_loc='on data', title='', frameon=False, save='.pdf')
saving figure to file ./figures/umap.pdf

Now that we annotated the cell types, let us visualize the marker genes.

In [50]:
ax =, marker_genes, groupby='louvain')

There is also a very compact violin plot.

In [51]:
ax =, marker_genes, groupby='louvain', rotation=90)

During the course of this analysis, the AnnData accumlated the following annotations.

In [52]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
In [53]:
adata.write(results_file, compression='gzip')  # `compression='gzip'` saves disk space, but slows down writing and subsequent reading

Get a rough overview of the file using h5ls, which has many options - for more details see here. The file format might still be subject to further optimization in the future. All reading functions will remain backwards-compatible, though.

If you want to share this file with people who merely want to use it for visualization, a simple way to reduce the file size is by removing the dense scaled and corrected data matrix. The file still contains the raw data used in the visualizations.

In [54]:
adata.X = None
adata.write('./write/pbmc3k_withoutX.h5ad', compression='gzip')

If you want to export to “csv”, you have the following options:

In [55]:
# Export single fields of the annotation of observations
# adata.obs[['n_counts', 'louvain_groups']].to_csv(
#     './write/pbmc3k_corrected_louvain_groups.csv')

# Export single columns of the multidimensional annotation
# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
#     './write/pbmc3k_corrected_X_pca.csv')

# Or export everything except the data using `.write_csvs`.
# Set `skip_data=False` if you also want to export the data.
# adata.write_csvs(results_file[:-5], )