Identification of zero-inflated genes

AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene \(g\), AutoZI notably learns the distribution of a random variable \(\delta_g\) which denotes the probability that gene \(g\) is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.

More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2

Open In Colab

[1]:
import sys

#if True, will install via pypi, else will install from source
stable = True
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and stable:
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and not stable:
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@master#egg=scvi-tools[tutorials]

     |████████████████████████████████| 61kB 4.7MB/s
ERROR: nbclient 0.5.0 has requirement jupyter-client>=6.1.5, but you'll have jupyter-client 5.3.5 which is incompatible.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
     |████████████████████████████████| 153kB 9.4MB/s
     |████████████████████████████████| 122kB 9.8MB/s
     |████████████████████████████████| 112kB 18.4MB/s
     |████████████████████████████████| 51kB 7.1MB/s
     |████████████████████████████████| 3.2MB 20.3MB/s
     |████████████████████████████████| 2.4MB 47.2MB/s
     |████████████████████████████████| 7.7MB 50.1MB/s
     |████████████████████████████████| 8.7MB 53.5MB/s
     |████████████████████████████████| 51kB 7.9MB/s
     |████████████████████████████████| 112kB 47.5MB/s
     |████████████████████████████████| 51kB 8.0MB/s
     |████████████████████████████████| 61kB 9.2MB/s
  Building wheel for scvi-tools (PEP 517) ... done
  Building wheel for loompy (setup.py) ... done
  Building wheel for numpy-groupies (setup.py) ... done
  Building wheel for sinfo (setup.py) ... done

Imports, data loading and preparation

[2]:
import numpy as np
import pandas as pd
import anndata

import scanpy as sc
import scvi
[3]:
pbmc = scvi.data.pbmc_dataset(run_setup_anndata=False)
pbmc.layers["counts"] = pbmc.X.copy()
sc.pp.normalize_total(pbmc, target_sum=10e4)
sc.pp.log1p(pbmc)
pbmc.raw = pbmc
scvi.data.poisson_gene_selection(
    pbmc,
    n_top_genes=1000,
    batch_key="batch",
    subset=True,
    layer="counts",
)
scvi.data.setup_anndata(
    pbmc,
    labels_key="str_labels",
    batch_key="batch",
    layer="counts",
)
INFO      Downloading file at data/gene_info_pbmc.csv
INFO      Downloading file at data/pbmc_metadata.pickle
INFO      Downloading file at data/pbmc8k/filtered_gene_bc_matrices.tar.gz
INFO      Extracting tar file
INFO      Removing extracted data at data/pbmc8k/filtered_gene_bc_matrices
INFO      Downloading file at data/pbmc4k/filtered_gene_bc_matrices.tar.gz
INFO      Extracting tar file
INFO      Removing extracted data at data/pbmc4k/filtered_gene_bc_matrices
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 13324.32it/s]
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 13818.04it/s]
INFO      Using batches from adata.obs["batch"]
INFO      Using labels from adata.obs["str_labels"]
INFO      Using data from adata.layers["counts"]
INFO      Computing library size prior per batch
INFO      Successfully registered anndata object containing 11990 cells, 1000
          genes, 2 batches, 9 labels, and 0 proteins. Also registered 0 extra
          categorical covariates and 0 extra continuous covariates.
INFO      Please do not further modify adata until model is trained.

Analyze gene-specific ZI

In AutoZI, all \(\delta_g\)’s follow a common \(\text{Beta}(\alpha,\beta)\) prior distribution where \(\alpha,\beta \in (0,1)\) and the zero-inflation probability in the ZINB component is bounded below by \(\tau_{\text{dropout}} \in (0,1)\). AutoZI is encoded by the AutoZIVAE class whose inputs, besides the size of the dataset, are \(\alpha\) (alpha_prior), \(\beta\) (beta_prior), \(\tau_{\text{dropout}}\) (minimal_dropout). By default, we set \(\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01\).

Note : we can learn \(\alpha,\beta\) in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None

[4]:
vae = scvi.model.AUTOZI(pbmc)

We fit, for each gene \(g\), an approximate posterior distribution \(q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)\) (with \(\alpha^g,\beta^g \in (0,1)\)) on which we rely. We retrieve \(\alpha^g,\beta^g\) for all genes \(g\) (and \(\alpha,\beta\), if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.

[5]:
vae.train(lr=1e-2, n_epochs=200)
INFO      KL warmup phase exceeds overall training phaseIf your applications
          rely on the posterior quality, consider training for more epochs or
          reducing the kl warmup.
INFO      KL warmup for 400 epochs
INFO      Training is still in warming up phase. If your applications rely on
          the posterior quality, consider training for more epochs or reducing
          the kl warmup.
INFO      Training time:  273 s. / 200 epochs
[6]:
outputs = vae.get_alphas_betas()
alpha_posterior = outputs['alpha_posterior']
beta_posterior = outputs['beta_posterior']

Now that we obtained fitted \(\alpha^g,\beta^g\), different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis \(q(\delta_g < 0.5)\), but also other metrics such as the mean wrt \(q\) of \(\delta_g\) are possible. We focus on the former. We decide that gene \(g\) is ZI if and only if \(q(\delta_g < 0.5)\) is greater than a given threshold, say \(0.5\). We may note that it is equivalent to \(\alpha^g < \beta^g\). From this we can deduce the fraction of predicted ZI genes in the dataset.

[7]:
from scipy.stats import beta

# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5

# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)

# ZI genes
is_zi_pred = (zi_probs > threshold)

print('Fraction of predicted ZI genes :', is_zi_pred.mean())
Fraction of predicted ZI genes : 0.354

We noted that predictions were less accurate for genes \(g\) whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.

[8]:
mask_sufficient_expression = (np.array(pbmc.X.mean(axis=0)) > 1.).reshape(-1)
print('Fraction of genes with avg expression > 1 :', mask_sufficient_expression.mean())
print('Fraction of predicted ZI genes with avg expression > 1 :', is_zi_pred[mask_sufficient_expression].mean())
Fraction of genes with avg expression > 1 : 0.498
Fraction of predicted ZI genes with avg expression > 1 : 0.25903614457831325

Analyze gene-cell-type-specific ZI

One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable \(\delta_{gc}\) for each gene \(g\) and cell type \(c\) which denotes the probability that gene \(g\) is not zero-inflated in cell-type \(c\). The analysis above can be extended to this new scale.

[9]:
# Model definition
vae_genelabel = scvi.model.AUTOZI(
    pbmc,
    dispersion='gene-label',
    zero_inflation='gene-label'
)

# Training
vae_genelabel.train(lr=1e-2, n_epochs=200)

# Retrieve posterior distribution parameters
outputs_genelabel = vae_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel['alpha_posterior']
beta_posterior_genelabel = outputs_genelabel['beta_posterior']
INFO      KL warmup phase exceeds overall training phaseIf your applications
          rely on the posterior quality, consider training for more epochs or
          reducing the kl warmup.
INFO      KL warmup for 400 epochs
INFO      Training is still in warming up phase. If your applications rely on
          the posterior quality, consider training for more epochs or reducing
          the kl warmup.
INFO      Training time:  274 s. / 200 epochs
[10]:
# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5,alpha_posterior_genelabel, beta_posterior_genelabel)

# ZI gene-cell-types
is_zi_pred_genelabel = (zi_probs_genelabel > threshold)

ct = pbmc.obs.str_labels.astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats):
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type]
    print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(),'\n')
Fraction of predicted ZI genes for cell type B cells : 0.481

Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.439

Fraction of predicted ZI genes for cell type CD4 T cells : 0.401

Fraction of predicted ZI genes for cell type CD8 T cells : 0.442

Fraction of predicted ZI genes for cell type Dendritic Cells : 0.418

Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.44

Fraction of predicted ZI genes for cell type Megakaryocytes : 0.456

Fraction of predicted ZI genes for cell type NK cells : 0.452

Fraction of predicted ZI genes for cell type Other : 0.428

[11]:
# With avg expressions > 1
for ind_cell_type, cell_type in zip(codes, cats):
    mask_sufficient_expression = (np.array(pbmc.X[pbmc.obs.str_labels.values.reshape(-1) == cell_type,:].mean(axis=0)) > 1.).reshape(-1)
    print('Fraction of genes with avg expression > 1 for cell type {} :'.format(cell_type),
          mask_sufficient_expression.mean())
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression,ind_cell_type]
    print('Fraction of predicted ZI genes with avg expression > 1 for cell type {} :'.format(cell_type),
          is_zi_pred_genelabel_here.mean(), '\n')
Fraction of genes with avg expression > 1 for cell type B cells : 0.389
Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.4524421593830334

Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.489
Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.38445807770961143

Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.432
Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.27546296296296297

Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.498
Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.40160642570281124

Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.858
Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.41025641025641024

Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.739
Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.41542625169147496

Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.749
Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.44192256341789055

Fraction of genes with avg expression > 1 for cell type NK cells : 0.574
Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.44076655052264807

Fraction of genes with avg expression > 1 for cell type Other : 0.784
Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.4145408163265306