Note
This page was generated from AutoZI_tutorial.ipynb. Interactive online version: .
AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene \(g\), AutoZI notably learns the distribution of a random variable \(\delta_g\) which denotes the probability that gene \(g\) is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.
More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2
[1]:
import sys #if True, will install via pypi, else will install from source stable = True IN_COLAB = "google.colab" in sys.modules if IN_COLAB and stable: !pip install --quiet scvi-tools[tutorials] elif IN_COLAB and not stable: !pip install --quiet --upgrade jsonschema !pip install --quiet git+https://github.com/yoseflab/scvi-tools@master#egg=scvi-tools[tutorials]
|████████████████████████████████| 61kB 4.7MB/s ERROR: nbclient 0.5.0 has requirement jupyter-client>=6.1.5, but you'll have jupyter-client 5.3.5 which is incompatible. Installing build dependencies ... done Getting requirements to build wheel ... done Preparing wheel metadata ... done |████████████████████████████████| 153kB 9.4MB/s |████████████████████████████████| 122kB 9.8MB/s |████████████████████████████████| 112kB 18.4MB/s |████████████████████████████████| 51kB 7.1MB/s |████████████████████████████████| 3.2MB 20.3MB/s |████████████████████████████████| 2.4MB 47.2MB/s |████████████████████████████████| 7.7MB 50.1MB/s |████████████████████████████████| 8.7MB 53.5MB/s |████████████████████████████████| 51kB 7.9MB/s |████████████████████████████████| 112kB 47.5MB/s |████████████████████████████████| 51kB 8.0MB/s |████████████████████████████████| 61kB 9.2MB/s Building wheel for scvi-tools (PEP 517) ... done Building wheel for loompy (setup.py) ... done Building wheel for numpy-groupies (setup.py) ... done Building wheel for sinfo (setup.py) ... done
[2]:
import numpy as np import pandas as pd import anndata import scanpy as sc import scvi
[3]:
pbmc = scvi.data.pbmc_dataset(run_setup_anndata=False) pbmc.layers["counts"] = pbmc.X.copy() sc.pp.normalize_total(pbmc, target_sum=10e4) sc.pp.log1p(pbmc) pbmc.raw = pbmc scvi.data.poisson_gene_selection( pbmc, n_top_genes=1000, batch_key="batch", subset=True, layer="counts", ) scvi.data.setup_anndata( pbmc, labels_key="str_labels", batch_key="batch", layer="counts", )
INFO Downloading file at data/gene_info_pbmc.csv INFO Downloading file at data/pbmc_metadata.pickle INFO Downloading file at data/pbmc8k/filtered_gene_bc_matrices.tar.gz INFO Extracting tar file INFO Removing extracted data at data/pbmc8k/filtered_gene_bc_matrices INFO Downloading file at data/pbmc4k/filtered_gene_bc_matrices.tar.gz INFO Extracting tar file INFO Removing extracted data at data/pbmc4k/filtered_gene_bc_matrices Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 13324.32it/s] Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 13818.04it/s] INFO Using batches from adata.obs["batch"] INFO Using labels from adata.obs["str_labels"] INFO Using data from adata.layers["counts"] INFO Computing library size prior per batch INFO Successfully registered anndata object containing 11990 cells, 1000 genes, 2 batches, 9 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates. INFO Please do not further modify adata until model is trained.
In AutoZI, all \(\delta_g\)’s follow a common \(\text{Beta}(\alpha,\beta)\) prior distribution where \(\alpha,\beta \in (0,1)\) and the zero-inflation probability in the ZINB component is bounded below by \(\tau_{\text{dropout}} \in (0,1)\). AutoZI is encoded by the AutoZIVAE class whose inputs, besides the size of the dataset, are \(\alpha\) (alpha_prior), \(\beta\) (beta_prior), \(\tau_{\text{dropout}}\) (minimal_dropout). By default, we set \(\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01\).
AutoZIVAE
alpha_prior
beta_prior
minimal_dropout
Note : we can learn \(\alpha,\beta\) in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None
alpha_prior = None
beta_prior = None
[4]:
vae = scvi.model.AUTOZI(pbmc)
We fit, for each gene \(g\), an approximate posterior distribution \(q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)\) (with \(\alpha^g,\beta^g \in (0,1)\)) on which we rely. We retrieve \(\alpha^g,\beta^g\) for all genes \(g\) (and \(\alpha,\beta\), if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.
get_alphas_betas
[5]:
vae.train(lr=1e-2, n_epochs=200)
INFO KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup. INFO KL warmup for 400 epochs
INFO Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup. INFO Training time: 273 s. / 200 epochs
[6]:
outputs = vae.get_alphas_betas() alpha_posterior = outputs['alpha_posterior'] beta_posterior = outputs['beta_posterior']
Now that we obtained fitted \(\alpha^g,\beta^g\), different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis \(q(\delta_g < 0.5)\), but also other metrics such as the mean wrt \(q\) of \(\delta_g\) are possible. We focus on the former. We decide that gene \(g\) is ZI if and only if \(q(\delta_g < 0.5)\) is greater than a given threshold, say \(0.5\). We may note that it is equivalent to \(\alpha^g < \beta^g\). From this we can deduce the fraction of predicted ZI genes in the dataset.
[7]:
from scipy.stats import beta # Threshold (or Kzinb/Knb+Kzinb in paper) threshold = 0.5 # q(delta_g < 0.5) probabilities zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior) # ZI genes is_zi_pred = (zi_probs > threshold) print('Fraction of predicted ZI genes :', is_zi_pred.mean())
Fraction of predicted ZI genes : 0.354
We noted that predictions were less accurate for genes \(g\) whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.
[8]:
mask_sufficient_expression = (np.array(pbmc.X.mean(axis=0)) > 1.).reshape(-1) print('Fraction of genes with avg expression > 1 :', mask_sufficient_expression.mean()) print('Fraction of predicted ZI genes with avg expression > 1 :', is_zi_pred[mask_sufficient_expression].mean())
Fraction of genes with avg expression > 1 : 0.498 Fraction of predicted ZI genes with avg expression > 1 : 0.25903614457831325
One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable \(\delta_{gc}\) for each gene \(g\) and cell type \(c\) which denotes the probability that gene \(g\) is not zero-inflated in cell-type \(c\). The analysis above can be extended to this new scale.
[9]:
# Model definition vae_genelabel = scvi.model.AUTOZI( pbmc, dispersion='gene-label', zero_inflation='gene-label' ) # Training vae_genelabel.train(lr=1e-2, n_epochs=200) # Retrieve posterior distribution parameters outputs_genelabel = vae_genelabel.get_alphas_betas() alpha_posterior_genelabel = outputs_genelabel['alpha_posterior'] beta_posterior_genelabel = outputs_genelabel['beta_posterior']
INFO Training is still in warming up phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup. INFO Training time: 274 s. / 200 epochs
[10]:
# q(delta_g < 0.5) probabilities zi_probs_genelabel = beta.cdf(0.5,alpha_posterior_genelabel, beta_posterior_genelabel) # ZI gene-cell-types is_zi_pred_genelabel = (zi_probs_genelabel > threshold) ct = pbmc.obs.str_labels.astype("category") codes = np.unique(ct.cat.codes) cats = ct.cat.categories for ind_cell_type, cell_type in zip(codes, cats): is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type] print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type), is_zi_pred_genelabel_here.mean(),'\n')
Fraction of predicted ZI genes for cell type B cells : 0.481 Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.439 Fraction of predicted ZI genes for cell type CD4 T cells : 0.401 Fraction of predicted ZI genes for cell type CD8 T cells : 0.442 Fraction of predicted ZI genes for cell type Dendritic Cells : 0.418 Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.44 Fraction of predicted ZI genes for cell type Megakaryocytes : 0.456 Fraction of predicted ZI genes for cell type NK cells : 0.452 Fraction of predicted ZI genes for cell type Other : 0.428
[11]:
# With avg expressions > 1 for ind_cell_type, cell_type in zip(codes, cats): mask_sufficient_expression = (np.array(pbmc.X[pbmc.obs.str_labels.values.reshape(-1) == cell_type,:].mean(axis=0)) > 1.).reshape(-1) print('Fraction of genes with avg expression > 1 for cell type {} :'.format(cell_type), mask_sufficient_expression.mean()) is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression,ind_cell_type] print('Fraction of predicted ZI genes with avg expression > 1 for cell type {} :'.format(cell_type), is_zi_pred_genelabel_here.mean(), '\n')
Fraction of genes with avg expression > 1 for cell type B cells : 0.389 Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.4524421593830334 Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.489 Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.38445807770961143 Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.432 Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.27546296296296297 Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.498 Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.40160642570281124 Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.858 Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.41025641025641024 Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.739 Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.41542625169147496 Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.749 Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.44192256341789055 Fraction of genes with avg expression > 1 for cell type NK cells : 0.574 Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.44076655052264807 Fraction of genes with avg expression > 1 for cell type Other : 0.784 Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.4145408163265306