Note

This page was generated from scarches_scvi_tools.ipynb. Interactive online version: Colab badge.

Online update of scvi-tools models with query datasets

This tutorial covers the usage of the scArches method with SCVI, SCANVI, and TOTALVI.

This particular workflow is useful in the case where a model is trained on some data (called reference here) and new samples are received (called query). The goal is to analyze these samples in the context of the reference, by mapping the query cells to the same reference latent space. This workflow may also be used in the scarches package, but here we demonstrate using only scvi-tools.

Imports and scvi-tools installation (colab)

[1]:
import sys

# If True, will install via pypi, else will install from source
stable = True
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and stable:
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and not stable:
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@master#egg=scvi-tools[tutorials]

if IN_COLAB:
    !pip install --quiet scrublet
     |████████████████████████████████| 61kB 6.4MB/s
ERROR: nbclient 0.5.1 has requirement jupyter-client>=6.1.5, but you'll have jupyter-client 5.3.5 which is incompatible.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
     |████████████████████████████████| 122kB 14.3MB/s
     |████████████████████████████████| 245kB 21.6MB/s
     |████████████████████████████████| 184kB 27.3MB/s
     |████████████████████████████████| 2.4MB 27.3MB/s
     |████████████████████████████████| 51kB 7.4MB/s
     |████████████████████████████████| 8.7MB 25.9MB/s
     |████████████████████████████████| 3.2MB 57.6MB/s
     |████████████████████████████████| 7.7MB 23.7MB/s
     |████████████████████████████████| 51kB 7.2MB/s
     |████████████████████████████████| 112kB 60.0MB/s
     |████████████████████████████████| 71kB 10.2MB/s
     |████████████████████████████████| 51kB 7.8MB/s
  Building wheel for scvi-tools (PEP 517) ... done
  Building wheel for loompy (setup.py) ... done
  Building wheel for numpy-groupies (setup.py) ... done
  Building wheel for sinfo (setup.py) ... done
     |████████████████████████████████| 655kB 12.9MB/s
  Building wheel for annoy (setup.py) ... done
[2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import anndata
import scvi
import scanpy as sc

sc.set_figure_params(figsize=(4, 4))

Online update of SCVI

Here we use the pancreas dataset described in the scArches manuscript, that is also widely used to benchmark integration methods.

[3]:
!gdown --id 1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd --output pancreas.h5ad
Downloading...
From: https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd
To: /content/pancreas.h5ad
126MB [00:00, 305MB/s]
[4]:
adata_all = sc.read("pancreas.h5ad")
adata = adata_all.raw.to_adata()
print(adata)
AnnData object with n_obs × n_vars = 15681 × 1000
    obs: 'batch', 'study', 'cell_type', 'size_factors'
[5]:
adata.obs.study.value_counts()
[5]:
Pancreas inDrop         8391
Pancreas SS2            2961
Pancreas CelSeq2        2426
Pancreas CelSeq         1271
Pancreas Fluidigm C1     632
Name: study, dtype: int64

We consider the SS2 and CelSeq2 samples as query, and all the others as reference.

[6]:
query = np.array([s in ["Pancreas SS2", "Pancreas CelSeq2"] for s in adata.obs.study])

adata_ref = adata[~query].copy()
adata_query = adata[query].copy()

Train reference

We train the reference using the standard SCVI workflow, except we add a few non-default parameters that were identified to work well with scArches.

[7]:
scvi.data.setup_anndata(adata_ref, batch_key="study")
INFO     Using batches from adata.obs["study"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 10294 cells, 1000
         vars, 3 batches, 1 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.
[8]:
arches_params = dict(
    use_layer_norm="both",
    use_batch_norm="none",
    encode_covariates=True,
    dropout_rate=0.2,
    n_layers=2,
)

vae_ref = scvi.model.SCVI(
    adata_ref,
    **arches_params
)
vae_ref.train()
INFO     Training for 400 epochs
INFO     KL warmup for 400 epochs
Training...:   0%|          | 0/400 [00:00<?, ?it/s]
/usr/local/lib/python3.6/dist-packages/scvi/core/distributions/_negative_binomial.py:434: UserWarning: The value argument must be within the support of the distribution
  UserWarning,
Training...: 100%|██████████| 400/400 [04:49<00:00,  1.38it/s]
INFO     Training time:  289 s. / 400 epochs

Now we obtain the latent representation, and use Scanpy to visualize with UMAP.

[9]:
adata_ref.obsm["X_scVI"] = vae_ref.get_latent_representation()
sc.pp.neighbors(adata_ref, use_rep="X_scVI")
sc.tl.leiden(adata_ref)
sc.tl.umap(adata_ref)
[10]:
sc.pl.umap(
    adata_ref,
    color=["study", "cell_type"],
    frameon=False,
    ncols=1,
)
../../_images/user_guide_notebooks_scarches_scvi_tools_17_0.png

Online update with query

We can load a new model with the query data either using

  1. The saved reference model

  2. The instantiation of the reference model in memory

[11]:
# save the reference model
dir_path = "pancreas_model/"
vae_ref.save(dir_path, overwrite=True)
[12]:
# both are valid
vae_q = scvi.model.SCVI.load_query_data(
    adata_query,
    dir_path,
)
vae_q = scvi.model.SCVI.load_query_data(
    adata_query,
    vae_ref,
)
INFO     .obs[_scvi_labels] not found in target, assuming every cell is same
         category
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels']
INFO     Successfully registered anndata object containing 5387 cells, 1000
         vars, 5 batches, 1 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels']
INFO     Successfully registered anndata object containing 5387 cells, 1000
         vars, 5 batches, 1 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py:795: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
  logger_data_loc
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py:795: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
  logger_data_loc

This is a typical SCVI object, and after training, can be used in any defined way.

For training the query data, we recommend using a weight_decay of 0.0. This ensures the latent representation of the reference cells will remain exactly the same if passing them through this new query model.

[13]:
vae_q.train(n_epochs=200, weight_decay=0.0)
adata_query.obsm["X_scVI"] = vae_q.get_latent_representation()
INFO     Training for 200 epochs
INFO     KL warmup phase exceeds overall training phaseIf your applications rely
         on the posterior quality, consider training for more epochs or reducing
         the kl warmup.
INFO     KL warmup for 400 epochs
Training...:   0%|          | 0/200 [00:00<?, ?it/s]
/usr/local/lib/python3.6/dist-packages/scvi/core/distributions/_negative_binomial.py:434: UserWarning: The value argument must be within the support of the distribution
  UserWarning,
Training...: 100%|██████████| 200/200 [01:04<00:00,  3.10it/s]
INFO     Training is still in warming up phase. If your applications rely on the
         posterior quality, consider training for more epochs or reducing the kl
         warmup.
INFO     Training time:  64 s. / 200 epochs
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
[14]:
sc.pp.neighbors(adata_query, use_rep="X_scVI")
sc.tl.leiden(adata_query)
sc.tl.umap(adata_query)
[15]:
sc.pl.umap(
    adata_query,
    color=["study", "cell_type"],
    frameon=False,
    ncols=1,
)
../../_images/user_guide_notebooks_scarches_scvi_tools_24_0.png

Visualize reference and query

[16]:
adata_full = adata_query.concatenate(adata_ref)

The concatenated object has the latent representations of both reference and query, but we are also able to reobtain these values using the query model.

[17]:
adata_full.obsm["X_scVI"] = vae_q.get_latent_representation(adata_full)
INFO     Input adata not setup with scvi. attempting to transfer anndata setup
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels']
INFO     Successfully registered anndata object containing 15681 cells, 1000
         vars, 5 batches, 1 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py:795: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
  logger_data_loc
[18]:
sc.pp.neighbors(adata_full, use_rep="X_scVI")
sc.tl.leiden(adata_full)
sc.tl.umap(adata_full)
[19]:
sc.pl.umap(
    adata_full,
    color=["study", "cell_type"],
    frameon=False,
    ncols=1,
)
... storing 'study' as categorical
../../_images/user_guide_notebooks_scarches_scvi_tools_30_1.png

Online update of SCANVI

We’ll use the same Pancreas dataset, this time we set it up such that we register that the dataset has labels.

The advantage of SCANVI is that we’ll be able to predict the cell type labels of the query dataset. In the case of SCVI, a separate classifier (e.g., nearest-neighbor, random forest, etc.) would have to be trained on the reference latent space.

[20]:
scvi.data.setup_anndata(adata_ref, batch_key="study", labels_key="cell_type")
INFO     Using batches from adata.obs["study"]
INFO     Using labels from adata.obs["cell_type"]
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 10294 cells, 1000
         vars, 3 batches, 8 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.

Train reference

[21]:
# unlabeled category does not exist in adata.obs[labels_key]
# so all cells are treated as labeled
vae_ref = scvi.model.SCANVI(
    adata_ref,
    unlabeled_category="Unknown",
    **arches_params,
)
[22]:
vae_ref.train()
INFO     Training Unsupervised Trainer for 400 epochs.
INFO     Training SemiSupervised Trainer for 10 epochs.
INFO     KL warmup for 400 epochs
Training...:   0%|          | 0/400 [00:00<?, ?it/s]
/usr/local/lib/python3.6/dist-packages/scvi/core/distributions/_negative_binomial.py:434: UserWarning: The value argument must be within the support of the distribution
  UserWarning,
Training...: 100%|██████████| 400/400 [04:57<00:00,  1.34it/s]
INFO     Training time:  297 s. / 400 epochs
INFO     KL warmup phase exceeds overall training phaseIf your applications rely
         on the posterior quality, consider training for more epochs or reducing
         the kl warmup.
INFO     KL warmup for 400 epochs
Training...: 100%|██████████| 10/10 [00:23<00:00,  2.36s/it]
INFO     Training is still in warming up phase. If your applications rely on the
         posterior quality, consider training for more epochs or reducing the kl
         warmup.
INFO     Training time:  23 s. / 10 epochs
[23]:
adata_ref.obsm["X_scANVI"] = vae_ref.get_latent_representation()
sc.pp.neighbors(adata_ref, use_rep="X_scANVI")
sc.tl.leiden(adata_ref)
sc.tl.umap(adata_ref)
[24]:
sc.pl.umap(
    adata_ref,
    color=["study", "cell_type"],
    frameon=False,
    ncols=1,
)
../../_images/user_guide_notebooks_scarches_scvi_tools_37_0.png

Online update with query

[25]:
dir_path_scan = "pancreas_model_scanvi/"
vae_ref.save(dir_path_scan, overwrite=True)

SCANVI is a bit trickier. Because the query data has the true cell type labels, we need to do something a bit hacky and tell the model which cells are labeled and which are unlabeled. We will make this cleaner in a future release.

[26]:
vae_q = scvi.model.SCANVI.load_query_data(
    adata_query,
    dir_path_scan,
)
vae_q._unlabeled_indices = np.arange(adata_query.n_obs)
vae_q._labeled_indices = []
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels']
INFO     Successfully registered anndata object containing 5387 cells, 1000
         vars, 5 batches, 8 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py:795: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
  logger_data_loc
[27]:
vae_q.train(
    n_epochs_semisupervised=100,
    train_base_model=False,
    semisupervised_trainer_kwargs=dict(weight_decay=0.0),
    frequency=10,
)
INFO     Training Unsupervised Trainer for 400 epochs.
INFO     Training SemiSupervised Trainer for 100 epochs.
/usr/local/lib/python3.6/dist-packages/scvi/core/distributions/_negative_binomial.py:434: UserWarning: The value argument must be within the support of the distribution
  UserWarning,
INFO     KL warmup phase exceeds overall training phaseIf your applications rely
         on the posterior quality, consider training for more epochs or reducing
         the kl warmup.
INFO     KL warmup for 400 epochs
Training...: 100%|██████████| 100/100 [01:20<00:00,  1.24it/s]
INFO     Training is still in warming up phase. If your applications rely on the
         posterior quality, consider training for more epochs or reducing the kl
         warmup.
INFO     Training time:  71 s. / 100 epochs
[28]:
adata_query.obsm["X_scANVI"] = vae_q.get_latent_representation()
adata_query.obs["predictions"] = vae_q.predict()
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
[29]:
df = adata_query.obs.groupby(["cell_type", "predictions"]).size().unstack(fill_value=0)
norm_df = df / df.sum(axis=0)

plt.figure(figsize=(8, 8))
_ = plt.pcolor(norm_df)
_ = plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, rotation=90)
_ = plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xlabel("Predicted")
plt.ylabel("Observed")


[29]:
Text(0, 0.5, 'Observed')
../../_images/user_guide_notebooks_scarches_scvi_tools_44_1.png

Analyze reference and query

[30]:
adata_full = adata_query.concatenate(adata_ref)

This just makes a column in the anndata corresponding to if the data come from the reference or query sets.

[31]:
adata_full.obs.batch.cat.rename_categories(["Query", "Reference"], inplace=True)
[32]:
full_predictions = vae_q.predict(adata_full)
print("Acc: {}".format(np.mean(full_predictions == adata_full.obs.cell_type)))

# adata_full.obs["predictions"] = ["Known"] * adata_ref.n_obs + list(adata_query.obs["predictions"].to_numpy())
adata_full.obs["predictions"] = full_predictions
INFO     Input adata not setup with scvi. attempting to transfer anndata setup
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels']
INFO     Successfully registered anndata object containing 15681 cells, 1000
         vars, 5 batches, 8 labels, and 0 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
WARNING  Make sure the registered X field in anndata contains unnormalized count
         data.
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py:795: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
  logger_data_loc
Acc: 0.8886550602640138
[33]:
sc.pp.neighbors(adata_full, use_rep="X_scANVI")
sc.tl.leiden(adata_full)
sc.tl.umap(adata_full)
[34]:
sc.pl.umap(
    adata_full,
    color=["study", "cell_type"],
    frameon=False,
    ncols=1,
)
... storing 'study' as categorical
... storing 'predictions' as categorical
../../_images/user_guide_notebooks_scarches_scvi_tools_51_1.png
[35]:
ax = sc.pl.umap(
    adata_full,
    frameon=False,
    show=False,
)
sc.pl.umap(
    adata_full[:adata_query.n_obs],
    color=["predictions"],
    frameon=False,
    title="Query predictions",
    ax=ax,
    alpha=0.7
)

ax = sc.pl.umap(
    adata_full,
    frameon=False,
    show=False,
)
sc.pl.umap(
    adata_full[:adata_query.n_obs],
    color=["cell_type"],
    frameon=False,
    title="Query observed cell types",
    ax=ax,
    alpha=0.7
)
Trying to set attribute `.uns` of view, copying.
../../_images/user_guide_notebooks_scarches_scvi_tools_52_1.png
../../_images/user_guide_notebooks_scarches_scvi_tools_52_2.png

Online update of TOTALVI

This workflow works very similarly for TOTALVI. Here we demonstrate how to build a CITE-seq reference and use scRNA-seq only data as the query.

Assemble data

For totalVI, we will treat two CITE-seq PBMC datasets from 10X Genomics as the reference. These datasets were already filtered for outliers like doublets, as described in the totalVI manuscript. There are 14 proteins in the reference.

[36]:
adata_ref = scvi.data.pbmcs_10x_cite_seq(run_setup_anndata=False)
INFO     Downloading file at data/pbmc_10k_protein_v3.h5ad
Downloading...: 24938it [00:00, 109069.63it/s]
INFO     Downloading file at data/pbmc_5k_protein_v3.h5ad
Downloading...: 100%|██████████| 18295/18295.0 [00:00<00:00, 93429.07it/s]
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.

In general, there will be some necessary data wrangling. For example, we need to provide totalVI with some protein data – and when it’s all zeros, totalVI identifies that the protein data is missing in this “batch”.

It could have also been the case that only some of the protein data was missing, in which case we would add zeros for each of the missing proteins.

[37]:
adata_query = scvi.data.dataset_10x("pbmc_10k_v3")
adata_query.obs["batch"] = "PBMC 10k (RNA only)"
# put matrix of zeros for protein expression (considered missing)
pro_exp = adata_ref.obsm["protein_expression"]
data = np.zeros((adata_query.n_obs, pro_exp.shape[1]))
adata_query.obsm["protein_expression"] = pd.DataFrame(columns=pro_exp.columns, index=adata_query.obs_names, data = data)
INFO     Downloading file at data/10X/pbmc_10k_v3/filtered_feature_bc_matrix.h5
Downloading...: 37492it [00:02, 14659.13it/s]
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.

We do some light QC filtering on the query dataset (doublets, mitochondrial, etc.)

[38]:
import scrublet as scr
scrub = scr.Scrublet(adata_query.X)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata_query = adata_query[~predicted_doublets].copy()

adata_query.var['mt'] = adata_query.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata_query, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata_query = adata_query[adata_query.obs.pct_counts_mt < 15, :].copy()
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.32
Detected doublet rate = 4.9%
Estimated detectable doublet fraction = 56.4%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 8.7%
Elapsed time: 16.1 seconds

Now to concatenate the objects, which intersects the genes properly.

[39]:
adata_full = anndata.concat([adata_ref, adata_query])
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.

And split them back up into reference and query (but now genes are properly aligned between objects).

[40]:
adata_ref = adata_full[np.logical_or(adata_full.obs.batch == "PBMC5k", adata_full.obs.batch == "PBMC10k")].copy()
adata_query = adata_full[adata_full.obs.batch == "PBMC 10k (RNA only)"].copy()

Observation names are not unique. To make them unique, call `.obs_names_make_unique`.

We run gene selection on the reference, because that’s all that will be avaialble to us at first.

[41]:
sc.pp.highly_variable_genes(
    adata_ref,
    n_top_genes=4000,
    flavor="seurat_v3",
    batch_key="batch",
    subset=True,
)
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.

Finally, we use these selected genes for the query dataset as well.

[42]:
adata_query = adata_query[:, adata_ref.var_names].copy()

Train reference

[43]:
scvi.data.setup_anndata(
    adata_ref,
    batch_key="batch",
    protein_expression_obsm_key="protein_expression"
)
INFO     Using batches from adata.obs["batch"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Using protein expression from adata.obsm['protein_expression']
INFO     Using protein names from columns of adata.obsm['protein_expression']
INFO     Successfully registered anndata object containing 10849 cells, 4000
         vars, 2 batches, 1 labels, and 14 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.
[44]:
arches_params = dict(
    use_layer_norm="both",
    use_batch_norm="none",
)
vae_ref = scvi.model.TOTALVI(
    adata_ref,
    use_cuda=True,
    **arches_params
)
[45]:
vae_ref.train()
INFO     Training for 400 epochs.
INFO     KL warmup for 8136.75 iterations
Training...:  76%|███████▌  | 303/400 [04:33<01:28,  1.10it/s]INFO     Reducing LR on epoch 303.
Training...:  88%|████████▊ | 354/400 [05:19<00:41,  1.10it/s]INFO     Reducing LR on epoch 354.
Training...: 100%|██████████| 400/400 [06:00<00:00,  1.11it/s]
INFO     Training time:  340 s. / 400 epochs
[46]:
adata_ref.obsm["X_totalVI"] = vae_ref.get_latent_representation()
sc.pp.neighbors(adata_ref, use_rep="X_totalVI")
sc.tl.umap(adata_ref, min_dist=0.4)
[47]:
sc.pl.umap(
    adata_ref,
    color=["batch"],
    frameon=False,
    ncols=1,
    title="Reference"
)
... storing 'batch' as categorical
../../_images/user_guide_notebooks_scarches_scvi_tools_75_1.png
[48]:
dir_path = "saved_model/"
vae_ref.save(dir_path, overwrite=True)

Online update with query

[49]:
vae_q = scvi.model.TOTALVI.load_query_data(
    adata_query,
    dir_path,
    freeze_expression=True
)
INFO     .obs[_scvi_labels] not found in target, assuming every cell is same
         category
INFO     Found batches with missing protein expression
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels', 'protein_expression']
INFO     Successfully registered anndata object containing 10135 cells, 4000
         vars, 3 batches, 1 labels, and 14 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
[50]:
vae_q.train(200, weight_decay=0.0)
INFO     Training for 200 epochs.
INFO     KL warmup for 7601.25 iterations
Training...: 100%|██████████| 200/200 [03:40<00:00,  1.10s/it]
INFO     Training is still in warming up phase. If your applications rely on the
         posterior quality, consider training for more epochs or reducing the kl
         warmup.
INFO     Training time:  211 s. / 200 epochs
[51]:
adata_query.obsm["X_totalVI"] = vae_q.get_latent_representation()
sc.pp.neighbors(adata_query, use_rep="X_totalVI")
sc.tl.umap(adata_query, min_dist=0.4)

Impute protein data for query and visualize

Now that we have updated with the query, we can impute the proteins that were observed in the reference, using the transform_batch parameter.

[52]:
_, imputed_proteins = vae_q.get_normalized_expression(
    adata_query,
    n_samples=25,
    return_mean=True,
    transform_batch=["PBMC10k", "PBMC5k"],
)

Very quickly we can identify the major expected subpopulations of B cells, CD4 T cells, CD8 T cells, monocytes, etc.

[53]:
adata_query.obs = pd.concat([adata_query.obs, imputed_proteins], axis=1)

sc.pl.umap(
    adata_query,
    color=imputed_proteins.columns,
    frameon=False,
    ncols=3,
)
... storing 'batch' as categorical
../../_images/user_guide_notebooks_scarches_scvi_tools_85_1.png

Visualize reference and query

[54]:
adata_full_new = adata_query.concatenate(adata_ref, batch_key="none")
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
[55]:
adata_full_new.obsm["X_totalVI"] = vae_q.get_latent_representation(adata_full_new)
sc.pp.neighbors(adata_full_new, use_rep="X_totalVI")
sc.tl.umap(adata_full_new, min_dist=0.3)
INFO     Input adata not setup with scvi. attempting to transfer anndata setup
INFO     Found batches with missing protein expression
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var',
         'labels', 'protein_expression']
INFO     Successfully registered anndata object containing 20984 cells, 4000
         vars, 3 batches, 1 labels, and 14 proteins. Also registered 0 extra
         categorical covariates and 0 extra continuous covariates.
[56]:
_, imputed_proteins_all = vae_q.get_normalized_expression(
    adata_full_new,
    n_samples=25,
    return_mean=True,
    transform_batch=["PBMC10k", "PBMC5k"],
)

for i, p in enumerate(imputed_proteins_all.columns):
    adata_full_new.obs[p] = imputed_proteins_all[p].to_numpy().copy()
[57]:
perm_inds = np.random.permutation(np.arange(adata_full_new.n_obs))
sc.pl.umap(
    adata_full_new[perm_inds],
    color=["batch"],
    frameon=False,
    ncols=1,
    title="Reference and query"
)
/usr/local/lib/python3.6/dist-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Initializing view as actual.
  "Initializing view as actual.", ImplicitModificationWarning
Trying to set attribute `.obs` of view, copying.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
... storing 'batch' as categorical
../../_images/user_guide_notebooks_scarches_scvi_tools_90_1.png
[58]:
ax = sc.pl.umap(
    adata_full_new,
    color="batch",
    groups=["PBMC 10k (RNA only)"],
    frameon=False,
    ncols=1,
    title="Reference and query",
    alpha=0.4
)
... storing 'batch' as categorical
../../_images/user_guide_notebooks_scarches_scvi_tools_91_1.png
[59]:
sc.pl.umap(
    adata_full_new,
    color=imputed_proteins_all.columns,
    frameon=False,
    ncols=3,
    vmax="p99"
)
../../_images/user_guide_notebooks_scarches_scvi_tools_92_0.png