Interoperability with R and Seurat

In this tutorial, we go over how to use basic scvi-tools functionality in R. However, for more involved analyses, we suggest using scvi-tools from Python. Checkout the Scanpy_in_R tutorial for instructions on converting Seurat objects to anndata.

This tutorial requires Reticulate. Please check out our installation guide for instructions on installing Reticulate and scvi-tools.

Loading and processing data with Seurat

We follow the basic Seurat tutorial for loading data and selecting highly variable genes.

Note: scvi-tools requires raw gene expression

[1]:
# We will work within the Seurat framework
library(Seurat)
library(SeuratData)
data("pbmc3k")
pbmc <- pbmc3k

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top2000 <- head(VariableFeatures(pbmc), 2000)
pbmc <- pbmc[top2000]


print(pbmc) # Seurat object
An object of class Seurat
2000 features across 2638 samples within 1 assay
Active assay: RNA (2000 features, 2000 variable features)

Converting Seurat object to AnnData

scvi-tools relies on the AnnData object. Here we show how to convert our Seurat object to anndata for scvi-tools.

[2]:
library(reticulate)

sc <- import('scanpy', convert = FALSE)
scvi <- import('scvi', convert = FALSE)
scvi$settings$progress_bar_style = 'tqdm'
[3]:
adata <- sc$AnnData(
  X   = t(as.matrix(GetAssayData(pbmc,slot='counts'))), #scVI requires raw counts
  obs = pbmc[[]],
  var = GetAssay(pbmc)[[]]
)
print(adata) # Note generally in Python, dataset conventions are obs x var
AnnData object with n_obs × n_vars = 2638 × 2000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'

Setup our AnnData for training

Reticulate allows us to call Python code from R, giving the ability to use all of scvi-tools in R. We encourage you to checkout their documentation and specifically the section on type conversions in order to pass arguments to Python functions.

In this section, we show how to setup the AnnData for scvi-tools, create the model, train the model, and get the latent representation. For a more in depth description of setting up the data, you can checkout our introductory tutorial as well as our data loading tutorial.

[4]:
# run seteup_anndata
scvi$data$setup_anndata(adata)

# create the model
model = scvi$model$SCVI(adata, use_cuda = TRUE)

# train the model
model$train()

# to specify the number of epochs when training:
# model$train(n_epochs = as.integer(400))

None
None

Getting the latent represenation and visualization

Here we get the latent representation of the model and save it back in our Seurat object. Then we run UMAP and visualize.

[5]:
# get the latent represenation
latent = model$get_latent_representation()

# put it back in our original Seurat object
latent <- as.matrix(latent)
rownames(latent) = colnames(pbmc)
pbmc[['scvi']] <- CreateDimReducObject(embeddings = latent, key = "scvi_", assay = DefaultAssay(pbmc))

Warning message:
“No columnames present in cell embeddings, setting to 'scvi_1:10'”
[6]:
# Find clusters, then run UMAP, and visualize
pbmc <- FindNeighbors(pbmc, dims = 1:10, reduction = 'scvi')
pbmc <- FindClusters(pbmc, resolution =1)

pbmc <- RunUMAP(pbmc, dims = 1:10, reduction = 'scvi', n.components = 2)

DimPlot(pbmc, reduction = "umap", pt.size = 3)

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2638
Number of edges: 93916

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.7378
Number of communities: 9
Elapsed time: 0 seconds
Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”
19:20:54 UMAP embedding parameters a = 0.9922 b = 1.112

19:20:54 Read 2638 rows and found 10 numeric columns

19:20:54 Using Annoy for neighbor search, n_neighbors = 30

19:20:54 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*

|

19:20:55 Writing NN index file to temp file /tmp/RtmpojZDS5/file336a4f56d149

19:20:55 Searching Annoy index using 1 thread, search_k = 3000

19:20:56 Annoy recall = 100%

19:20:57 Commencing smooth kNN distance calibration using 1 thread

19:20:57 Initializing from normalized Laplacian + noise

19:20:57 Commencing optimization for 500 epochs, with 95720 positive edges

19:21:03 Optimization finished

../../_images/user_guide_notebooks_scvi_in_R_10_3.png

Finding differentially expressed genes with scVI latent space

[7]:
# we need pandas in order to put the seurat clusters into our original anndata
pd <- import('pandas', convert = FALSE)

# workaround for adding seurat_clusters to our anndata
new_obs <- c(adata$obs, pd$DataFrame(pbmc[['seurat_clusters']]))
new_obs <- pd$concat(new_obs, axis =1)

adata$obs <- new_obs

Using our trained SCVI model, we call the differential_expression() method We pass seurat_clusters to the groupby argument and compare between cluster 1 and cluster 2.

The output of DE is a DataFrame with the bayes factors. Bayes factors > 3 have high probability of being differentially expressed. You can also set fdr_target, which will return the differentially expressed genes based on the posteior expected FDR.

[8]:
adata
AnnData object with n_obs × n_vars = 2638 × 2000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'seurat_annotations', 'percent.mt', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var', 'seurat_clusters'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    uns: '_scvi'
[9]:
DE <- model$differential_expression(adata, groupby='seurat_clusters', group1 = '1', group2 = '2')
[10]:
print(DE$head())
        proba_de  proba_not_de  bayes_factor  ...  raw_normalized_mean1  raw_normalized_mean2  comparison
S100A8       1.0           0.0     18.420681  ...              2.056203            238.700439      1 vs 2
TYROBP       1.0           0.0     18.420681  ...              3.094438            146.082336      1 vs 2
S100A9       1.0           0.0     18.420681  ...              4.327612            417.595062      1 vs 2
TYMP         1.0           0.0     18.420681  ...              4.309563             44.836384      1 vs 2
LST1         1.0           0.0     18.420681  ...              4.800844             61.356239      1 vs 2

[5 rows x 17 columns]

Integrating datasets with scVI

Here we integrate two datasets from Seurat’s Immune Alignment Vignette.

[11]:
data("ifnb")

# use seurat for variable gene selection
ifnb <- NormalizeData(ifnb, normalization.method = "LogNormalize", scale.factor = 10000)
ifnb[["percent.mt"]] <- PercentageFeatureSet(ifnb, pattern = "^MT-")
ifnb <- subset(ifnb, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
ifnb <- FindVariableFeatures(ifnb, selection.method = "vst", nfeatures = 2000)
top2000 <- head(VariableFeatures(ifnb), 2000)
ifnb <- ifnb[top2000]
[12]:
adata <- sc$AnnData(
  X   = t(as.matrix(GetAssayData(ifnb, slot='counts'))),
  obs = ifnb[[]]
)

print(adata)
AnnData object with n_obs × n_vars = 13997 × 2000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'stim', 'seurat_annotations', 'percent.mt'
[13]:
# run seteup_anndata, use column stim for batch
scvi$data$setup_anndata(adata, batch_key = 'stim')

# create the model
model = scvi$model$SCVI(adata)

# train the model
model$train()

# to specify the number of epochs when training:
# model$train(n_epochs = as.integer(400))

None
None
[14]:
# get the latent represenation
latent = model$get_latent_representation()

# put it back in our original Seurat object
latent <- as.matrix(latent)
rownames(latent) = colnames(ifnb)
ifnb[['scvi']] <- CreateDimReducObject(embeddings = latent, key = "scvi_", assay = DefaultAssay(ifnb))
Warning message:
“No columnames present in cell embeddings, setting to 'scvi_1:10'”
[15]:
library(cowplot)
# for jupyter notebook
options(repr.plot.width=10, repr.plot.height=8)

ifnb <- RunUMAP(ifnb, dims = 1:10, reduction = 'scvi', n.components = 2)
p1 <- DimPlot(ifnb, reduction = "umap", group.by = "stim", pt.size=2)
plot_grid(p1)
10:10:30 UMAP embedding parameters a = 0.9922 b = 1.112

10:10:30 Read 13997 rows and found 10 numeric columns

10:10:30 Using Annoy for neighbor search, n_neighbors = 30

10:10:30 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*

|

10:10:34 Writing NN index file to temp file /tmp/RtmpIX2mNg/file4cf831829dad

10:10:34 Searching Annoy index using 1 thread, search_k = 3000

10:10:43 Annoy recall = 100%

10:10:44 Commencing smooth kNN distance calibration using 1 thread

10:10:45 Initializing from normalized Laplacian + noise

10:10:46 Commencing optimization for 200 epochs, with 517208 positive edges

10:10:57 Optimization finished

../../_images/user_guide_notebooks_scvi_in_R_22_1.png
[16]:
options(repr.plot.width=12, repr.plot.height=10)

FeaturePlot(ifnb, features = c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A",
    "CCL2", "PPBP"), min.cutoff = "q9")
Warning message in FetchData(object = object, vars = c(dims, "ident", features), :
“The following requested variables were not found: CD3D”
../../_images/user_guide_notebooks_scvi_in_R_23_1.png
[17]:
FeaturePlot(ifnb, features = c("GNLY", "IFI6"), split.by = "stim", max.cutoff = 3,
    cols = c("grey", "red"))
../../_images/user_guide_notebooks_scvi_in_R_24_0.png