Input Format

This vignette provides a basic example of using Bisque to decompose bulk expression. Bisque offers two modes of operation: Reference-based and Marker-based decomposition. We will provide brief examples of both.

library(Biobase)
library(BisqueRNA)

Input Format

Bisque requires expression data in the ExpressionSet format from the Biobase package.

Bulk RNA-seq data can be converted from a matrix (columns are samples, rows are genes) to an ExpressionSet as follows:

bulk.eset <- Biobase::ExpressionSet(assayData = bulk.matrix)

Single-cell data requires additional information in the ExpressionSet, specificially cell type labels and individual labels. Individual labels indicate which individual each cell originated from. To add this information, Biobase requires it to be stored in a data frame format. Assuming we have character vectors of cell type labels (cell.type.labels) and individual labels (individual.labels), we can convert scRNA-seq data (with counts also in matrix format) as follows:

sample.ids <- colnames(sc.counts.matrix)
# individual.ids and cell.types should be in the same order as in sample.ids
sc.pheno <- data.frame(check.names=F, check.rows=F,
                       stringsAsFactors=F,
                       row.names=sample.ids,
                       SubjectName=individual.labels,
                       cellType=cell.type.labels)
sc.meta <- data.frame(labelDescription=c("SubjectName",
                                         "cellType"),
                      row.names=c("SubjectName",
                                  "cellType"))
sc.pdata <- new("AnnotatedDataFrame",
                data=sc.pheno,
                varMetadata=sc.meta)
sc.eset <- Biobase::ExpressionSet(assayData=sc.counts.matrix,
                                  phenoData=sc.pdata)

If your single-cell data (from 10x platform) is in a Seurat object with cell type assignments, Bisque includes a function that will automatically convert this object to an ExpressionSet:

sc.eset <- BisqueRNA::SeuratToExpressionSet(seurat.obj, delimiter="-", position=2, version="v3")

The delimiter and position arguments describe the barcode format of 10x single-cell data. For example, barcodes of “ATCGATCG-1” and “ATGCAAGT-2” have the individual ID in position 2 after splitting by the delimiter '-'.

Here is an example of input single-cell and bulk data for 2 individuals with 10 cells sequenced each:

sampleNames(sc.eset)

##  [1] "AAAA-1" "TAAA-1" "CAAA-1" "GAAA-1" "ATAA-1" "TTAA-1" "CTAA-1" "GTAA-1"
##  [9] "ACAA-1" "TCAA-1" "AAAA-2" "TAAA-2" "CAAA-2" "GAAA-2" "ATAA-2" "TTAA-2"
## [17] "CTAA-2" "GTAA-2" "ACAA-2" "TCAA-2"

sc.eset$SubjectName

##  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "2" "2" "2" "2" "2" "2" "2" "2" "2"
## [20] "2"

sc.eset$cellType

##  [1] "Neurons"          "Neurons"          "Neurons"          "Neurons"         
##  [5] "Neurons"          "Neurons"          "Neurons"          "Astrocytes"      
##  [9] "Astrocytes"       "Astrocytes"       "Neurons"          "Neurons"         
## [13] "Neurons"          "Neurons"          "Astrocytes"       "Astrocytes"      
## [17] "Astrocytes"       "Astrocytes"       "Oligodendrocytes" "Microglia"

sampleNames(bulk.eset)

## [1] "1" "2"

Note that if you have samples with both single-cell and bulk RNA-seq data, their IDs should be found in both sc.eset$SubjectName and sampleNames(bulk.eset) .

Reference-based decomposition

We will use data simulated under a simple model (code for SimulateData() can be found in R/simulation.R). We simulate single-cell and bulk RNA-seq counts for 10 individuals. We remove 5 individuals from the single-cell data. We will estimate the cell composition for these 5 individuals.

cell.types <- c("Neurons", "Astrocytes", "Oligodendrocytes", "Microglia", "Endothelial Cells")
avg.props <- c(.5, .2, .2, .07, .03)
sim.data <- SimulateData(n.ind=10, n.genes=100, n.cells=500, cell.types=cell.types, avg.props=avg.props)
sc.eset <- sim.data$sc.eset[,sim.data$sc.eset$SubjectName %in% as.character(6:10)]
bulk.eset <- sim.data$bulk.eset
true.props <- sim.data$props
markers <- sim.data$markers

By default, Bisque uses all genes for decomposition. However, you may supply a list of genes (such as marker genes) to be used with the markers parameter. Also, since we have samples with both bulk and single-cell RNA-seq data, we set the use.overlap parameter to TRUE. If there are no overlapping samples, you can set this parameter to FALSE (we expect performance to be better if overlapping samples are available).

Here's how to call the reference-based decomposition method:

res <- BisqueRNA::ReferenceBasedDecomposition(bulk.eset, sc.eset, markers=NULL, use.overlap=TRUE)

## Decomposing into 5 cell types.

## Found 5 samples with bulk and single-cell expression.

## Remaining 5 bulk samples will be decomposed.

## Using 100 genes in both bulk and single-cell expression.

## Converting single-cell counts to CPM and filtering zero variance genes.

## Filtered 6 zero variance genes.

## Converting bulk counts to CPM and filtering unexpressed genes.

## Filtered 6 unexpressed genes.

## Generating single-cell based reference from 2500 cells.

## Learning bulk transformation from overlapping samples.

## Applying transformation to bulk samples and decomposing.

A list is returned with decomposition estimates in slot bulk.props.

ref.based.estimates <- res$bulk.props
knitr::kable(ref.based.estimates, digits=2)

	1	2	3	4	5
Astrocytes	0.19	0.18	0.19	0.20	0.18
Endothelial Cells	0.03	0.04	0.03	0.03	0.04
Microglia	0.09	0.07	0.06	0.08	0.07
Neurons	0.50	0.54	0.52	0.50	0.51
Oligodendrocytes	0.20	0.17	0.20	0.20	0.20

Just to make sure this worked, we can correlate all the estimates with the true proportions.

r <- cor(as.vector(ref.based.estimates), 
         as.vector(true.props[row.names(ref.based.estimates),colnames(ref.based.estimates)]))
knitr::knit_print(sprintf("R: %f", r))

## [1] "R: 0.998478"

Marker-based decomposition

BisqueMarker can provide estimates of relative cell type abundances using only known marker genes when a reference profile is not available. Marker genes are stored in a data frame with columns that specify gene, cluster that the gene is a marker for, and an optional column for weights (typically fold-change). Here's what this data frame might look like:

gene	cluster	avg_logFC
Gene 1	Neurons	0.82
Gene 2	Neurons	0.59
Gene 3	Astrocytes	0.68
Gene 4	Oligodendrocytes	0.66
Gene 5	Microglia	0.71
Gene 6	Endothelial Cells	0.62

Here's how to call the marker-based decomposition method:

res <- BisqueRNA::MarkerBasedDecomposition(bulk.eset, markers, weighted=F)

## Getting unique markers

## Estimating proportions for 5 cell types in 10 samples

## Filtered 6 zero variance genes.

## Using 6 genes for cell type Astrocytes;

## 100% of 6 marker genes correlate positively with PC1 for cell type Astrocytes

## Using 6 genes for cell type Endothelial Cells;

## 100% of 6 marker genes correlate positively with PC1 for cell type Endothelial Cells

## Using 5 genes for cell type Microglia;

## 100% of 5 marker genes correlate positively with PC1 for cell type Microglia

## Using 15 genes for cell type Neurons;

## 93% of 15 marker genes correlate positively with PC1 for cell type Neurons

## Using 10 genes for cell type Oligodendrocytes;

## 100% of 10 marker genes correlate positively with PC1 for cell type Oligodendrocytes

## Finished estimating cell type proportions using PCA

A list is returned with decomposition estimates in slot bulk.props.

marker.based.estimates <- res$bulk.props
knitr::kable(marker.based.estimates, digits = 2)

	1	2	3	4	5	6	7	8	9	10
Astrocytes	0.35	-0.93	0.84	3.46	-2.17	-0.62	-0.13	-0.93	2.93	-2.79
Endothelial Cells	1.78	-0.46	1.52	-0.29	2.55	-0.97	-2.57	0.05	-1.71	0.10
Microglia	1.73	-0.18	-2.01	0.70	0.07	-1.65	2.91	-0.65	1.98	-2.90
Neurons	-2.99	4.48	1.00	-2.56	-1.55	1.32	1.43	-2.68	0.35	1.20
Oligodendrocytes	1.71	-3.74	-0.18	-0.53	1.40	0.98	-2.64	3.21	-2.67	2.46

Note that these estimates are relative within each cell type, so you cannot immediately compare abundance estimates between cell types.

Just to make sure this worked, we can correlate these estimates with the scaled true proportions.

scaled.true.props <- t(scale(t(true.props)))[rownames(marker.based.estimates),]
r <- cor(as.vector(marker.based.estimates),
         as.vector(scaled.true.props))
knitr::knit_print(sprintf("R: %f", r))

## [1] "R: 0.924982"