Measurements of expression in single-cell RNA-Sequencing often suffer from a large systematic bias. A major source of this bias is the cell-cycle which introduces within-cell-type heterogeneity that can obscure the differences in expression between cell-types. The package ccRemover provides a method for identifying and removing the effects of the cell-cycle from single-cell RNA-Seq data. This vignette explains the use of the ccRemover r-package which implements the ccRemover algorithm Barron and Li (2016).

1 Preliminaries

1.1 Normalized Data Matrix

As input the ccRemover package expects a normalized, log-transformed and centered matrix of gene expression measurements for single cells obtained from single-cell RNA-Seq (scRNA-Seq) or another high-throughput sequencing experiment. The genes should correspond to the rows and the cells to the columns, that is the measurement in the \(i\)-th row and the \(j\)-th column of the matrix contains the expression of gene \(i\) for cell \(j\).

It is important that the data is normalized for sequencing depth prior to analysis or this may interfere with the identification of the cell-cycle components. From experience we have found that log-transforming the data prior to analysis leads to better results and more accurate cell-cycle removal. The data should be centered on a gene-by-gene basis, this prevents any small set of genes from dominating the principal components used to capture the cell-cycle effect.

To demonstrate the preprocessing steps and the application of ccRemover we will use the differentiating T-helper cell data set that is presented as the first example in the original paper and was retrieved from the supplementary material of Buettner et al. (2015). It was originally generated by Mahata B. et al. (2014) This data set contains normalized and log transformed gene expression measurements for 81 cells and 14,147 genes. First we must load the data into R.

data(t.cell_data)

To check the data is in the appropriate format we can view the file with a text editor or use r.

head(t.cell_data[,1:5])

##        Cell 1  Cell 2  Cell 3  Cell 4 Cell 5
## Gnai3 3.12560 0.86096 2.62610 3.25490 3.7152
## Cdc45 3.09290 0.11331 3.51750 0.47994 2.5712
## Narf  0.25414 0.00000 1.79130 0.00000 1.4729
## Klf6  1.66590 3.00130 1.92170 1.93360 4.1109
## Scmh1 0.25414 0.11331 0.00000 0.00000 1.7960
## Wnt3  0.52967 0.27746 0.61211 0.60523 3.5357

Here the data has genes as rows and cells as columns which is as required for ccRemover. If you have data which is in the opposite format then it must be transposed prior to applying ccRemover. This data is also already normalized and log-transformed. Next we must check that the data has been centered on a gene-by-gene basis. This is easy to see from a summary of the row means.

summary(apply(t.cell_data,1, mean))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0392  0.8990  1.5738  1.6202  2.2817  5.1372

This data is not centered as can be seen from the non-zero row means above. We will center this data on a gene-by-gene basis and then recheck the row means to ensure the data has been centered correctly. The mean values will also be stored so that they can be added back into the cleaned data matrix

mean_gene_exp <- rowMeans(t.cell_data)
t_cell_data_cen <- t.cell_data - mean_gene_exp
summary(apply(t_cell_data_cen,1,mean))

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -4.441e-16 -5.211e-17 -6.924e-19 -1.453e-18  4.661e-17  3.838e-16

From the summary of the row means we can see that the centering has been effective and the genes now have mean 0 (or very close to it).

1.2 The cell-cycle genes

In order to remove the cell-cycle effect from the data ccRemover uses sets of genes annotated to the cell-cycle and not annotated to the cell-cycle. ccRemover uses the set of genes which are not annotated to the cell-cycle to identify the main effects that are present in the data. It then uses the cell-cycle annotated genes to identify which of the effects are related to the cell-cycle before removing them.

These genes can be identified using your own methods or using ccRemover’s built-in cell-cycle gene identifier. This information is entered into ccRemover as a vector of length \(n\), ‘if_cc’, where \(n\) is the number of genes in the data matrix and ‘if_cc[i] == TRUE’ if gene \(i\) is a gene annotated to the cell-cycle or ‘if_cc[i] == FALSE’ if gene \(i\) is not annotated to the cell-cycle. In other words ‘if_cc’ is an indicator vector indicating whether a gene is a cell-cycle gene or not.

If you are using ccRemover to remove another source of unwanted variation in the data that is not the cell-cycle then you will need to determine yourself which genes are effected by the feature which you are interested in removing. Thus the vector will have entries ‘if_cc[i] == TRUE’ if gene \(i\) is affected by the feature of interest and ‘if_cc[i] == FALSE’ if gene \(i\) is not affected by the feature of interest.

To identify the cell-cycle genes in your own data set using ccRemovers built in function we recommend following the procedure below. First extract the gene names from the data matrix:

gene_names <- rownames(t_cell_data_cen)

Next we use ccRemovers built-in function to identify the indices of the cell-cycle genes from the data. This function requires the additional arguments ‘species’ and ‘name_type’, these arguments correspond to the species of the samples and the format of the gene ID’s respectively. Currently ccRemover is able to identify genes from Homo Sapiens (‘“human”’) and Mus Musculus (‘“mouse”’). In addition it is able to function using Ensembl Gene IDs (‘“ensembl”’), HGNC/MGI symbols (‘“symbol”’), Entrez Gene IDs (‘“entrez”’) and Unigene IDs (‘“unigene”’). If your gene ID’s are not in one of these formats then they must be converted prior to using ccRemover’s built-in gene indexer, Biomart provides a great data repository and user interface from which it is possible to download the data necessary for Gene ID conversion. If your names are of a different type but you have identified your own set of cell-cycle related genes and just need to use the main ccRemover function then ccRemover only needs to be provided with the vector described above and no ID conversion is necessary.

This function identifies which of the gene names are annotated to the cell-cycle and returns a list of the indices of the cell-cycle genes. As the samples in this data set are from a mouse and the gene names are MGI symbols we select ‘species = “mouse”’ and ‘name_type = “symbols”’.

cell_cycle_gene_indices <- gene_indexer(gene_names, species = "mouse", 
                                        name_type = "symbols" )

## Invalid name type input. Switching to NULLNo name format input.
## Checking to see if match can be found:
## Best guess is  symbol  IDs
## 751  matches out of a possible  7073

length(cell_cycle_gene_indices)

## [1] 751

Here we have identified 751 of the 7,073 genes as cell-cycle related genes.

Finally we must create the vector which will be used in the main ccRemover procedure. This step should be followed if you have calculated your own gene indices as well. Just sub in your own set of indices in place of ‘cell_cycle_gene_indices’ in the code below. We first create a vector of ‘FALSE’ values of length ‘n’ to store the values and then change the entries corresponding to the cell-cycle genes to ‘TRUE’. Finally we summarize the vector to ensure the correct number of cell-cycle genes has been assigned.

if_cc <- rep(FALSE,nrow(t_cell_data_cen)) 
if_cc[cell_cycle_gene_indices] <- TRUE
summary(if_cc)

##    Mode   FALSE    TRUE 
## logical    6322     751

Here we see that there are 751 values of ‘TRUE’ in the ‘if_cc’ vector and 6,322 values of ‘FALSE’ as expected.

1.3 Putting it Together

Now that we have our normalized, centered and log-transformed data matrix along with our true/false vector indicating which genes are cell-cycle genes we must put them into a list prior to applying the main ccRemover function. The data matrix in the format described above should be saved as ‘x’ and the ‘if_cc’ vector should be saved as ‘if_cc’

dat <- list(x=t_cell_data_cen, if_cc=if_cc)

Now we are ready to remove the cell-cycle effects from the data set.

2 ccRemover

2.1 Applying ccRemover

Using the data list which has been created in the previous step we can now run ccRemover. Using the ‘if_cc’ vector ccRemover splits the data into two sets, the cell-cycle annotated genes and the genes not annotated to the cell-cycle, which we call the control genes. ccRemover identifies the main components of variation in the control genes using a simple principal components analysis. These are then compared with the cell-cycle genes to identify which of them are cell-cycle related. Finally the effect of the components identified as cell-cycle related is removed from the data. Please refer to the original manuscript for a detailed explanation of the method and examples of its application.

As ccRemover runs it will print a table displaying the loadings of each component on the control genes (xn.load), the cell-cycle genes (xy.load), the difference between the two (diff.load) and the t-score for the difference in loading (t.load.boot) for each iteration. Those components which have a t-score greater than the cutoff (default value = 3) will be classified as cell-cycle effects and removed from the data.

For this vigenette we turn off the progress bar, tracking the bootstrap repitions, as it does not work well with R-markdown.

xhat <- ccRemover(dat, bar=FALSE)

## 0.1061784  of genes are cell-cycle genes
## Iteration  1 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  4.764066 5.2458626  0.48179702   7.1284190
## PC2  1.899494 1.9400189  0.04052475   0.6857840
## PC3  1.307673 1.4149245  0.10725191   1.6297520
## PC4  1.134286 1.3889111  0.25462511   1.9128193
## PC5  1.108502 1.1296735  0.02117137   0.1604851
## PC6  1.198600 1.1571330 -0.04146651  -0.4073728
## PC7  1.076888 0.9899750 -0.08691289  -1.1274758
## PC8  1.045047 0.9508303 -0.09421625  -1.2408157
## PC9  1.209063 1.2256193  0.01655664   0.2293425
## PC10 1.167671 1.2846210  0.11695048   1.4449840
## The follow components are removed: 1
## 
## Iteration  2 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  1.899494 1.9400189  0.04052475   0.8398206
## PC2  1.307673 1.4149245  0.10725191   1.4967569
## PC3  1.134286 1.3889111  0.25462511   1.9382202
## PC4  1.108502 1.1296735  0.02117137   0.1723988
## PC5  1.198600 1.1571330 -0.04146651  -0.4817683
## PC6  1.076888 0.9899750 -0.08691289  -1.2695125
## PC7  1.045047 0.9508303 -0.09421625  -1.2992680
## PC8  1.209063 1.2256193  0.01655664   0.2189583
## PC9  1.167671 1.2846210  0.11695048   1.4874693
## PC10 1.009986 0.9815659 -0.02841966  -0.4321327
## No more cell-cycle effect is detected.

For this data set we see that ccRemover identifies the first principal component as a cell-cycle effect on its first iteration (t.load.boot = 7.13). Once this component has been removed ccRemover does not identify any other cell-cycle effects present in the data on the second iteration. The ccRemover function outputs a transformed data matrix, ‘xhat’ which is free of cell-cycle effects. Further analysis can then be carried out on this data set with the confounding effects of the cell-cycle present. This can greatly improve the performance of clustering algorithms on the data.

The final step here is to add the mean values back to the cleaned data matrix:

xhat <- xhat + mean_gene_exp

2.2 Settings

If you choose to run ccRemover not using the default settings the following options are available to you:

xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 10, bar=FALSE)

## 0.1061784  of genes are cell-cycle genes
## Iteration  1 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  4.764066 5.2458626  0.48179702   7.2438365
## PC2  1.899494 1.9400189  0.04052475   0.7154375
## PC3  1.307673 1.4149245  0.10725191   1.5429677
## PC4  1.134286 1.3889111  0.25462511   1.8296309
## PC5  1.108502 1.1296735  0.02117137   0.1681040
## PC6  1.198600 1.1571330 -0.04146651  -0.4494239
## PC7  1.076888 0.9899750 -0.08691289  -1.3333344
## PC8  1.045047 0.9508303 -0.09421625  -1.3052767
## PC9  1.209063 1.2256193  0.01655664   0.2126655
## PC10 1.167671 1.2846210  0.11695048   1.4831892
## The follow components are removed: 1
## 
## Iteration  2 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  1.899494 1.9400189  0.04052475   0.8044851
## PC2  1.307673 1.4149245  0.10725191   1.6110481
## PC3  1.134286 1.3889111  0.25462511   1.9053519
## PC4  1.108502 1.1296735  0.02117137   0.1646352
## PC5  1.198600 1.1571330 -0.04146651  -0.4511874
## PC6  1.076888 0.9899750 -0.08691289  -1.2669308
## PC7  1.045047 0.9508303 -0.09421625  -1.2977739
## PC8  1.209063 1.2256193  0.01655664   0.2083617
## PC9  1.167671 1.2846210  0.11695048   1.4355563
## PC10 1.009986 0.9815659 -0.02841966  -0.4007874
## No more cell-cycle effect is detected.

The ‘cutoff’ is used to determine which of the effects are cell-cycle effects. The default and recommended value is 3, which roughly corresponds to a p-value of 0.01. For data sets which have very low levels of cell-cycle activity this value can be lowered to increase the detection of cell-cycle effects. Example 3 in the original manuscript was a case where a lower value of the cutoff was necessary.
The ‘max.it’ value is the maximum number of iterations of the method. ccRemover will stop whenever it detects no more significant effects present in the data or it reaches its maximum number of iterations. The default value is 4 but we have found that for many data sets the cell-cycle effect will be effectively removed after 1 or 2 iterations.
The ‘nboot’ value corresponds to the number of bootstrap repetitions carried out to test the significance of the components. Please refer to the methods section original manuscript for a detailed description of the estimation process. The default value is 200 and we have found this work effectively for most data sets.
The ‘ntop’ parameter determines the number of principal components which are to be tested as cell-cycle effects upon each iteration. We have found that the default value of 10 works effectively for most data sets. However, for extremely diverse data sets within which there are expected to be many sources of variation this value could be raised so that all elements of the cell-cycle effect can be identified and removed.

Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing data - the ccRemover package

Martin Barron, Jun Li

2017-08-17