This manual explains how to run the FateID algorithm on sample data. FateID is a method for the quantification of cell fate bias in single cell transcriptome datasets comprising different cell types that emerge from a common progenitor (Herman, Sagar, and Grün 2018). The progenitor populations are expected to be part of the dataset and the FateID algorithm was designed to learn a pre-existing bias of each progenitor cell to one or multiple alternative terminal fates. The strategy of the algorithm is to apply an iterative random forest classification (Breiman 2001) in order to quantify fate bias in increasingly naive progenitors using cells that have been classified in previous iterations as training set.
FateID can be directly installed from CRAN:
install.packages("FateID")
After installation the FateID package has to be loaded:
library(FateID)
In the following sections we describe the application of FateID to sample data. We do not explain all input and return arguments of the FateID functions in detail, since extensive descriptions are available in the man pages of all functions.
The algorithm requires an expression data frame as input with cells as columns and genes as features. Column names are expected to correspond to cell IDs and row names are expected to correspond to gene IDs.
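As a minimal illustration of the expected layout, the following base R sketch builds a toy expression data frame. The gene IDs, cell IDs, and values are made up for illustration and are not part of the sample data.

```r
# Toy expression table: genes in rows, cells in columns (all IDs are hypothetical).
set.seed(1)
genes <- c("GeneA__chr1", "GeneB__chr2", "GeneC__chr3")
cells <- c("cell_1", "cell_2", "cell_3", "cell_4")
toy <- as.data.frame(matrix(rexp(length(genes) * length(cells)),
                            nrow = length(genes),
                            dimnames = list(genes, cells)))

# Basic sanity checks on the layout described above.
stopifnot(is.data.frame(toy),
          !anyDuplicated(colnames(toy)),   # unique cell IDs
          !anyDuplicated(rownames(toy)))   # unique gene IDs
dim(toy)  # genes x cells
```

Checking for unique row and column names up front avoids ambiguous lookups when cells or genes are later addressed by ID.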
Example data are available from the FateID package. The dataset has to be imported in order to reproduce the analysis in this reference manual:
data(intestine)
This dataset contains transcript counts of mouse intestinal epithelial cells positive for an Lgr5-lineage reporter after 5 days of lineage tracing (Grün et al. 2016), i.e. these cells are 5-day-old progeny of Lgr5-positive intestinal stem cells:
x <- intestine$x
head(x[,1:5])
##                          I5d_3     I5d_4     I5d_6    I5d_8      I5d_9
## 2200002D01Rik__chr7   3.751327 0.3486954 0.9355945 4.791691  2.4250484
## 2210407C18Rik__chr11  0.100000 0.5983680 0.3076682 0.100000  1.6469830
## 2310014L17Rik__chr7   0.100000 0.1000000 0.3076682 0.100000  0.1000000
## Acsl5__chr19          1.132947 0.5983680 0.5161523 1.651566  0.8719749
## Actb__chr5            4.813430 2.1173627 3.9683714 2.431937  4.7778317
## Adh1__chr3           10.256359 7.1820558 0.3076682 7.179934 10.3793249
Moreover, FateID requires a partitioning of cells, which can be generated by any clustering method. For example, the RaceID3 algorithm can be used to identify cell clusters, and a partitioning generated by this method is provided as part of the package. The partitioning has to be provided as a vector with integer values and component names corresponding to the column names of the expression data frame:
y <- intestine$y
head(y)
##  I5d_3  I5d_4  I5d_6  I5d_8  I5d_9 I5d_10
##      2      3      4      2      5      4
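Before running the analysis it can help to verify that a partition and an expression table refer to the same cells. The sketch below uses toy data and a hypothetical helper, check_partition, which is not part of FateID.

```r
# Toy data (hypothetical IDs): 3 genes x 4 cells and a matching partition.
toy_x <- data.frame(cell_1 = c(1, 0, 2), cell_2 = c(0, 3, 1),
                    cell_3 = c(4, 1, 0), cell_4 = c(2, 2, 2),
                    row.names = c("GeneA", "GeneB", "GeneC"))
toy_y <- c(cell_1 = 1L, cell_2 = 2L, cell_3 = 2L, cell_4 = 3L)

# Hypothetical helper, not part of FateID: TRUE if the partition is usable.
check_partition <- function(x, y) {
  is.numeric(y) &&
    all(y == round(y)) &&             # integer-valued cluster numbers
    !is.null(names(y)) &&
    setequal(names(y), colnames(x))   # same cells in both objects
}

check_partition(toy_x, toy_y)
toy_y <- toy_y[colnames(toy_x)]  # reorder to the column order of the table
table(toy_y)                     # number of cells per cluster
```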
Cluster analysis can reveal the presence of mature cell types in the dataset, where cell types of distinct lineages correspond to different clusters (i.e. different numbers of the partition). In the example data, cluster 6 comprises enterocytes, marked by high expression of the Alpi gene, while cluster 9 represents mature Paneth cells (high expression of Defa24) and cluster 13 mature goblet cells (high expression of Clca3). Other rare cell types are present only in very low numbers and were therefore excluded from the analysis.
As further input for FateID, the endpoints of the differentiation trajectories, i.e. the most mature stages of all distinct cell lineages in the dataset, have to be defined by a vector of integer numbers representing the corresponding clusters in the partition:
tar <- c(6,9,13)
If a partitioning into cell types and states from a prior clustering analysis is not available, FateID can derive a partitioning based on marker gene information. For this strategy, a list of marker gene IDs is needed. Each component of this list contains one or more marker genes of a distinct lineage:
FMarker <- list(c("Defa20__chr8","Defa24__chr8"), "Clca3__chr3", "Alpi__chr1")
xf <- getPart(x,FMarker,fthr=NULL,n=5)
head(xf$part)
##  I5d_3  I5d_4  I5d_6  I5d_8  I5d_9 I5d_10
##      1      1      1      1      1      1
head(xf$tar)
## [1] 2 3 4
tar <- xf$tar
y <- xf$part
For each lineage, the getPart function extracts the n cells with the highest expression of the lineage markers and defines an expression threshold as the average expression across these cells. For this inference the expression level is aggregated across all markers of the lineage. The target cluster of a lineage is given by the ensemble of cells with aggregated marker gene expression higher than this threshold. Alternatively, a vector with threshold expression values for all lineages can be provided as input argument fthr. The target cluster numbers reflect the order of the components of the FMarker list, starting at 2. Cluster number 1 comprises all remaining cells that do not exhibit marker gene expression beyond the mean expression of the top n cells expressing a marker or the levels in fthr.
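The thresholding idea described above can be sketched in base R as follows. This is an illustration on toy data, not the getPart implementation: the function name sketch_part, the use of sums for aggregation, and the handling of cells that pass several thresholds are assumptions.

```r
# Hypothetical sketch of the thresholding idea; NOT the getPart implementation.
sketch_part <- function(x, markers, n = 5) {
  part <- setNames(rep(1L, ncol(x)), colnames(x))    # cluster 1: all remaining cells
  for (i in seq_along(markers)) {
    agg <- colSums(x[markers[[i]], , drop = FALSE])  # aggregate over lineage markers
    thr <- mean(sort(agg, decreasing = TRUE)[seq_len(min(n, length(agg)))])
    # Assumption: a cell passing several thresholds keeps its first assignment.
    part[agg > thr & part == 1L] <- i + 1L           # target clusters start at 2
  }
  list(part = part, tar = seq_along(markers) + 1L)
}

# Toy data: 3 marker genes x 8 cells (values and IDs are made up).
toy <- data.frame(matrix(c(9, 8, 1, 0, 0, 1, 0, 0,
                           0, 1, 0, 9, 7, 0, 1, 0,
                           0, 0, 1, 0, 1, 0, 8, 9),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("M1", "M2", "M3"), paste0("c", 1:8))))
res <- sketch_part(toy, list("M1", "M2", "M3"), n = 2)
res$part
res$tar
```

Because the threshold is the mean over the top n cells, only the cells above that mean end up in a target cluster, so the target clusters are deliberately small and conservative.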
Once target clusters have been defined, FateID has the option to reclassify all remaining cells using the cells within the target clusters as input.
rc <- reclassify(x, y, tar, clthr=.75, nbfactor=5, use.dist=FALSE, seed=12345, nbtree=NULL, q=0.9)
y <- rc$part
The function returns a partition with the new assignments after reclassification, which can replace the previous partition based on the original target clusters. The purpose of this step is to identify all cells with a pronounced bias towards one of the fates. This step is optional but recommended to obtain larger training sets for the random forests. This and other functions can be executed on expression data prior to or after feature selection. In the sample data, the data frame x contains only genes with variability exceeding a background level of combined technical and biological noise as inferred by RaceID3. Alternatively, the full data frame containing all genes can be used as input.
v <- intestine$v
rc <- reclassify(v, y, tar, clthr=.75, nbfactor=5, use.dist=FALSE, seed=12345, nbtree=NULL, q=0.9)
y <- rc$part
The reclassify function also performs a feature selection based on importance sampling, i.e. all features with an importance larger than the q-quantile of the importance distribution for a given class are retained. The reduced expression table is returned by the function and can replace the original input expression data frame:
x <- rc$xf
Utilizing this function is recommended if the input data have not been subject to any other feature selection method.
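The quantile rule used for this feature selection can be illustrated with a small base R sketch. The importance values below are made up, and retaining a feature if it passes the threshold for at least one class is an assumption about how the per-class results are combined.

```r
# Hypothetical importance matrix: features in rows, classes (target clusters) in columns.
imp <- matrix(c(0.1, 0.9, 0.2, 0.3, 0.05,
                0.8, 0.1, 0.1, 0.2, 0.30),
              ncol = 2,
              dimnames = list(paste0("gene", 1:5), c("t6", "t9")))
q <- 0.75

# Per class, flag features whose importance exceeds the q-quantile for that class.
keep_by_class <- sweep(imp, 2, apply(imp, 2, quantile, probs = q), ">")

# Assumption: a feature is retained if it passes the threshold for at least one class.
selected <- rownames(imp)[rowSums(keep_by_class) > 0]
selected
```

Larger values of q retain fewer features per class, so q trades off the size of the feature set against its specificity.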
Feature selection can also be performed utilizing a differential gene expression analysis.
x <- getFeat(v,y,tar,fpv=0.01)
This function compares gene expression within cells of a target cluster to the ensemble of all remaining cells and identifies genes that are significantly up-regulated in a target cluster with a p-value lower than fpv. The function returns a reduced expression data frame, which can be used for the subsequent analysis. In general, the reclassify function is preferable for feature selection, since it reflects the information used for the random forest classification.
The core function of FateID computes the fate bias for each cell in the dataset, excluding cells within the target clusters. These cells are assigned to the lineage representing the respective target cluster with a probability of one, and this probability does not change during inference of the fate bias of all other cells:
tar <- c(6,9,13)
x <- intestine$x
y <- intestine$y
fb <- fateBias(x, y, tar, z=NULL, minnr=5, minnrh=10, adapt=TRUE, confidence=0.75, nbfactor=5, use.dist=FALSE, seed=12345, nbtree=NULL)
Apart from the (feature-selected) expression data frame x, the partition y, and the vector of target clusters tar, the fateBias function takes further arguments as input. The optional argument z is a cell-to-cell distance matrix utilized to identify non-classified cells in the immediate neighborhood of all cells that have been classified as one of the target clusters in the previous iteration. By default this distance matrix is computed as z=1-cor(x), but if other distance measures are preferred, a distance matrix can be provided by this argument.
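The default distance can be reproduced in base R on a toy matrix, as sketched below. The data are made up, and the Spearman-based alternative is only an example of a user-supplied distance, not a recommendation from the package.

```r
# Toy expression matrix: 5 genes x 4 cells (made-up values).
set.seed(42)
toy <- matrix(rexp(20), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

# cor() correlates columns, so this yields a cell-by-cell matrix.
z_default <- 1 - cor(toy)

# An alternative measure (assumption for illustration): Spearman-based distance.
z_spearman <- 1 - cor(toy, method = "spearman")

round(z_default, 2)
```

Whatever measure is chosen, the matrix should be square with row and column names equal to the cell IDs, so that it can be indexed in the same way as the default.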
The FateID algorithm implemented by the fateBias function performs an iterative calculation. It starts with a set of cells representing each target cluster. For each target cluster, the minnr neighboring cells with the shortest median distance to all cells in the target cluster are extracted. The ensemble of the neighboring cells of all target clusters represents the test set of the next iteration. The minnr parameter thus controls the step size of the algorithm. In each iteration, minnr cells times the number of target clusters are classified and can contribute to the training set in the next iteration.
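The neighbor selection step can be sketched as follows for a single target cluster. The helper next_test_cells is hypothetical and not part of FateID, and a Euclidean distance on one-dimensional toy values stands in for the distance matrix z.

```r
# Hypothetical helper, not part of FateID: pick the minnr unclassified cells
# with the smallest median distance to the cells of one target cluster.
next_test_cells <- function(z, target_cells, classified, minnr = 5) {
  candidates <- setdiff(colnames(z), classified)
  med <- apply(z[target_cells, candidates, drop = FALSE], 2, median)
  names(sort(med))[seq_len(min(minnr, length(med)))]
}

# Toy 1-D "expression" so that distances are easy to follow (made-up values).
pos <- c(a = 0, b = 0.1, c = 1, d = 2, e = 5, f = 9)
z <- as.matrix(dist(pos))        # stand-in for the distance matrix 1 - cor(x)
next_test_cells(z, target_cells = c("a", "b"), classified = c("a", "b"), minnr = 2)
```

Using the median rather than the minimum distance makes the selection less sensitive to a single outlying cell in the target cluster.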
The training set of this iteration comprises all cells assigned to one of the target clusters, and the response vector is given by the partition of these cells. The classification of the test set is based on the random forest votes: if a cell receives significantly more random forest votes for one target cluster than for all other clusters (based on sampling statistics with a p-value threshold of 0.05), it is assigned to this target cluster and contributes to the training set for the next iteration. Cells without a significant fate bias towards any one of the target clusters are not incorporated into the training set for the next iteration. However, the fraction of votes, which can be interpreted as a fate probability, is recorded and stored for all cells.
Another important parameter controls which cells contribute to the training set for a given iteration. At most minnrh cells from each target cluster contribute to the training set. These cells are selected as the minnrh cells with the shortest distance to any cell within the current training set. This parameter controls the gene expression horizon on the differentiation trajectory taken into account for the classification of the test set. If minnrh is set to Inf, all previously classified cells with a significant fate bias for one of the target clusters contribute to the training set. However, if gene expression changes follow complex dynamics along a differentiation trajectory, it can become detrimental to include very distant cells expressing maturation markers that are not expressed during the earliest stages of differentiation. It is generally advised to confine the expression horizon to smaller values in order to increase the specificity of the algorithm. At the same time, the training set should still be large enough to warrant a confident classification. As a rule of thumb, the minnrh parameter should be set to 20 or larger values, depending on the size and coverage of the dataset. If the input dataset is large and many cells are available for all lineages, covering the entire differentiation trajectory, this parameter can be increased. We recommend testing the robustness of the results to changes in minnrh. The step size minnr should be selected based on similar considerations. However, it is generally recommended to keep this number lower in order to avoid misclassification due to insufficient resolution. For the majority of datasets with several hundreds of cells we use minnr=5, as in the call above.
FateID also permits a dynamic test set size, where the minnr parameter is adapted separately for each target cluster based on the classification success in the previous iteration. If adapt=TRUE, the number of successfully classified cells is determined in each iteration, i.e. the number of cells with a minimum fraction of votes given by the confidence parameter for the target cluster which gave rise to the inclusion of the cell in the test set. Weights are then derived by dividing this number by the maximum across all target clusters after adding a pseudocount of 1. The test set size minnr is rescaled for each cluster by the respective weight in the next iteration.
This leads to decreasing test set sizes, and hence slower classification, if the previous classification success in the local neighbourhood has been low. Such an adaptive scheme is important if the number of cells on different branches exhibits large differences. If classification on a short branch reaches the naïve compartment, where classification success can be expected to decrease, the progression on this branch slows down in future iterations, while classification of more mature stages on a highly populated branch still progresses at full speed.
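The weighting scheme can be worked through with a small numerical sketch. The counts below are made up, and rounding the rescaled size up with a lower bound of one cell is an assumption, since the rounding rule is not specified here.

```r
# Number of confidently classified cells per target cluster in the last iteration
# (made-up counts; names follow the "t" + cluster number convention).
n_success <- c(t6 = 5, t9 = 1, t13 = 3)
minnr <- 5

# Pseudocount of 1, then divide by the maximum across target clusters.
w <- (n_success + 1) / max(n_success + 1)

# Assumption: the rescaled size is rounded up and kept at 1 or more.
minnr_next <- pmax(1, ceiling(minnr * w))
minnr_next
```

In this example the cluster with the highest classification success keeps the full step size, while the cluster with only one confidently classified cell advances with a much smaller test set.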
As an alternative approach, the FateID algorithm offers classification based on distances to all other cells. When use.dist is set to TRUE, the distance matrix z (by default 1-cor(x)) is interpreted as the feature matrix. The remaining arguments are control parameters of the random forest algorithm and usually do not have to be adjusted.
As outlined in the man pages, the fateBias function returns a list of five components. The votes component is a data frame of random forest votes of all cells for each of the target clusters. The column names are given by a concatenation of a t with the number of the target cluster. The row names are given by cell IDs. The probs component has the same structure, but the votes for each cell are normalized to one in order to represent fate probabilities. The tr component is a list of vectors. Each vector contains all cell IDs of cells with a significant fate bias towards the corresponding target cluster. Significant fate bias means significantly more votes for a given cluster than for any other cluster, based on sampling statistics with a p-value below 0.05. The fourth component is a vector with all cell IDs ordered by the random forest iteration in which they have been classified. The last component is a list of random forest objects, produced by the randomForest function from the randomForest package, for all iterations.
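The relation between votes and fate probabilities can be illustrated on toy data. The vote counts below are made up, and the objects are built by hand in base R rather than taken from a fateBias result.

```r
# Made-up vote counts for three cells and target clusters 6, 9 and 13.
tar <- c(6, 9, 13)
votes <- data.frame(matrix(c(400,  50,  50,
                             100, 300, 100,
                             200, 200, 100),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("cell_1", "cell_2", "cell_3"),
                                           paste0("t", tar))))

# Normalize each row to one to obtain fate probabilities.
probs <- votes / rowSums(votes)

# Most likely fate per cell (ties resolved by column order).
colnames(probs)[max.col(as.matrix(probs), ties.method = "first")]
```

Note that the most likely fate is not the same as a significant fate bias: the third toy cell has equal support for two fates and would not be expected to pass a significance test.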
Various dimensional reduction methods are commonly used for single cell transcriptome analysis in order to visually inspect the cell population structure. The FateID package computes a number of dimensional reduction representations to enable visualization of the fate bias and pseudo-temporal ordering by principal curve computation.
dr <- compdr(x, z=NULL, m=c("tsne","cmd","umap"), k=2, lle.n=30, tsne.perplexity=30, seed=12345)
The first two input parameters, x and z, have the same meaning as in the fateBias function. The parameter k indicates the dimensions for which the dimensional reduction representations are computed. Typically, one wants to inspect data visually in two or three dimensions. However, it is also possible to compute dimensional reductions to more than three dimensions and to inspect the data after projecting onto a subset of dimensions. The remaining parameters are the main control parameters of the various algorithms used for dimensional reduction. The function computes a t-SNE map (Maaten and Hinton 2008) using the Rtsne package, classical multi-dimensional scaling using cmdscale from the stats package, and locally linear embedding using the lle function from the lle package. UMAP representations are computed with the umap package. To speed up the computation, only a subset of the dimensional reduction representations can be selected via the input parameter m.
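Of the methods listed above, classical multi-dimensional scaling is available in base R and can be tried directly on a toy matrix. The sketch below uses made-up data and only illustrates the role of the distance matrix and of k; it is not a substitute for compdr.

```r
# Toy expression matrix: 20 genes x 10 cells (made-up values).
set.seed(7)
toy <- matrix(rexp(200), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("cell", 1:10)))

d <- as.dist(1 - cor(toy))     # same default distance as described for z
coords <- cmdscale(d, k = 2)   # k: number of dimensions of the embedding
head(coords)
```

The result has one row per cell and k columns, which is the general shape of a low-dimensional embedding used for plotting.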
All results can be plotted by the plotFateMap function.
The dimensional reduction representation with a highlighting of the partition can be plotted for any of the dimensional reduction algorithms in any of the dimensions computed by compdr. For example, a umap can be plotted in two dimensions:
Plotting in three dimensions opens an interactive RGL device to allow rotation of the plot and zooming in and out:
The fate bias can now be highlighted in the dimensional reduction representation by providing the name of the target cluster as an additional argument (concatenation of a t and the cluster number, e.g. t6) and the