%\VignetteIndexEntry{Using Oncotree for the ovarian cancer CGH data} %\VignetteDepends{lattice,boot} \documentclass[reqno]{amsart} \usepackage[margin=0.8in]{geometry} \usepackage{graphicx} \title{Using the Oncotree package} \author[A. Szabo]{Aniko Szabo, Kenneth Boucher and Lisa Pappas} \SweaveOpts{echo=TRUE} \begin{document} \setkeys{Gin}{width=0.5\textwidth} \begin{abstract} This paper shows a short example of building and exploring oncogenetic trees using the Oncotree package. A detailed description of the theory of oncogenetic trees can be found in \begin{itemize} \item Desper, R.; Jiang, F.; Kallioniemi, O.; Moch, H.; Papadimitriou, C. and Sch\"{a}ffer, A. A. ``Inferring tree models for oncogenesis from comparative genome hybridization data.'' \emph{Journal of Computational Biology}, 1999, 6, 37-51. \item Szabo, A. and Boucher, K. ``Estimating an oncogenetic tree when false negatives and positives are present.'' \emph{Mathematical Biosciences}, 2002, 176, 219-236. \item Szabo, A. and Boucher, K. ``Oncogenetic trees'' in \emph{Handbook of cancer models with applications} Tan, Hanin (ed.) World Scientific, 2008. \end{itemize} A short introduction is given in doc/Oncotree.pdf. \end{abstract} \maketitle We start by loading a dataset. The package contains an example dataset: <>= options(width=100) ps.options(colormodel="rgb") <>= library(Oncotree) data(ov.cgh) str(ov.cgh) @ Based on these data, we construct the oncogenetic tree using the default $\ell_2$-distance error function to estimate the false-positive and false-negative error rates. <>= ov.tree <- oncotree.fit(ov.cgh) @ The fitted tree can be examined several ways: \texttt{print}ing it produces a quick summary, but the result of \texttt{plot}ting is easier to interpret (the plots are shown in Figure~\ref{F:treeplot}). <>= ov.tree plot(ov.tree, edge.weights="est") <>= pstree.oncotree(ov.tree, edge.weights="est", shape="oval") @ \begin{figure} \begin{minipage}{0.5\textwidth} \setkeys{Gin}{width=\textwidth} <>= plot(ov.tree, edge.weights="est") @ \end{minipage} \begin{minipage}{0.45\textwidth} \includegraphics[width=\textwidth]{PlotForVignette} \end{minipage} \caption{Fitted oncogenetic tree for the \texttt{ov.cgh} data set.}\label{F:treeplot} \end{figure} \setkeys{Gin}{width=0.5\textwidth} We can compare the observed and fitted marginal occurrence frequencies of the mutations (the distance between these two was minimized for the error-rate estimation). The plot is shown in Figure~\ref{F:marg}. <>= print(obs <- colMeans(ov.tree$data)) print(est <- marginal.distr(ov.tree, with.errors=TRUE)) #plot is in Figure 2 barplot(rbind(obs[-1],est[-1]), beside=T, legend.text=c("Observed","Fitted"), main="Marginal frequencies of occurrence") @ \begin{figure} <>= barplot(rbind(obs[-1],est[-1]), beside=T, legend.text=c("Observed","Fitted"), main="Marginal frequencies of occurrence") @ \caption{Observed and fitted frequencies of occurrence of each event.}\label{F:marg} \end{figure} In addition to the marginal frequencies, it is possible to estimate the entire joint distribution generated by the tree: <>= dd <- distribution.oncotree(ov.tree, with.errors=TRUE) head(dd) @ Using the overall joint distribution, it is straightforward to obtain marginal joint distributions (2- or higher way) if needed (the plot is shown in Figure~\ref{F:marg2}). <>= #estimated probabilities of 2 events print(est2way <- t(data.matrix(dd[2:8])) %*% diag(dd$Prob) %*% data.matrix(dd[2:8])) #observed probabilities of 2 events print(obs2way <- t(ov.tree$data[,-1]) %*% ov.tree$data[,-1]/nrow(ov.tree$data)) oe.diff <- obs2way-est2way oe.diff[lower.tri(oe.diff)] <- NA #clear half of symmetric matrix for plotting require(lattice) #the plot is in Figure 3 levelplot(oe.diff, xlab="", ylab="", scales=list(x=list(alternating=2), tck=0), main="Observed - Expected probabilities of co-occurrence of events") @ \begin{figure} <>= print(levelplot(oe.diff, xlab="", ylab="", scales=list(x=list(alternating=2), tck=0), main="Observed - Expected probabilities of co-occurrence of events")) @ \caption{Goodness-of-fit plot: difference between observed and expected probabilities of two events being observed.}\label{F:marg2} \end{figure} Another way to evaluate goodness-of-fit is through bootstrap resampling of the data. Two approaches are implemented: a parametric bootstrap that assumes that the model is correct and a non-parametric bootstrap. The plot is shown in Figure~\ref{F:boot}. <>= set.seed(43636) ov.boot <- bootstrap.oncotree(ov.tree, type="nonparam", R=1000) ov.boot opar <- par(mfrow=c(3,2)) #the plot is in Figure 4 plot(ov.boot, minfreq=45) par(opar) @ \setkeys{Gin}{width=0.8\textwidth} \begin{figure} <>= opar <- par(mfrow=c(3,2)) #the plot is in Figure 4 plot(ov.boot, minfreq=45, cex=1) par(opar) @ \caption{The most frequently occurring bootstrap trees.} \label{F:boot} \end{figure} \setkeys{Gin}{width=0.5\textwidth} The non-parametric bootstrap gives an estimate of the reconstruction confidence: the original tree was obtained 83 times out of 1000 resamples, so the estimated confidence is 8.3\%. We can look at the frequency of edge occurrences in the bootstrapped trees: <>= ov.boot$parent.freq @ It is clear that some edges are really stable: Root $\rightarrow$ 8q+, 8q+ $\rightarrow$ 3q+, root $\rightarrow$ 1q+, all with confidence $>80\%$, while other edges are less stable (for example, 8p- is the child of 8q+ about as often as of 5q-). \end{document}