\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Discrimination and Calibration}
\SweaveOpts{keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

% I had been putting figures in the figures/ directory, but the standard
%  R build script does not copy it and then R CMD check fails
\SweaveOpts{prefix.string=surv,width=6,height=4}
\newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth]
                        {surv-#1.pdf}}

\newcommand{\bhat}{\hat \beta}  %define "bhat" to mean "beta hat"
\newcommand{\Mhat}{\widehat M}  %define "Mhat" to mean M-hat
\newcommand{\zbar}{\bar Z}
\newcommand{\lhat}{\hat \Lambda}
\newcommand{\Ybar}{\overline{Y}}
\newcommand{\Nbar}{\overline{N}}
\newcommand{\Vbar}{\overline{V}}
\newcommand{\yhat}{\hat y}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\co}[1]{\texttt{#1}}
\newcommand{\twid}{\ensuremath{\sim}}
\newcommand{\Lhat}{\hat\Lambda}

\setkeys{Gin}{width=\textwidth}
<<echo=FALSE>>=
options(continue=" ", width=70)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=10) #text in graph about the same as regular text
options(contrasts=c("contr.treatment", "contr.poly")) #ensure default
library("survival")
palette(c("#000000", "#D95F02", "#1B9E77", "#7570B3",
          "#E7298A", "#66A61E"))
@

\title{Discrimination and Calibration}
\author{Terry Therneau}

\begin{document}
\maketitle
\clearpage

\section{Introduction}
This vignette is in progress and not yet complete.

\section{Discrimination}
\subsection{Pseudo $R^2$ measures}
There have been many attempts to define an overall ``goodness of fit''
criterion for the Cox model that would parallel the widely used $R^2$ of
linear models.
A direct analog is hampered by the issues of censoring and scale.
Censoring is the major technical impediment: how do we define the error
for an observation known only to be $> t_i$?
A potentially larger issue is that of scale: if we have two $(t, \hat t)$
pairs of (3 m, 6 m) and (9 yr, 10 yr), which one represents the greater
error?  On an absolute scale the second is the larger difference, but in a
clinical study the first may be more important.
Simon xxx has a nice discussion of the issue.

Rather than a direct extension of $R^2$, subject to the issues raised
above, the most common approach has been based on pseudo-$R^2$ measures.
These rewrite the linear model statistic in another way, and then evaluate
the alternate formula.
\begin{align}
  R^2 &= 1 - \frac{\sum (y_i - \hat y_i)^2}{\sum (y_i - \overline y)^2}
     \nonumber \\
      &= 1- \left(\frac{L(\mbox{intercept})}{L(\mbox{full})}\right)^{2/n}
     \label{coxsnell} \\
      &= \frac{\mbox{var} (\hat y)} {\mbox{var} (\hat y) + \sigma^2}
     \label{kent}
\end{align}
Equation \eqref{coxsnell} is the Cox and Snell formula; for a Cox model,
replace the linear model likelihood ($L$) with the partial likelihood of
the null and fitted models.
This gives the measure proposed by Nagelkerke \cite{Nagelkerke84}, which
was part of the standard printout of \code{coxph} for many years.
It has, however, been recognized as overly sensitive to censoring.
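For a fitted \code{coxph} model the computation is direct, since the
fitted object contains both the null and final partial log-likelihoods.
The lines below are only a sketch, with the built-in \code{lung} data set
used purely for illustration; the last line rescales by the maximum
attainable value, which is Nagelkerke's modification.
\begin{verbatim}
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
ll  <- fit$loglik          # null and fitted partial log-likelihoods
n   <- fit$n
cs  <- 1 - exp(-2*(ll[2] - ll[1])/n)   # Cox and Snell form
cs / (1 - exp(2*ll[1]/n))              # Nagelkerke's rescaled value
\end{verbatim}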
The measure of Kent and O'Quigley \cite{Kent98} is based on \eqref{kent}.
Replace $\hat y$ with the Cox model linear predictor $\eta = X \hat \beta$
and $\sigma^2$ by $\pi^2/6$.
The latter is based on an extreme value distribution, and the equivalence
of the Cox model to a transformation model.

Royston and Sauerbrei replace the risk scores $\eta$ with a normal-scores
transform
\begin{align*}
  s_i &= \Phi^{-1}\left( \frac{r_i - 3/8}{n + 1/4} \right) \\
  r_i &= {\rm rank}(\eta_i)
\end{align*}
then re-fit the Cox model using $s$ as the single covariate.
Since ${\rm var}(s) =1$ by design, the variance of the normalized risk
score is captured by the squared coefficient $\beta^2$ from the re-fit,
while the Cox model estimate of ${\rm var}(\hat\beta)$ supplies a variance
estimate for the measure.
They then further define a measure $D = \beta \sqrt{8/\pi}$.
The rationale is that $E(s \mid s>0) = \sqrt{2/\pi}$, so $D$ represents the
log hazard ratio for a random draw from the top half of the $s$
distribution versus a random draw from the bottom half.
This value is then `comparable' to the coefficient for a binary covariate
such as treatment.
(We prefer to use the 25th and 75th quantiles of the risk score,
untransformed, for this purpose, i.e., the ``middle of the top half''
versus the ``middle of the bottom half'', rather than mean(bottom) vs.\
mean(top).)

An issue with the Royston and Sauerbrei approach is that although the risk
scores from a fitted Cox model will sometimes be approximately symmetric,
there is no reason to assume that this should be so.
In medical data the risks are often right skewed: it is easier to have an
extremely high risk of death than an extremely low one (there are no
immortals).
For the well known PBC data set, for instance, whose risk score has been
validated in several independent studies, the median-centered risk scores
range from $-2$ to 5.4.
Remember that even in the classic linear model, $\hat \beta$ and the
residuals are assumed to be Gaussian, but no such assumption is needed for
$y$ or $\hat y$; in fact such a distribution is uncommon.
See ``Health, Normality and the Ghost of Gauss'' \cite{Elveback70} for a
good discussion of this topic.

G\"{o}nen and Heller \cite{Goen05} create a pseudo-concordance that also
uses only the risk scores.
It is based on the fact that, if proportional hazards holds, then the times
to event $t_i$ and $t_j$ for two subjects satisfy
\begin{equation}
  P(t_i < t_j) = \frac{1}{1 + \exp(\eta_j - \eta_i)}
\end{equation}
They then propose the estimate
\begin{equation}
  C_{GH} = \frac{2}{n(n-1)} \sum_{\eta_i > \eta_j}
     \frac{1}{1 + \exp(\eta_j - \eta_i)}
\end{equation}
which can be translated from the (0,1) concordance scale to a $(-1,1)$
range via $R^2_{GH} = 2C_{GH} -1$, if desired.
If there is no relationship between $X$ and $t$ then $\beta=0$ and
$\eta=0$, leading to a concordance of 1/2.
A claimed advantage of the GH measure over the usual concordance is that it
is not affected by censoring.
A primary disadvantage is that it is based on the assumption that the model
is completely correct; the assumption that proportional hazards holds for
all time, even well beyond any observed data, is particularly unlikely.
All of the above measures are computed by the \code{royston} command.
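The normal-scores transform and $D$ can be computed in a few lines; the
code below is a sketch of the definition rather than the production code,
again using the \code{lung} data only as an example, and the final line
simply calls the \code{royston} function mentioned above.
\begin{verbatim}
fit   <- coxph(Surv(time, status) ~ age + sex, data = lung)
eta   <- fit$linear.predictors
s     <- qnorm((rank(eta) - 3/8) / (length(eta) + 1/4)) # normal scores
refit <- coxph(Surv(time, status) ~ s, data = lung)
D     <- coef(refit) * sqrt(8/pi)   # the D of Royston and Sauerbrei

royston(fit)    # the packaged computation of the measures above
\end{verbatim}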
\subsubsection*{Evaluation of new data}
When applying these measures to new data, for an existing model, it is
necessary to first compute a scaling factor.
That is, compute the vector of risk scores $\eta = X\beta$ using the
coefficients of the prior model and the covariates from the new data, then
fit a new Cox model using the response from the new data with $\eta$ as the
only predictor.
The rescaled risk scores $\hat\beta \eta$ are then used in the formulas.

To see why rescaling is necessary, assume a case where the validation data
set was an exact copy of the development data set, but with an error: at
some point the survival time column had been randomly re-ordered, without
perturbing the other columns.
The survival time is now unrelated to the linear predictor $\eta$, yet the
values of the pseudo R-squared and C statistic would be unchanged from the
original fit, since $\eta$ is unchanged.
In this extreme case the rescaling fit would have a coefficient of 0,
leading to the appropriate conclusions.
Rescaling is done automatically by the \code{royston} command when the call
includes a \code{newdata} argument.

\subsection{Concordance}
The most commonly used measure of discrimination is the concordance.
[merge with concordance vignette]

\section{Calibration}

\section{Computing IPC weights}
\subsection{Ties}
The computation of inverse probability of censoring (IPC) weights would
appear to be a straightforward task, but it has some tricky edge cases.
The simple algorithm, used by many, is to do an ordinary Kaplan-Meier with
censoring as the endpoint, what is sometimes called the reverse KM, and
then simply read numbers off the resulting curve, i.e., use the following
bit of code as the kernel.
\begin{verbatim}
cfit  <- survfit(Surv(time, 1-status) ~ 1, data=mydata)
cprob <- summary(cfit, times=mydata$time)$surv
\end{verbatim}
There are two problems with this.
The first is that the Kaplan-Meier creates a right continuous estimate,
while IPC weights need to be left continuous.
That is, if we want a censoring weight at time $t$, it should be based on
the probability of remaining uncensored just \emph{before} time $t$, namely
$C(t-)$; the \code{summary} function returns $C(t)$.
The second is that if the starting data set includes any time points with
both an event and a censoring at that time point, the observation with an
event is \emph{not} at risk for censoring at that time, but in a reverse
KM it will be counted as such.
Complicating things even further is the issue of round-off error, as
discussed in the vignette on tied times.

A general solution to the above is to shift all of the censoring times to
the right by a small amount $\epsilon$ before using \code{survfit} to
compute the censoring curve.
This resolves the issue of tied death and censoring times.
When time values are then looked up on the resulting curve, the
right-continuous value that is returned at $t$ is effectively the desired
left-continuous $C(t-)$ of the unshifted curve.
Care must be taken to choose $\epsilon$ sufficiently small that the shifted
values do not overstep another unique time point, and sufficiently large
that the shift is not lost to roundoff error (in floating point arithmetic
\code{1000 + 1e-16 == 1000}).
Then, at any chosen cutoff $\tau$, the IPC weights will be 0 for any
observation censored before $\tau$, and $1/C(\min(t_i, \tau))$ otherwise,
where $t_i$ is the follow-up time for observation $i$.

There are two fairly simple checks on any implementation of IPC weights.
The first is to compute weights using the maximum uncensored time as the
cutoff $\tau$; the weighted empirical CDF of $t$ should then exactly agree
with the Kaplan-Meier.
The second is that the sum of the weights must equal $n$, the number of
observations, for any cutoff $\tau$.
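In code, the recipe and the second check might look like the sketch below.
It is only an illustration: the built-in \code{aml} data set stands in for
real data, and $\epsilon$ is chosen as half of the smallest gap between
unique times.
\begin{verbatim}
adata <- aml                   # stand-in data set, for illustration only
eps   <- min(diff(sort(unique(adata$time))))/2        # safe shift
adata$ctime <- adata$time + eps*(adata$status == 0)   # shift censorings
cfit  <- survfit(Surv(ctime, 1 - status) ~ 1, data = adata)

# right-continuous lookup on the shifted curve = the desired C(t-)
cminus <- stepfun(cfit$time, c(1, cfit$surv))

tau <- max(adata$time[adata$status == 1])  # max uncensored time as cutoff
wt  <- ifelse(adata$status == 0 & adata$time < tau, 0,
              1/cminus(pmin(adata$time, tau)))
sum(wt)                        # second check: should equal nrow(adata)
\end{verbatim}
The first check would compare the weighted empirical CDF of the follow-up
times, using these weights, with the ordinary Kaplan-Meier estimate.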
Ignoring the issue of tied event and censoring times creates weights that
are too small, while using the right-continuous rather than the
left-continuous $C$ creates weights that are too large.

\subsection{Counting process data}
Modification of the redistribute-to-the-right (RTTR) algorithm for counting
process data is straightforward.
First, only actual censorings cause a redistribution.
For example, assume we have a data set with time-dependent covariates, and
a subject with three (time1, time2, status) intervals of (0, 5, 0),
(5, 18, 0) and (18, 25, 0).
Only the last of these, time 25, is an actual censoring time.
For counting process data the \code{rttright} function will insist on an
\code{id} statement to correctly group the rows for a subject, and makes
use of \code{survcheck} to ensure that there are no gaps or overlaps.
(The \code{survfit} routine imposes the same restriction.)

A general form of the RTTR algorithm that allows for multiple states has
the following features.
\begin{enumerate}
  \item Censoring weights $c_i(t)$ for each observation are explicitly a
    function of time, and sum to 1.
  \item When an observation is censored, its weight is redistributed to all
    other observations in the same state.
  \item At any time, for any given state, the non-zero weights for all
    observations in that state are proportional to the prior case weights
    $w_i$.
\end{enumerate}
The probability in state $j$ is estimated as the sum of the weights of
those currently in state $j$,
\begin{equation*}
  p_j(t) = \sum_{i: s_i(t)=j} c_i(t)
\end{equation*}
where $s_i(t)$ is the state occupied by observation $i$ at time $t$.
This estimate agrees with the Aalen-Johansen estimate.
Point 3 means that whenever an observation changes state, this necessitates
a rebalancing of the weights in the new state.

As an example, assume a model with three states
1 $\rightarrow$ 2 $\rightarrow$ 3, five subjects who start in state 1, and
the following data set, where a state of 0 denotes censoring.
<<>>=
tdata <- data.frame(id   = c(1,1,1, 2, 3,3, 4, 5),
                    time1= c(0, 1, 4, 0, 0, 3, 0, 0),
                    time2= c(1, 4, 8, 2, 3, 6, 7, 8),
                    state= c(2, 3, 0, 0, 2, 0, 0, 0))
tdata
@
At time 2.1 the weights are 1/5, 0, 4/15, 4/15, 4/15: observation 1 is in
state 2, observations 3--5 are in state 1, and the weight of observation 2,
censored at time 2, has been redistributed to the others then in state 1.
At time 3 observation 3 transitions to state 2, which rebalances the
weights in that state, and at time 4 observation 1 transitions to state 3
with a weight of (1/5 + 4/15)/2.

For an absorbing state, i.e., one with no departures, the rebalancing step
is not technically necessary since it does not change the sum.
As a practical matter in the code, there is actually no need to rebalance a
state until some observation departs from it to another state.
This fits well with a design decision of the survival package to not force
declaration of the absorbing states by the user, but to simply notice them
as those states from which no one departs.

For the case of a simple alive-dead or competing risks model, the weights
generated by the RTTR approach also have a simple representation as
$1/G(t)$ for some censoring distribution $G$.
The keys that make it work for these two cases are that everyone starts in
the same state and that subjects are only censored from that state; there
is in essence only one censoring distribution to keep track of.
There is no natural way (that we have seen) to express RTTR weights that
involve rebalancing as the inverse of a censoring distribution $G$.
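As a small check on the arithmetic, the weights quoted for the five-subject
example above can be reproduced by hand; the lines below are only a sketch
of the bookkeeping, not the algorithm as implemented in \code{rttright}.
\begin{verbatim}
wt <- rep(1/5, 5)           # five subjects, equal starting weights
# time 2: observation 2 is censored while in state 1; its weight is
# redistributed to observations 3-5, the others currently in state 1
wt[3:5] <- wt[3:5] + wt[2]/3
wt[2]   <- 0
wt                          # 1/5, 0, 4/15, 4/15, 4/15 as quoted above
# time 3: observation 3 joins observation 1 in state 2, so the weights
# in that state are rebalanced (equal case weights, hence the mean)
wt[c(1,3)] <- mean(wt[c(1,3)])
wt[1]                       # (1/5 + 4/15)/2, carried into state 3 at time 4
\end{verbatim}

\end{document}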