\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
\SweaveOpts{keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\SweaveOpts{prefix.string=adjcurve,width=6,height=4}
\setkeys{Gin}{width=\textwidth}
%\VignetteIndexEntry{Brier scores and the redistribute-to-the-right algorithm}

<<echo=FALSE>>=
options(continue=" ", width=60)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=8) #text in graph about the same as regular text
library(survival, quietly=TRUE)
@

\title{Brier scores and the redistribute-to-the-right algorithm}
\author{Terry Therneau}
\date{March 2023}

\newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth]
  {redistrib-#1.pdf}}
\newcommand{\code}[1]{\texttt{#1}}

\begin{document}
\maketitle

\section{Kaplan-Meier}
The \code{rttright} function in the package illustrates the
redistribute-to-the-right (RTTR) algorithm.  It was introduced by Efron as
another way to compute the Kaplan-Meier estimate, but even more as an
alternative way to understand it.  The function is much slower and less
capable than \code{survfit}, but it has been useful as part of the test
suite.  Its primary purpose in the package, however, is illustration.

Consider a set of subjects followed forward in time, each with an initial
case weight of 1.  Whenever a subject is censored, distribute their current
case weight evenly among all those who are still at risk, i.e., not yet dead
or censored.  The basic idea is rather like someone distributing their
assets, equally, to all the remaining relatives before exiting the stage.

\begin{table}
  \centering
  \begin{tabular}{cccccc}
    & & \multicolumn{4}{c}{Weight at time} \\
    Id & Survival & 0 & 2+ & 4+ & 5+ \\ \hline
    a & 1 & 1 & 1   & 1    & 1 \\
    b & 2 & 1 & 1   & 1    & 1 \\
    c & 2+& 1 & 0   & 0    & 0 \\
    d & 3 & 1 & 8/7 & 8/7  & 8/7 \\
    e & 4+& 1 & 8/7 & 0    & 0 \\
    f & 4+& 1 & 8/7 & 0    & 0 \\
    g & 5 & 1 & 8/7 & 12/7 & 12/7\\
    h & 5+& 1 & 8/7 & 12/7 & 0 \\
    i & 8 & 1 & 8/7 & 12/7 & 18/7 \\
    j & 9 & 1 & 8/7 & 12/7 & 18/7 \\
    Total & & 10 & 10 & 10 & 10
  \end{tabular}
  \caption{Hand calculation of the new RTTR weights, after each censoring
    time.}
  \label{tab1}
\end{table}

Table \ref{tab1} shows a hypothetical example with 10 subjects whose data is
1, 2, 2+, 3, 4+, 4+, 5, 5+, 8, and 9; the numbers are the follow-up times for
each subject and the `+' indicates censoring.  All start with a weight of 1.
At time 2+ subject c transfers their weight evenly to the 7 subjects with
longer survival (d--j), leaving each of them with a weight of $1 + 1/7 =
8/7$.  For tied times, censorings are assumed to occur after deaths, so
subject b does not participate in the redistribution.  At time 4+ subjects e
and f each transfer a weight of 8/7 evenly to the remaining 4 (g--j), leaving
each of those with a new weight of $8/7 + (1/4)(8/7) + (1/4)(8/7) = 12/7$.
Finally, at time 5+ subject h transfers their weight of 12/7 equally to i and
j, leaving each of those with a weight of $12/7 + (1/2)(12/7) = 18/7$.  After
each transfer the total sum of the weights is unchanged.

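
As a quick check, the same weights can be recovered from the \code{rttright}
function itself.  The short sketch below creates the hypothetical data of
Table \ref{tab1} (the data frame and variable names are ours, purely for
illustration); since the routine returns weights that have been renormalized
to sum to 1, a point discussed below, multiplying by 10 puts them back on the
scale of the table.

<<eval=FALSE>>=
# the hypothetical data of Table 1: a sketch, not part of the package tests
tdata <- data.frame(time   = c(1, 2, 2, 3, 4, 4, 5, 5, 8, 9),
                    status = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1))
twt <- rttright(Surv(time, status) ~ 1, data= tdata, times= 9)
rbind(time= tdata$time, weight= round(10* twt, 3))  # compare to Table 1
@
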
If we form a (weighted) empirical CDF of the final data, it will have jumps
of 1/10 at times 1 and 2, 8/70 at time 3, 12/70 at time 5, and 18/70 at times
8 and 9, which are exactly the step sizes of the Kaplan-Meier estimate.  (In
practice the \code{rttright} routine first renorms the weights so as to sum
to 1, thus the final weights do not have to be divided by 10 when creating
the empirical CDF.  The table was easier to present with initial weights of 1
rather than 1/10, however.)  As a reminder, the KM is defined as the
sequential product, over the event times, of the fraction of those at risk
who did not have an event.  That is,

<<>>=
km <- rbind(time= c(0, 1, 2, 3, 5, 8, 9),
            km  = cumprod(c(1, 9/10, 8/9, 6/7, 3/4, 1/2, 0/1)))
km
@

Another way to compute the RTTR weights is by using the inverse probability
of censoring.  Let $G(t)$ be the KM calculation where censoring is considered
as the event.  This has the values shown below.

<<>>=
G <- cumprod(c(1, 7/8, 4/6, 2/3))
names(G) <- c(0, "2+", "4+", "5+")
G
7/G
@

Because, at tied times, censoring is taken to happen after death, the product
term at time 2+ is 7/8, 7 uncensored out of 8 at risk for censoring, whereas
the term was 8/9 in the KM for death.  (The fairly common practice of using
\code{survfit(Surv(time, 1-status) \textasciitilde 1)} does not give correct
values for $G$ if there are any tied death/censoring pairs.)  $1/G$ has
values of 1, 8/7, 12/7 and 18/7.  That is, assigning a weight of $1/G(t)$ to
an event at time $t$ and a weight of 0 to censored values exactly reproduces
the RTTR weights.

The same care about tied times that was used to compute $G$ also applies to
its use.  Formally, $G(t)$ is defined as a left-continuous function: the
death at time 2 is assigned a weight based on $G(2- \epsilon) =1$, rather
than $G(2+\epsilon) = 7/8$.  Again, this is an unfortunately common coding
error.  We have used the \code{rttright} function as a code check for our
$1/G$ methods, as well as others.

The correspondence between inverse probability of censoring (IPC) weights and
the RTTR is related to survey sampling ideas.  At any chosen time point $s$
the RTTR weight for those still alive at $s$ will be larger than one: each
such subject represents both themselves and some fraction of those who have
been censored.  The survey sampling version of the same idea gives a weight
of 1/Pr(not censored) to each of these subjects, which turns out to be the
same thing.

The addition of case weights adds a small wrinkle to the algorithm: when the
weight for a censored subject is redistributed, it is now done in proportion
to the case weights.  Practically, it only means that the code has to keep
the original case weights as a separate vector from the ongoing,
redistributed weight for each observation.  For the IPC approach, the new
weights are $w_i/G(t_i-)$, where $w$ is the original case weight.

\section{Competing risks}
For the case of competing risks all of the above arguments translate
directly.  Assume for instance that there were three causes of death, A =
cardiac, B = cancer, and C = other, and that the deaths at times 1, 3 and 9
were of type A, those at times 2 and 5 of type B, and the death at time 8 of
type C.  After application of the RTTR or IPC process we have new weights, as
before.  The cardiac death curve will start at 0 and have jumps of size 1/10,
8/70 and 18/70 at times 1, 3 and 9, respectively, the cancer death curve
jumps of size 1/10 and 12/70 at times 2 and 5, and the other death curve a
jump of size 18/70 at time 8.

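
Continuing the small sketch from above (the cause labels are, of course,
invented for the illustration), the total RTTR weight within each cause can
be compared directly with the final probabilities from a multi-state
\code{survfit} call:

<<eval=FALSE>>=
# causes of death for the 10 hypothetical subjects, in data order a--j
cause <- c("cardiac", "cancer", "censor", "cardiac", "censor", "censor",
           "cancer", "censor", "other", "cardiac")
tdata$event <- factor(cause, c("censor", "cardiac", "cancer", "other"))

tapply(twt, tdata$event, sum)   # 0 for censor, then 33/70, 19/70, 18/70
ajfit <- survfit(Surv(time, event) ~ 1, data= tdata)
ajfit$pstate[nrow(ajfit$pstate), ]   # final state probabilities
@
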
This matches the result of an Aalen-Johansen curve, i.e., the result of
\code{survfit} when the status variable has been replaced by a factor with
levels of censor, cardiac death, cancer death, and other death.

A common error for competing risks data is to estimate one of the failure
curves using an ordinary Kaplan-Meier, with a status variable of 1 for the
endpoint of interest and 0 for censoring \emph{and} the other endpoints.  The
RTTR approach illustrates the folly of this: the KM for time to cardiac death
corresponds to an RTTR calculation that redistributes the weight for subject
b at $t=2$, the time of their cancer death.  This implicitly assumes that
subject b is still at risk for a cardiac death sometime in the future, and it
leads, not surprisingly, to an overestimate of the cumulative cardiac death
risk.  A correct RTTR redistributes only actual censorings.

\section{Brier score}
A well known goodness-of-fit estimate for binomial data is the Brier score,
the mean squared error between the 0/1 response $y$ and the prediction
$\hat p$ from a fitted model; it is most useful when rescaled into an $R^2$.
Recall the usual definition
\begin{equation*}
  R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i- \overline y)^2}
\end{equation*}

Adapting $R^2$ to survival data involves 3 steps.  The first two are
straightforward: decide on a particular time point $\tau$, then use the 0/1
indicator of death by time $\tau$ as the response $y$; and decide on the
intercept value $\overline y$ to be used as the reference.  For the second of
these the usual choice is $1- KM(\tau)$, the estimated overall probability of
death by time $\tau$ based on the Kaplan-Meier.  Below we will use the
Rotterdam data set, which contains long term death and recurrence status for
a cohort of breast cancer patients; there is a median follow-up of 9.3 years
and a median recurrence free survival of 6.9 years.

<<>>=
# recurrence free survival (earlier of death or progression)
# see help(rotterdam) for an explanation of the ignore variable
ignore <- with(rotterdam, recur ==0 & death==1 & rtime < dtime)
rfs <- with(rotterdam, ifelse(recur==1 | ignore, recur, death))
rfstime <- with(rotterdam, ifelse(recur==1 | ignore, rtime, dtime))/365.25
rsurv <- survfit(Surv(rfstime, rfs) ~1, rotterdam)
rsurv
ybar <- 1- summary(rsurv, time=4)$surv

rfit <- coxph(Surv(rfstime, rfs) ~ pspline(age) + meno + size +
                  pmin(nodes,12), rotterdam)
psurv <- survfit(rfit, newdata= rotterdam)
dim(psurv)
yhat <- 1- summary(psurv, time=4)$surv
@

For the above fit, say we would like to evaluate the Brier score at $\tau=4$
years.  An overall probability of death by 4 years is easily extracted from
the KM \code{rsurv}.  For $\hat p$ we first obtain all
\Sexpr{nrow(rotterdam)} predicted survival curves based on the fitted model,
one for each subject, and then read off the prediction at 4 years for each.

The third ingredient of $R^2$ is the 0/1 alive/dead status $y_i$ of each
subject at 4 years, and this raises an immediate difficulty: what value
should be used for subject 33, who is censored at 2.8 years?  We simply do
not know what their alive/dead status will be at 4 years.  There are 62 such
subjects.  One very bad idea is to simply leave them out of the calculation.
A better one is to employ the RTTR algorithm, stopping at 4 years.  The
result is a weight vector which is 0 for the 62 early censorings; each one's
weight has been appropriately redistributed over those with a longer
follow-up.

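
In symbols, if $w_i$ is the RTTR weight at $\tau$, $y_i(\tau)$ is the 0/1
death status at $\tau$ (the value for the zero-weight early censorings is
irrelevant), and $\hat p_i$ is the model's predicted probability of an event
by $\tau$, then one way to write the computation carried out below is
\begin{equation*}
  B_1(\tau) = \frac{\sum_i w_i \{y_i(\tau) - \hat p_i\}^2}{\sum_i w_i}, \quad
  B_0(\tau) = \frac{\sum_i w_i \{y_i(\tau) - \overline y\}^2}{\sum_i w_i},
  \quad
  R^2(\tau) = 1 - B_1(\tau)/B_0(\tau).
\end{equation*}
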
<<>>=
wt4 <- rttright(Surv(rfstime, rfs) ~ 1, times =4, rotterdam)
table(wt4 ==0)

dead4 <- ifelse(rfstime <= 4, rfs, 0)   # 0/1 status of death by 4 years
brier1 <- sum(wt4 * (dead4- yhat)^2)/ sum(wt4)
brier0 <- sum(wt4 * (dead4- ybar)^2) / sum(wt4)
r2 <- 1- (brier1/brier0)
temp <- c(numerator= brier1, denominator = brier0, rsquared = r2)
round(temp,3)
@

Formally, the value \code{brier1} above is the Brier score.  The result is
far more useful in the r-squared form, however, since the normalizing value
(the denominator) can vary substantially across cutoff times.  Figure
\ref{brierfig} shows results for the Rotterdam data over a range of cutoff
times.  When the overall survival is above .8 both components of the score
are small.  The R-squared value has also been called the ``index of
prediction accuracy'' \cite{Kattan18}.  This computation for multiple cutoff
times is more efficiently done by the \code{brier} function.

<<>>=
cutoff <- seq(1, 12, by=.1)
bfit <- brier(rfit, cutoff)
names(bfit)
@

\begin{figure}
<<fig=TRUE>>=
oldpar <- par(mar=c(5,5,1,1), mfrow=c(1,2))
b0 <- bfit$brier/(1-bfit$rsquare)
matplot(cutoff, cbind(b0, bfit$brier, bfit$rsquare), type='l',
        lwd=2, lty=3:1, col=1, xlab="Cutoff (tau)", ylab="Value")
legend(4, .2, c("B0", "B1", "R-square"), lty=3:1, lwd=2, col=1, bty='n')
abline(v=2, lty=3)

plot(rsurv, xmax=12, lwd=2, conf.int=FALSE,
     xlab="Years since enrollment", ylab="RFS")
abline(v=2, lty=3)
par(oldpar)
@
\caption{Components of the Brier score at .1 year increments from 1 to 12
  years.  B0 = the denominator, the mean squared error using the overall KM
  as the prediction.  B1 = the numerator, the mean squared error using the
  model's predictions of survival.}
\label{brierfig}
\end{figure}

\section{Time dependent AUROC}
A parallel use for the RTTR is to compute a measure referred to as the
TD-ROC, though we think that a more descriptive label is the dichotomized
time ROC, or DT-ROC.  Essentially: dichotomize the survival time at a chosen
point $\tau$, compute the concordance $C$ between this binomial outcome and
the prediction score from a model, and then, since concordance and AUROC are
identical for a binomial outcome, relabel the result as an AUROC.  As with
the Brier score above, the key issue is dealing with subjects who are
censored before time $\tau$.

An alternative which involves neither dichotomization nor reweighting is the
thresholded concordance.  In many studies we may not be interested in the
accuracy of predictions beyond a particular time threshold.  Examples might
be a cancer trial where the drug effect for a subject, if any, is assumed to
occur within the first 3 years, or a risk score that will be used to decide
patient management over a 2 year time horizon.  In both cases, predictions at
longer times would be irrelevant to the question at hand.  A \code{ymax=3}
argument to the \code{concordance} function treats any observed outcomes of
over 3 years as tied with each other, causing those pairs to be ignored in
the calculation.

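
To make the two recipes concrete at a single cutoff, the sketch below repeats
the $\tau=4$ years example of the Brier section; the loop in the next code
chunk simply does the same thing over a grid of cutoffs.  (The
\code{reverse=TRUE} argument is used because a larger linear predictor from
the Cox model corresponds to shorter survival.)

<<eval=FALSE>>=
# a sketch at the single cutoff tau = 4, reusing the 4 year RTTR weights wt4
alive4 <- ifelse(rfstime > 4, 1, 0)   # 1 = still event free at 4 years
concordance(alive4 ~ rfit$linear.predictors, weights= wt4, reverse=TRUE)
concordance(rfit, ymax= 4)            # thresholded concordance at 4 years
@
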
<<>>=
rwt <- rttright(Surv(rfstime, rfs) ~1, rotterdam, times= cutoff)
cstat <- matrix(0, length(cutoff), 4)
for (i in 1:length(cutoff)) {
    temp1 <- concordance(rfit, ymax= cutoff[i])
    ycut <- ifelse(rfstime > cutoff[i], 1, 0)
    temp2 <- concordance(ycut ~ rfit$linear.predictors, weights= rwt[,i],
                         reverse=TRUE)
    cstat[i,] <- c(temp1$concordance, temp2$concordance,
                   sqrt(temp1$var), sqrt(temp2$var))
}
dimnames(cstat) <- list(cutoff, c("Threshold C", "Dichotomized time C",
                                  "sd1", "sd2"))
@

\begin{figure}
<<fig=TRUE>>=
oldpar <- par(mfrow=c(1,2), mar=c(5,5,1,1))
yhat <- cbind(cstat[,1] + outer(cstat[,3], c(0, -1.96, 1.96), "*"),
              cstat[,2] + outer(cstat[,4], c(0, -1.96, 1.96), "*"))
matplot(cutoff, yhat[,c(1,4)], ylim=range(yhat), type='l', lwd=2, lty=1,
        xlab="Cutoff (tau)", ylab="Concordance")
ii <- match(c(1,2,4,6,8, 10,12), cutoff)
eps <- .02
segments(cutoff[ii]-eps, yhat[ii, 2], cutoff[ii]-eps, yhat[ii,3])
segments(cutoff[ii]+eps, yhat[ii, 5], cutoff[ii]+eps, yhat[ii,6], col=2)

matplot(cutoff, cstat[,3:4], type='l', lty=1, lwd=2,
        ylim=c(0, max(cstat[,3:4])),
        xlab= "Cutoff (tau)", ylab="Standard error")
par(oldpar)
@
\caption{Left panel: time threshold (black) and dichotomized time (red)
  concordance estimates for the fitted Rotterdam model, along with 95\%
  confidence intervals.  Right panel: the corresponding standard errors.}
\label{tdroc}
\end{figure}

Figure \ref{tdroc} shows the two concordance estimates.  The standard errors
start out similar: at early times there are few deaths, so both estimates
have limited information.  By year 6 the threshold C is nearly at the
non-thresholded value and its standard error has stabilized.  The
dichotomized time (DT-C) estimates are always the larger of the two, and have
a standard error that first falls and then grows, the latter due to an
increasing number of large redistributed weights that are concentrated on
only a few subjects.  As a comparison, the $R^2$ estimate of .11 corresponds
to about .55 when put on the same scale as the concordance ($R^2$ ranges from
$-1$ to 1 while $C$ ranges from 0 to 1).  It is not fair to say that one
measure is better than another based on overall size, however, since the
three are estimating different targets.  A simple yes/no outcome is easier to
predict than the full rank vector, which is in turn easier to predict than a
continuous probability.

\section{Covariate based RTTR}
The RTTR algorithm serves as a good reminder that when a subject is censored,
it is important that they do not differ in some systematic way from everyone
else who is still under observation; otherwise the notion that the others
fairly ``represent the future'' of the censored observation will not be
valid.  If there is something systematically different about the censored
subjects, this will lead to bias in the estimated survival probabilities.
Therefore, when estimating separate curves for subgroups of patients (e.g.,
males and females, or treatment groups), the RTTR may be applied separately
within each subgroup of interest.  The assumption is, e.g., that
redistributing a censored male's weight to the other males provides a cohort
whose future is ``more like'' what would have been observed had that subject
not been censored.  It is a bit like the temptation to leave one's
inheritance to the child who is most like you.  Of course, one cannot guess
the future.

Another strategy is to attempt balance.  For instance, assume that 2/3 of the
males but only 1/3 of the females are censored in a particular study.  It can
be argued that weights should then be redistributed preferentially to males,
thus preserving the overall balance of the study.  There are a few ways to go
about this.

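
One simple, if extreme, approach is to redistribute entirely within the
subgroup, i.e., the per-subgroup application of \code{rttright} described
above.  A minimal sketch is shown below, using the \code{lung} data only
because it contains a sex variable; with the default renormalization each
vector of weights sums to 1 within its own subgroup.

<<eval=FALSE>>=
# per-sex redistribution: a censored subject's weight goes only to others of
# the same sex.  Weights at 2 years (730 days), one vector per subgroup.
wt.male   <- rttright(Surv(time, status) ~ 1, data= subset(lung, sex==1),
                      times= 730)
wt.female <- rttright(Surv(time, status) ~ 1, data= subset(lung, sex==2),
                      times= 730)
@
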
If the data set is large, a separate logistic regression model can be fit at
each censoring time, with P(censored) as the response.  RTTR weights are then
reassigned proportional to the fitted probability of censoring or,
equivalently, by using 1/P(uncensored).  This highlights a close connection
between the RTTR and marginal structural models.  For smaller data sets one
can smooth over time by using an overall Cox model for the censoring, and
then using per-time-point predictions from it.  A Cox model with selected
time-dependent coefficients would be an intermediate approach.

\section{Delayed entry}
The discussion so far has assumed that any subject is represented by a proper
data set, i.e., no warning or error messages from the \code{survcheck}
function, and we will continue to assume that.  We have also assumed that all
subjects start at the same time, in the same state; this assumption can be
relaxed.  (This code is not yet implemented.)
\begin{itemize}
  \item Let $p(t)$ be the probability vector at time $t$, one element for
    each state.
  \item At the start we almost always have $p(0) = (1, 0, 0, \dots)$,
    everyone in the same starting state; by convention the starting state is
    placed first in the set of states.
  \item Let $t_0$ be the start time, which can be different from 0.  Then
    $p(t_0)$ is the empirical distribution of the states among the subjects
    present at $t_0$.
  \item When a subject is censored (really censored) their weight is
    distributed among all remaining subjects in the same state, proportional
    to the case weights.  We keep two vectors: the working weights, which are
    subject to redistribution, and the case weights, which stay constant.
  \item Whenever a subject enters a new state, either by delayed entry to the
    study or by a transition from another state, there is a pooling and
    redistribution.  All the working weights in the new state are added up,
    and then re-apportioned among those at risk for further transitions,
    proportional to the case weights.  It is a bit like a gang of socialist
    thieves: when a new member joins the crew, any resources they brought
    with them, plus all resources currently extant, are pooled and
    re-divided; everyone is equal.
\end{itemize}
If we do this, then at any given time the weighted distribution of states is
exactly the Aalen-Johansen estimate.  For tied times the order of operations
is first any moves to a new state, then censorings, then delayed entries.  If
someone is censored with no one else to give their weight to, it is not quite
clear what to do.

Geskus has another, very clever, way to do this.  Let $G(t)$ be the censoring
distribution, as before, and let $H(t)$ be the KM for entry, computed in
reverse time.  Then a working weight of $c\, w_i/[G(s-)H(s-)]$ can be used at
time $s$, where $w$ is the case weight.  The normalization constant $c$ is
used to make the working weights sum to 1 at the start of the process.

\end{document}