\documentclass{article} \usepackage{amsmath} \usepackage{Sweave} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} \SweaveOpts{keep.source=TRUE, fig=FALSE} %\VignetteIndexEntry{Splines, plots, and interactions} % Ross Ihaka suggestions \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \SweaveOpts{prefix.string=splines,width=6,height=4} \setkeys{Gin}{width=\textwidth} \newcommand{\code}[1]{\texttt{#1}} <>= options(continue=" ", width=60) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1)))) pdf.options(pointsize=8) #text in graph about the same as regular text options(contrasts=c("contr.treatment", "contr.poly")) #reset default @ \title{Spline terms in a Cox model} \author{Terry Therneau} \begin{document} \maketitle This is a trio of topics that comes up just often enough in my work that I end up re-discovering how to do it correctly about once a year. A note showing how may be useful to others, it is certainly a useful reference for me. \section{Plotting smooth terms} Here is a simple example using the MGUS data. I prefer a simpler color palette than the default found in termplot. <>= library(survival) mfit <- coxph(Surv(futime, death) ~ sex + pspline(age, df=4), data=mgus) mfit termplot(mfit, term=2, se=TRUE, col.term=1, col.se=1) @ Note that the \code{term=2} option is passed directly from the \code{termplot} routine to a \code{predict(fit, type='terms')} call. For coxph models, the \code{predict} function allows terms to be specified either by position or name. Other routines, e.g. \code{gam}, respond only to a name. (This can be a bit of a pain since it must exactly match the \emph{printed} call in both spelling and spacing; and the printed spacing may not match what the user typed.) Three questions are whether the curve is significantly non-linear, how the curve is centered and whether we can easily plot it on the hazard as opposed to the log hazard scale. The first question is answered by the printout, the solution to the others is to use the plot=FALSE option of termplot, which returns the data points that would be plotted back to the user. <>= ptemp <- termplot(mfit, se=TRUE, plot=FALSE) attributes(ptemp) ptemp$age[1:4,] @ The termplot function depends on a call to predict with type='terms', which returns a centered set of predictions. Like a simple linear model fit, the intercept is a separate term, which is found in the ``constant'' attribute above, and each column of the result is centered so that the average predicted value is zero. Since any given $x$ value may appear multiple times in the data and thus in the result of predict, and the termplot function removes duplicates, the data returned by \code{termplot} may not be precisely centered at zero. Now suppose we want to redraw this on log scale with age 50 as the reference, i.e., the risk is 1 for a 50 year old. Since the Cox model is a relative hazards model we can choose whatever center we like. (If there were no one of exactly age 50 in the data set the first line below would need to do an interpolation, e.g. by using the approx function.) <>= ageterm <- ptemp$age # this will be a data frame center <- with(ageterm, y[x==50]) ytemp <- ageterm$y + outer(ageterm$se, c(0, -1.96, 1.96), '*') matplot(ageterm$x, exp(ytemp - center), log='y', type='l', lty=c(1,2,2), col=1, xlab="Age at diagnosis", ylab="Relative death rate") @ Voila! We now have a plot that is interpretable with respect to a fixed reference. The approach is appropriate for any term, not just psplines. The above plot uses log scale for the y axis which is appropriate for the question of whether a non-linear age effect was even necessary for this model (it is not), one could remove the log argument to emphasize the Gomperzian effect of age on mortality. \section{Monotone splines} Consider the following model using the \code{mgus2} data set. <>= fit <- coxph(Surv(futime, death) ~ age + pspline(hgb, 4), mgus2) fit termplot(fit, se=TRUE, term=2, col.term=1, col.se=1, xlab="Hemoglobin level") @ Low hemoglobin or anemia is a recognized marker of frailty in older age, so the rise in risk for low levels is not surprising. The rise on the right hand portion of the curve is less believable --- the normal range of HGB is 12-15.5 for women and 13.5 to 17.5 for men, why would we expect a rise there? A monotone fit that forces the curve to be horizontal from 14 onward fits well within the confidence bands, so we might want to force monotonicity. There are two tools for this within the pspline function. The first is to decrease the overall degrees of freedom and the second is to use \code{combine} option to force equality of selected coefficients. Start by decreasing the degrees of freedom. The pspline function automatically picks the number of basis (nterms) to be ``sufficiently large'' for the given degrees of freedom. We fix it at a single value for the rest of this example to better isolate the effects of degrees of freedom and of constraints. <>= termplot(fit, se=TRUE, col.term=1, col.se=1, term=2, xlab="Hemoglobin level", ylim=c(-.4, 1.3)) df <- c(3, 2.5, 2) for (i in 1:3) { tfit <- coxph(Surv(futime, death) ~ age + pspline(hgb, df[i], nterm=8), mgus2) temp <- termplot(tfit, se=FALSE, plot=FALSE, term=2) lines(temp$hgb$x, temp$hgb$y, col=i+1, lwd=2) } legend(14, 1, paste("df=", c(4, df)), lty=1, col=1:4, lwd=2) @ This has reduced, but not eliminated, the right hand rise at the expense of a less sharp transition at the value of 14. The \code{combine} option makes use of a property of the P-spline basis, which is that the curve will be monotone if and only if the coefficients are monotone. We can then use a pool adjacent violators algorithm to sequentially force equality for those coefficients which go the wrong way. Look at the coefficients for the fit with 2.5 degrees of freedom. <>= fit2a <- coxph(Surv(futime, death) ~ age + pspline(hgb, 2.5, nterm=8), mgus2) coef(fit2a) plot(1:10, coef(fit2a)[-1]) @ Now force the last 3 to be equal, then the last 4, and see how this changes the fit. <>= temp <- c(1:7, 8,8,8) fit2b <- coxph(Surv(futime, death) ~ age + pspline(hgb, 2.5, nterm=8, combine=temp), data= mgus2) temp2 <- c(1:6, 7,7,7,7) fit2c <- coxph(Surv(futime, death) ~ age + pspline(hgb, 2.5, nterm=8, combine=temp2), data= mgus2) matplot(1:10, cbind(coef(fit2a)[-1], coef(fit2b)[temp+1], coef(fit2c)[temp2+1]), type='b', pch='abc', xlab="Term", ylab="Pspline coef") @ We see that constraining the last four terms along with a degrees of freedom of is almost enough to force monotonicity; it may be sufficient if our goal is a simple plot for display. This dance between degrees of freedom, number of terms, and constraints has a component of artistry. When all three values become large the result will begin to approach a step function, reminiscent of non-parametric isotonic regression, whereas small values begin to approach a linear fit. The best compromise of smoothness and constraints will be problem specific. \section{Splines in an interaction} As an example we will use the effect of age on survival in the \texttt{flchain} data set, a population based sample of subjects from Olmsted County, Minnesota. If we look at a simple model using age and sex we see that both are very significant. <>= options(show.signif.stars=FALSE) # display intelligence fit1 <- coxph(Surv(futime, death) ~ sex + pspline(age, 3), data=flchain) fit1 termplot(fit1, term=2, se=TRUE, col.term=1, col.se=1, ylab="log hazard ratio") @ We used a \code{pspline} term rather than \code{ns}, say, because the printout for a pspline nicely segregates the linear and non-linear age effects. The non-linearity is not very large, as compared to the linear portion, but still may be important. We would like to go forward and fit separate age curves for the males and the females, since the above fit makes an untested assumption that the male/female ratio of death rates will be the same at all ages. The primary problem is that a formula of \texttt{sex * pspline(age)} does not work; the coxph routine is not clever enough to do the right thing automatically for \code{pspline} interactions. (Perhaps some future version will be sufficiently intelligent, but don't hold your breath). As a first solution we will use regression splines, i.e., splines that can be represented using a basis matrix. <>= options(show.signif.stars=FALSE) # display statistical intellegence require(splines, quietly=TRUE) nfit1 <- coxph(Surv(futime, death) ~ sex + age, flchain) nfit2 <- coxph(Surv(futime, death) ~ sex + ns(age, df=3), flchain) nfit3 <- coxph(Surv(futime, death) ~ sex * ns(age, df=3), flchain) anova(nfit1, nfit2, nfit3) @ The nonlinear term is significant but the interaction is not. Nevertheless we would like to plot the two estimated curves for \code{nfit3}, expecting that they will be approximately parallel. The \code{termplot} routine is not able to deal with models that have an interaction and will bow out with a warning message; we use explicit prediction instead, which is nearly as easy. <>= pdata <- expand.grid(age= 50:99, sex=c("F", "M")) pdata[1:5,] ypred <- predict(nfit3, newdata=pdata, se=TRUE) yy <- ypred$fit + outer(ypred$se, c(0, -1.96, 1.96), '*') matplot(50:99, exp(matrix(yy, ncol=6)), type='l', lty=c(1,1,2,2,2,2), lwd=2, col=1:2, log='y', xlab="Age", ylab="Relative risk") legend(55, 20, c("Female", "Male"), lty=1, lwd=2, col=1:2, bty='n') abline(h=1) @ The \code{ns} function generates a basis of dummy variables to represent the spline, which will work automatically in interactions. The coefficients that result are not very interpretable, but the result of predict is invariant to this. The issues as compared to using \code{termplot} are \begin{enumerate} \item We need to provide our own set of predictor values for the plot, whereas \code{termplot} would automatically use the set of unique age and sex values. \item Keeping track of the indexing. The predict function produces two vectors each of length \code{nrow(pdata)} with a predicted value and its standard error, one value for each row of \code{pdata}. The data set is in order of females, then males. We fold it into an appropriate matrix for use with matplot. \item The vertical centering of the curves corresponds to an average population predictor of $\eta = X\beta =0$; i.e., the average over the subjects in the data set. This is a consequence of the variable centering that \code{coxph} does for numerical stability. To center at some particular (age, sex) pair obtain the predicted value for that fictional subject and subtract the resulting value from \code{yy}. \end{enumerate} To create the same figure with \code{pspline} curves it is necessary to code the males and females as two separate terms. To do this create our own dummy variables to handle the interaction. <>= agem <- with(flchain, ifelse(sex=="M", age, 60)) agef <- with(flchain, ifelse(sex=="F", age, 60)) fit2 <- coxph(Surv(futime, death) ~ sex + pspline(agef, df=3) + pspline(agem, df=3), data=flchain) anova(fit2, fit1) @ You might well ask why we used 60 as a dummy value of \texttt{agem} for the females instead of 0? If a value of 0 is used it forces the pspline function to create a basis set that includes all the empty space between 0 and 50, and do predictions at 0; these last can become numerically unstable leading to errors or incorrect values. Best is to pick a value close to the mean, though any value within the range will do. For this plot we will use a 65 year old female as the reference. <>= # predictions pdata2 <- pdata pdata2$agem <- with(pdata2, ifelse(sex=="M", age, 60)) pdata2$agef <- with(pdata2, ifelse(sex=="F", age, 60)) ypred2 <- predict(fit2, pdata2, se=TRUE) yy <- ypred2$fit + outer(ypred2$se, c(0, -1.96, 1.96), '*') # reference refdata <- data.frame(sex='F', agef=65, agem=60) ref <- predict(fit2, newdata=refdata, type="lp") # plot matplot(50:99, exp(matrix(yy-ref, ncol=6)), type='l', lty=c(1,1,2,2,2,2), lwd=2, col=1:2, log='y', xlab="Age", ylab="Relative risk") legend(55, 20, c("Female", "Male"), lty=1, lwd=2, col=1:2, bty='n') abline(h=1) @ The final curves for males and female are not quite parallel, most of the difference is at the highest ages, however, where there are very few subjects. One thing the plot does not display is that the spacing between the male and female points also has a standard error. This moves the entire bundle of three red curves up and down. It is not clear how best to add this information into the plot. For questions of parallelism and shape, as here, it seemed best to ignore it, which is what the termplot function also does. If someone were reading individual male/female differences off the plot a different choice would be appropriate. \end{document}