[R] A stats question -- about survival analysis and censoring

Geoff Russell geoffrey.russell at gmail.com
Tue Mar 11 11:47:04 CET 2008


Hi Max, Matthias and David and anyone else interested,

My question was prompted by seeing various prospective
cohort studies which are deriving the RR of colorectal cancer due to
smoking and which simply censor
people who die from diseases other then colorectal cancer and then use
coxph to get an estimate of the CRC due to smoking.  When (and only when)
those diseases are also caused by smoking then I think this reduces the
apparent risk ratio of crc due to smoking. The following R code
summarises what I think is happening and why simply censoring these other
deaths is wrong. Mind you I have absolutely no idea how these studies
should proceed, its just that it looks wrong to me and I'm trying to find out
what other people think.

Some might argue that dead people can't get crc, but consider a second
case analogous to the first. Red meat and crc. If heart disease is censored then
the RR of crc due to red meat is reduced. But as heart disease is reduced by
better treatment, the crc "suddenly" emerges. The relationship was hidden
by virtue of the censoring protocol.

Cheers,
Geoff.

R code follows



library(survival)
set.seed(2)

popsize<-3000
smoking<-popsize/2
nonsmoke<-popsize/2

#----------------------------------------------------------
# generate a bunch of times for people to be either
# censored or diagnosed
#----------------------------------------------------------
time<-c(rnorm(popsize,75,15))

#----------------------------------------------------------
# allocate people to smoking/nonsmoking groups
#----------------------------------------------------------
smokestatus<-c(rep(1,smoking),rep(0,nonsmoke))

#----------------------------------------------------------
# pick a colorectal cancer (crc) rate of 1 in 20 uniformly distributed
# Aussie lifetime rate is about 1 in 17
#----------------------------------------------------------
status<-c(sample(c(0,1),popsize,replace=T,prob=c(0.95,0.05)))

#----------------------------------------------------------
# alternatively generate a higher rate of crc in the smokers
#----------------------------------------------------------
statusAlt<-c(
 	sample(c(0,1),smoking,replace=T,prob=c(0.9,0.1)),
 	sample(c(0,1),nonsmoke,replace=T,prob=c(0.95,0.05))
 	)

#----------------------------------------------------------
# now generate some heart deaths with more in the smokers
# The ratio of deaths from heart disease to crc is about 5
# to 1 in EPIC-oxford cohort, and our numbers below are similar.
#----------------------------------------------------------
 heartattack<-c(
 sample(c(0,1),smoking,replace=T,prob=c(0.7,0.3)),
 sample(c(0,1),nonsmoke,replace=T,prob=c(0.85,0.15))
 )

#----------------------------------------------------------
# people are only diagnosed with CRC if they don't
# die of heart disease --- on assumption that heart
# disease usually occurs before crc
#----------------------------------------------------------
 statusHd<-as.integer(status&(!heartattack))

 statusAltHd<-as.integer(statusAlt&(!heartattack))

 data <- data.frame(time,status,smokestatus)
 dataHd <- data.frame(time,statusHd,smokestatus)

 dataAlt <- data.frame(time,statusAlt,smokestatus)
 dataAltHd <- data.frame(time,statusAltHd,smokestatus)

 cSmoke<-coxph(Surv(time,status)~smokestatus, data)
 cHd<-coxph(Surv(time,statusHd)~smokestatus, dataHd)

 cAltSmoke<-coxph(Surv(time,statusAlt)~smokestatus, dataAlt)
 cAltHd<-coxph(Surv(time,statusAltHd)~smokestatus, dataAltHd)
#----------------------------------------------------------
# as expected RR for crc of smokers relative to non-smokers is about 1
#----------------------------------------------------------
 summary(cSmoke)
Call:
coxph(formula = Surv(time, status) ~ smokestatus, data = data)

  n= 3000
               coef exp(coef) se(coef)      z    p
smokestatus -0.0221     0.978    0.149 -0.148 0.88

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus     0.978       1.02      0.73      1.31

Rsquare= 0   (max possible= 0.563 )
Likelihood ratio test= 0.02  on 1 df,   p=0.882
Wald test            = 0.02  on 1 df,   p=0.882
Score (logrank) test = 0.02  on 1 df,   p=0.882

#----------------------------------------------------------
# but the RR drops when heart disease takes out crc cases
#----------------------------------------------------------
summary(cHd)
Call:
coxph(formula = Surv(time, statusHd) ~ smokestatus, data = dataHd)

  n= 3000
              coef exp(coef) se(coef)      z    p
smokestatus -0.164     0.848    0.183 -0.898 0.37

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus     0.848       1.18     0.592      1.21

Rsquare= 0   (max possible= 0.424 )
Likelihood ratio test= 0.81  on 1 df,   p=0.369
Wald test            = 0.81  on 1 df,   p=0.369
Score (logrank) test = 0.81  on 1 df,   p=0.369


#----------------------------------------------------------
# when there really is a relationship
#----------------------------------------------------------
 summary(cAltSmoke)

Call:
coxph(formula = Surv(time, statusAlt) ~ smokestatus, data = dataAlt)

  n= 3000
             coef exp(coef) se(coef)    z     p
smokestatus 0.535      1.71    0.137 3.89 1e-04

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus      1.71      0.586      1.30      2.23

Rsquare= 0.005   (max possible= 0.659 )
Likelihood ratio test= 15.7  on 1 df,   p=7.59e-05
Wald test            = 15.1  on 1 df,   p=0.000101
Score (logrank) test = 15.5  on 1 df,   p=8.34e-05

#----------------------------------------------------------
# and the RR drops when heart disease is added
#----------------------------------------------------------
 summary(cAltHd)
Call:
coxph(formula = Surv(time, statusAltHd) ~ smokestatus, data = dataAltHd)

  n= 3000
             coef exp(coef) se(coef)    z     p
smokestatus 0.362      1.44    0.162 2.24 0.025

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus      1.44      0.696      1.05      1.97

Rsquare= 0.002   (max possible= 0.526 )
Likelihood ratio test= 5.09  on 1 df,   p=0.024
Wald test            = 5.01  on 1 df,   p=0.0252
Score (logrank) test = 5.07  on 1 df,   p=0.0244



More information about the R-help mailing list