[R] How to use the whole dataset (including between events) in Cox model (time-varying covariates)?

Mayeul KAUFFMANN mayeul.kauffmann at tiscali.fr
Fri Aug 13 16:47:46 CEST 2004


>First, I do not know how to specify such a link function in R.
>Second, if I can specify such a link, I could use (in place of d*j) the
>smooth baseline estimated after doing a Cox regression. But I don't know
>how to fit (for instance) a piecewise constant baseline hazard with a
>Poisson glm.

I found a possible answer to one of my own questions, sorry for the
noise.
Simply adding
  +factor(year)
to the Poisson model will fit a piecewise constant baseline hazard (one
step every year), or
  +cut(year, breaks = seq(1961, 1997, by = 2))
to have one step every two years.
(I hope the rest is correct...)
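
For concreteness, a minimal sketch of the full Poisson fit (an
illustration on my part, assuming a person-period data frame 'data'
with one row per subject-year, an event indicator 'status' and a
covariate 'x1'; if the intervals had unequal lengths, one would add an
offset(log(time.at.risk)) term):

  # piecewise constant baseline hazard, one step every year:
  fit.pw  <- glm(status ~ x1 + factor(year),
                 family = poisson, data = data)
  # or one step every two years:
  fit.pw2 <- glm(status ~ x1 + cut(year, breaks = seq(1961, 1997, by = 2)),
                 family = poisson, data = data)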


Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France


----- Original Message ----- 
From: "Prof Brian Ripley" <ripley at stats.ox.ac.uk>
To: "Mayeul KAUFFMANN" <mayeul.kauffmann at tiscali.fr>
Cc: <r-help at stat.math.ethz.ch>
Sent: Friday, August 13, 2004 8:40 AM
Subject: Re: [R] How to use the whole dataset (including between events)
in Cox model (time-varying covariates) ?


> This is the consequence of the use of partial likelihood in the Cox
> model.
> You should read the literature on this point (for example, have you read
> Cox, 1972 and all its discussion, or Andersen, Borgan, Gill & Keiding?).
>
> It is not an R question.  You need to make more assumptions, such as a
> smooth baseline hazard, and you can always use parametric models and a
> full likelihood (but you may have to program them yourself).
>
>
> On Fri, 13 Aug 2004, Mayeul KAUFFMANN wrote:
>
> > Hello,
> >
> > coxph does not use any information that is in the dataset between
> > event times (or "death times"), since computation only occurs at
> > event times.
> >
> > For instance, removing observations when there is no event at that
> > time in the whole dataset does not change the results:
> > > set.seed(1)
> > > data <- as.data.frame(cbind(
> > +   start  = c(1:5, 1:5, 1:4),
> > +   stop   = c(2:6, 2:6, 2:5),
> > +   status = c(rep(0, 7), 1, rep(0, 5), 1),
> > +   id     = c(rep(1, 5), rep(2, 5), rep(3, 4)),
> > +   x1     = rnorm(14)))
> > > data
> >    start stop status id         x1
> > 1      1    2      0  1 -0.6264538
> > 2      2    3      0  1  0.1836433
> > 3      3    4      0  1 -0.8356286
> > 4      4    5      0  1  1.5952808
> > 5      5    6      0  1  0.3295078
> > 6      1    2      0  2 -0.8204684
> > 7      2    3      0  2  0.4874291
> > 8      3    4      1  2  0.7383247
> > 9      4    5      0  2  0.5757814
> > 10     5    6      0  2 -0.3053884
> > 11     1    2      0  3  1.5117812
> > 12     2    3      0  3  0.3898432
> > 13     3    4      0  3 -0.6212406
> > 14     4    5      1  3 -2.2146999
> > > coxph(Surv(start, stop, status) ~ cluster(id) + x1,
> > +       data = data, robust = TRUE)
> > > coxph(Surv(start, stop, status) ~ cluster(id) + x1,
> > +       data = subset(data, stop %in% 4:5), robust = TRUE)
> > # the same !!! (except n)
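> >
> > To make the comparison explicit (merely restating the two calls
> > above with named fits):
> > > fit.all <- coxph(Surv(start, stop, status) ~ cluster(id) + x1,
> > +                  data = data, robust = TRUE)
> > > fit.sub <- coxph(Surv(start, stop, status) ~ cluster(id) + x1,
> > +                  data = subset(data, stop %in% 4:5), robust = TRUE)
> > > all.equal(coef(fit.all), coef(fit.sub))   # TRUE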
> >
> > First, some data is lost.
> > Second, this loss could be an important problem when there is a
> > time-varying covariate that changes more quickly than the frequency
> > of events.
> > Specifically, I have a covariate which has low values most of the
> > time. It sometimes jumps to high values, which is hypothesized to
> > greatly increase the risk of an event.
> > With rare events, the effect of this covariate will only be measured
> > at event times. Chances are that the only time such a covariate is
> > recorded at a high level, the individual for which it is measured as
> > being high is having an event.
> > This may bias the estimated coefficient.
> >
> > Here is my question:
> > How can I make full use of the dataset?
> >
> > (That is: how to have really _time-varying_ covariates (even if they
> > change step by step, not continuously), not covariates whose changes
> > are measured only at event times.)
> >
> > Ideally, the full dataset would be used to estimate the parameters,
> > or at least to estimate the standard errors of the estimated
> > parameters.
> > Any ideas???
> > .
> > .
> > .
> >
> > A second best (which might require less work) would be to use all
> > the dataset to assess the predictive power of the model.
> >
> > Maybe by using the expected number of events for an individual over
> > the time interval during which they were observed to be at risk
> > > predict(coxfit, type = "expected")
> > and comparing it with the observed number of events.
> > (Does this use all the data and take into account the whole baseline
> > hazard, even between events?)
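> >
> > For instance, a minimal sketch (an illustration of mine, fitting
> > 'coxfit' on the full 'data' above): sum the expected counts by
> > individual and put them next to the observed counts.
> > > coxfit <- coxph(Surv(start, stop, status) ~ cluster(id) + x1,
> > +                 data = data, robust = TRUE)
> > > expected <- tapply(predict(coxfit, type = "expected"), data$id, sum)
> > > observed <- tapply(data$status, data$id, sum)
> > > cbind(observed, expected)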
> >
> >
> > Or, if not, following Brian D. Ripley's suggestion about the
> > baseline hazard: "As an approximation you can smooth the fitted
> > baseline cumulative hazard (e.g. by package pspline) and ask for its
> > derivative"
> > (https://stat.ethz.ch/pipermail/r-help/2004-July/052376.html)
> >
> > the following code could be used to estimate (and plot) a smooth
> > baseline hazard:
> > > library(pspline)
> > > bh <- basehaz(coxfit)   # columns: hazard (cumulative) and time
> > > t <- seq(min(data$start), max(data$stop), length = 100)
> > > lines(t, predict(sm.spline(x = bh$time, y = bh$hazard, norder = 2),
> > +                  t, 1))   # first derivative of the smoothed cum. hazard
> > # There is a problem with this code: one should add the constraint
> > # that the baseline hazard cannot be negative.
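> >
> > A crude workaround (my own assumption, not a full solution) is to
> > truncate the smoothed derivative at zero before plotting:
> > > h0 <- pmax(predict(sm.spline(x = bh$time, y = bh$hazard, norder = 2),
> > +                    t, 1), 0)
> > > lines(t, h0)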
> >
> > The following computes the parametric part of the Cox model:
> > > risk <- predict(coxfit, type = 'risk')   # gives exp(X'b)
> >
> > Something like
> > > baseline.hazard * risk
> > would give the true risk at any time (but it would probably be much
> > harder to compute), which could help assess the predictive power of
> > the model.
> > (Still a lot of work.)
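> >
> > As a sketch of that product (again assuming the smoothed 'bh' above;
> > evaluating at each row's stop time is an arbitrary choice of mine):
> > > h.at.stop <- pmax(predict(sm.spline(x = bh$time, y = bh$hazard,
> > +                                     norder = 2), data$stop, 1), 0)
> > > h.at.stop * risk   # estimated hazard for each row of the data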
> >
> > Thanks in advance for any help or comment.
>
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>



