[R] Survival analysis and predict time-to-death
Therneau, Terry M., Ph.D.
therneau at mayo.edu
Tue Aug 18 15:19:29 CEST 2015
I read this list a day late as a digest so my answers are rarely the first. (Which is
nice as David W answers most of the survival questions for me!)
What you are asking is reasonable, and in fact is common practice in the realm of
industrial reliability, e.g., Meeker and Escobar, Statistical Methods for Reliability
Analysis. Extrapolation of the survival curve to obtain the mean and percentiles of the
lifetime distribution for some device (e.g. a washing machine) is their bread and butter,
used for instance to determine the right size for an inventory of spare parts. For most
of us on this list who do medical statistics and live in the Kaplan-Meier/ Cox model world
the ideas are uncommon. I was lucky enough to sit through one of Bill Meeker's short
courses and retain some (minimal) memory of it.
1. You are correct that parametric models are essential. If the extrapolation is
substantial (30% or more censored, say), then the choice of distribution can be critical.
If failure is due to repeated insult, e.g., the multi-hit model, then Weibull tends to
be preferred; if it is from degradation, e.g., flexing of a diaphram, then the log-normal.
Beyond this you need more guidance than mine.
2. The survreg routine assumes that log(y) ~ covariates + error. For a log-normal
distribion the error is Gaussian and thus the predict(fit, type='response') will be
exp(predicted mean of log time), which is not the predicted mean time. For Weibull the
error dist is asymmetric so things are more muddy. Each is the MLE prediction for the
subject, just not interpretable as a mean. To get the actual mean you need to look up the
formulas for Weibull and/or lognormal in a textbook, and map from the survreg
parameterization to whatever one the textbook uses. The two parameterizations are never
the same.
3. Another option is predicted quantiles. ?predict.survreg shows how to get the entire
survival curve. The mean can be obtained as the area under the survival curve. Relevant
to your question, the expected time remaining for a subject still alive at time =10, say,
is integral(S(t), from 10 to infin) / S(10), where S is the survival curve. You can also
read off quantiles of the expected remaining life.
Terry Therneau
(author of the survival package)
On 08/18/2015 05:00 AM, r-help-request at r-project.org wrote:
> Dear All,
>
> I would like to build a model, based on survival analysis on some data, that
> is able to predict the /*expected time until death*/ for a new data
> instance.
>
> Data
> For each individual in the population I have the, for each unit of time, the
> status information and several continuous covariates for that particular
> time. The data is right censored since at the end of the time interval
> analyzed, instances could be still alive and die later.
>
> Model
> I created the model using R and the survreg function:
>
> lfit <- survreg(Surv(time, status) ~ X)
>
> where:
> - time is the time vector
> - status is the status vector (0 alive, 1 death)
> - X is a bind of multiple vectors of covariates
>
> Predict time to death
> Given a new individual with some covariates values, I would like to predict
> the estimated time to death. In other words, the number of time units for
> which the individual will be still alive till his death.
>
> I think I can use this:
>
> ptime <- predict(lfit, newdata=data.frame(X=NEWDATA), type='response')
>
> Is that correct? Am I going to get the expected-time-to-death that I would
> like to have?
>
> In theory, I could provide also the time information (the time when the
> individual has those covariates values), should I simply add that in the
> newdata:
>
> ptime <- predict(lfit, newdata=data.frame(time=TIME, X=NEWDATA),
> type='response')
>
> Is that correct? Is this going to improve the prediction? (for my data, the
> time already passed should be an important variable).
>
> Any other suggestions or comments?
>
> Thank you!
More information about the R-help
mailing list