[R] Using coxph with Gompertz-distributed survival data.
therneau at mayo.edu
Fri Feb 5 17:48:31 CET 2010
Before being helpful let me raise a couple of questions:
1. "I know I'm looking at longevity data (which is believed to have a
Gompertz distribution for mammals dying from 'old age')".
I'm not as convinced. The Gompertz is a nice story, but it is
confounded by individual risk or 'frailty'. But continue.
2. "the mortality rate will be higher for 'a' at younger ages, higher for
'b' at older ages, and the assumption of the Cox Proportional Hazards
model is violated a priori, isn't it?"
That is correct. So why exactly are you using coxph to fit the data?
3. "Yet I found plenty of Gompertz parameter values that differ, and
lead to differences in survival times detectable by coxph, yet pass the
cox.zph test. Should I assume that cox.zph is insufficiently
The Cox model fits an average hazard ratio over time. If
the data satisfy the proportional hazards model, then this is all you
need -- this single number tells you everything. If the data do not,
this does not mean that such an average hazard ratio is invalid; it tells
you that this average is not the whole story and coxph is an
oversimplification. I view this as similar to the fact that if a
distribution is Gaussian then (mean, var) is sufficient: everything
that you ever wanted to know about the data (percentiles, outliers, ...)
is summed up in those two values. If it's not Gaussian it does not
follow that the mean is worthless, but it isn't a complete story.
If you pick your parameters so that the change in hazard ratio is "not
very large", of course cox.zph will not see it. That's also the case
where an overall average is probably a pretty good summary.
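The situation in points 2 and 3 can be tried out directly. The following is a minimal sketch, not the original poster's code: the helper rgompertz(), the sample size, and the parameter values are all assumptions chosen so that group 'a' has the higher hazard at younger ages and group 'b' at older ages. It fits the single "average" hazard ratio with coxph and then asks cox.zph whether the non-proportionality is visible.

```r
library(survival)

# Hypothetical helper: draw from a Gompertz hazard h(t) = b * cc^t by
# inverting the cumulative hazard H(t) = b * (cc^t - 1) / log(cc).
rgompertz <- function(n, b, cc) {
  e <- rexp(n)                          # unit-exponential draws of H(T)
  log(1 + e * log(cc) / b) / log(cc)
}

set.seed(1)
n <- 500                                # assumed sample size per group
group <- rep(c("a", "b"), each = n)
# Assumed parameters: the hazard ratio b vs a is 0.5 * exp(0.01 * t),
# so the two hazards cross near t = 69.
age <- c(rgompertz(n, b = 2e-4, cc = exp(0.10)),
         rgompertz(n, b = 1e-4, cc = exp(0.11)))

fit <- coxph(Surv(age) ~ group)         # one hazard ratio averaged over time
zph <- cox.zph(fit)                     # test of proportional hazards

print(summary(fit)$conf.int)
print(zph)
```

Moving the two sets of parameters closer together or further apart shows how large the change in hazard ratio has to be before cox.zph picks it up at a given sample size.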
4. "coxph(Surv(age) ~ group + group:age)"
This is not how a change in hazard ratio over time is approached. The
program should give an error. For one, why do you assume the change is
linear in time? This is rather rare. You might look at the timedep
5. Some actual advice -- if you think it is Gompertzian why not fit a
I don't see anything in CRAN to directly fit Gompertz, but the note
below talks about how to do so approximately with survreg. It's a note
to myself of something to add to the survival package documentation, not
yet done, and to my embarrassment the file has a time stamp in 1996. Ah
-------------- next part --------------
My document "A Package for Survival Analysis in S" contains statements
about how to fit Gompertz and Rayleigh distributions with the survreg
routine. Nicholas Brouard, in a recent query to this group, quite correctly
states that "Therneau's documentation is a little elliptic for people not
so familiar with extreme value theory".
I've spent the last day trying to work out concrete examples of the fits.
Let me start by saying that I now think my paper's remarks were overly
optimistic. This note will try to indicate why. I will use some "TeX"
notation below: \alpha, \beta, etc for Greek letters.
Hazard functions:
   Weibull:        p*(\lambda)^p * t^(p-1)
   Extreme value:  (1/ \sigma) * exp( (t- \eta)/ \sigma)
   Rayleigh:       a + bt
   Gompertz:       b * c^t
   Makeham:        a + b* c^t
The Makeham hazard seems to fit human mortality experience beyond
infancy quite well, where "a" is a constant mortality which is
independent of the health of the subject (accidents, homicide, etc)
and the second term models the Gompertz assumption that "the average
exhaustion of a man's power to avoid death is such that at the end
of equal infinitely small intervals of time he lost equal portions of
his remaining power to oppose destruction which he had at the
commencement of these intervals". For older ages "a" is a negligible
portion of the death rate and the Gompertz model holds.
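As a small numeric illustration of that last statement, the sketch below codes the Gompertz and Makeham hazards from the list above and prints the share of the total hazard contributed by the constant term "a" at a few ages. The parameter values are assumptions picked only for illustration, not estimates from any data set.

```r
# Gompertz and Makeham hazards; the argument is named cc to avoid
# masking R's c() function.
gompertz_hazard <- function(t, b, cc) b * cc^t
makeham_hazard  <- function(t, a, b, cc) a + gompertz_hazard(t, b, cc)

# Assumed, purely illustrative parameter values.
a  <- 5e-4
b  <- 1e-4
cc <- exp(0.1)

ages    <- c(20, 40, 60, 80)
share_a <- a / makeham_hazard(ages, a, b, cc)  # fraction of hazard due to "a"
print(round(share_a, 4))
```

With these values the constant term is a sizeable part of the hazard at age 20 and well under one percent of it by age 80.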
The fitting routine depends on the decomposition Y = \eta + \sigma W, where
\eta = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + ... is the fitted linear
predictor and W is a distribution in "standard" form. For instance, if
the response time t is Weibull, then y = log(t) follows this with
\eta = -log(\lambda)
\sigma = 1/p
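A quick way to check this correspondence is by simulation. The sketch below assumes the Weibull hazard p*(\lambda)^p * t^(p-1), which matches R's rweibull() with shape = p and scale = 1/\lambda; for that parameterization the survreg intercept estimates -log(\lambda) and the survreg scale estimates 1/p. The values of p, \lambda and the sample size are arbitrary choices.

```r
library(survival)

set.seed(2)
p      <- 2        # assumed Weibull shape
lambda <- 0.05     # assumed Weibull rate; survival is exp(-(lambda * t)^p)
time   <- rweibull(2000, shape = p, scale = 1 / lambda)

# Intercept-only fit: log(time) = eta + sigma * W, all observations are events.
fit <- survreg(Surv(time) ~ 1, dist = "weibull")

eta_hat   <- unname(coef(fit))   # compare with -log(lambda)
sigma_hat <- fit$scale           # compare with 1 / p

print(c(eta_hat = eta_hat, minus_log_lambda = -log(lambda)))
print(c(sigma_hat = sigma_hat, one_over_p = 1 / p))
```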
1. The Weibull distribution with p=2 (sigma=.5) is the same as a Rayleigh
distribution with a=0. It is not, however, the most general form of a
Rayleigh distribution.
2. The (least) extreme value and Gompertz distributions have the same
hazard function, with
\sigma = 1/ log(c), and (1/ \sigma) * exp(-\eta/ \sigma) = b.
It would appear that the Gompertz can be fit with an identity link function
combined with the extreme value distribution. However, this ignores a
boundary restriction. If f(x; \eta, \sigma) is the extreme value distribution
with parameters \eta and \sigma, then the definition of the Gompertz density is
   g(x; \eta, \sigma) = 0                       x <  0
   g(x; \eta, \sigma) = c f(x; \eta, \sigma)    x >= 0
where c = exp(exp(-\eta / \sigma)), not to be confused with the c in the
Gompertz hazard, is the necessary constant so that g integrates
to 1. If \eta / \sigma is large and positive, so that c is close to 1, then
the correction term will be minimal and survreg should give a reasonable
answer. If not, the distribution
can't be fit, nor can it be made to easily conform to the general fitting
scheme of the program.
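Here is a sketch of the approximate fit in the favorable case, where the correction constant is close to 1. It simulates Gompertz times by inverting the cumulative hazard, then fits them with survreg using dist = "extreme" (no log transform of the response). The values of b, the Gompertz base (called cc in the code) and the sample size are assumptions; \eta is computed by matching the Gompertz hazard b * c^t to the extreme value hazard term by term.

```r
library(survival)

set.seed(3)
b  <- 1e-4                 # assumed Gompertz level
cc <- exp(0.1)             # assumed Gompertz base (the "c" in b * c^t)

# Match b * cc^t to (1/sigma) * exp((t - eta)/sigma).
sigma <- 1 / log(cc)
eta   <- -sigma * log(b * sigma)

# Draw Gompertz times: H(t) = b * (cc^t - 1) / log(cc), H(T) ~ Exp(1).
e    <- rexp(5000)
time <- log(1 + e * log(cc) / b) / log(cc)

# Truncation constant; it is close to 1 here, so ignoring it costs little.
correction <- exp(exp(-eta / sigma))

fit <- survreg(Surv(time) ~ 1, dist = "extreme")

print(c(eta = eta, eta_hat = unname(coef(fit))))
print(c(sigma = sigma, sigma_hat = fit$scale))
print(correction)
```

Raising b so that exp(-\eta / \sigma) is no longer near 0 is an easy way to see the fit drift away from the true values as the ignored truncation starts to matter.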
The Makeham distribution falls into the gamma family (equation 2.3 of
Kalbfleisch and Prentice, Survival Analysis), but with the same range
In summary, the Gompertz is a truncated form of the extreme value
distribution (Johnson, Kotz and Balakrishnan, Continuous Univariate
Distributions, section 22.8). If one ignores the truncation, i.e., assumes that
negative time values are possible, then it can be fit with survreg. My
original note seems to have been compounded of 3 errors: the -1 arises
from confusing the maximal extreme distribution (most common in theory books)
with the minimal extreme distribution (used in survreg), the log() term
was a typing mistake, and I never noticed the range restriction.
This is one of the few topics in the report without a worked example
as part of my test library (the Examples section of the package). The
replacement document, currently in early draft, is intended to have a worked
example for every claim and the code for that example in the appendix.
This will, hopefully, cure any other mistakes of this sort.