survfit.formula {survival}  R Documentation 
Compute a Survival Curve for Censored Data
Description
Computes an estimate of a survival curve for censored data using the Aalen-Johansen estimator. For ordinary (single event) survival this reduces to the Kaplan-Meier estimate.
Usage
## S3 method for class 'formula'
survfit(formula, data, weights, subset, na.action,
stype=1, ctype=1, id, cluster, robust, istate, timefix=TRUE,
etype, model=FALSE, error, entry=FALSE, time0=FALSE, ...)
Arguments
formula 
a formula object, which must have a Surv object as the response on the left of the ~ operator.

data 
a data frame in which to interpret the variables named in the formula, weights, subset, id, cluster and istate arguments.

weights 
The weights must be nonnegative and it is strongly recommended that
they be strictly positive, since zero weights are ambiguous, compared
to use of the subset argument.
subset 
expression saying that only a subset of the rows of the data should be used in the fit. 
na.action 
a missing-data filter function, applied to the model frame, after any subset argument has been used.

stype 
the method to be used for estimation of the survival curve: 1 = direct, 2 = exp(cumulative hazard). 
ctype 
the method to be used for estimation of the cumulative hazard: 1 = Nelson-Aalen formula, 2 = Fleming-Harrington correction for tied events. 
id 
identifies individual subjects, when a given person can have multiple lines of data. 
cluster 
used to group observations for the infinitesimal jackknife variance estimate, defaults to the value of id. 
robust 
logical, should the function compute a robust variance. For multistate survival curves or interval censored data this is true by default. For single state data see details, below. 
istate 
for multistate models, identifies the initial state of
each subject or observation. This also forces time0 = TRUE.
timefix 
process times through the aeqSurv function to eliminate potential roundoff issues.
etype 
a variable giving the type of event. This has been superseded by multistate Surv objects and is deprecated; see example below. 
model 
include a copy of the model frame in the output 
error 
this argument is no longer used 
entry 
if TRUE, the output will contain n.enter, the number of observations entering the risk set at each time.
time0 
if TRUE, the output will include estimates at the starting point of the curve or ‘time 0’. See discussion below. 
... 
additional arguments passed to internal functions called by survfit,
such as the influence and type arguments mentioned in Details.

Details
If there is a data argument, then variables in the formula, weights,
subset, id, cluster and istate arguments will be searched for in that
data set.
The routine returns both an estimated probability in state and an
estimated cumulative hazard.
For simple survival the probability in state = probability alive, i.e.,
the estimated survival. For multistate it will be a matrix with one
row per time and one column per state; each row sums to 1.
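A minimal sketch of the multistate case, assuming the survival package is attached. The data frame toy and its columns are made up for illustration and are not part of the package; the component names pstate and states are those of a multistate survfit object.

```r
library(survival)

# Hypothetical toy data: the first factor level ("censor") marks censoring,
# the remaining levels are the competing event types.
toy <- data.frame(
  time  = c(2, 3, 5, 6, 8, 9, 11, 12),
  event = factor(c("relapse", "censor", "death", "relapse",
                   "censor", "death", "relapse", "censor"),
                 levels = c("censor", "relapse", "death"))
)

fit <- survfit(Surv(time, event) ~ 1, data = toy)

fit$states            # names of the states
dim(fit$pstate)       # one row per time, one column per state
rowSums(fit$pstate)   # each row should sum to 1
```

For a single state fit the corresponding components are surv and cumhaz.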
The cumulative hazard estimate is the Nelson-Aalen (NA) estimate or the
Fleming-Harrington (FH) estimate; the latter includes a correction for
tied event times. The probability in state can be estimated
either using the exponential of the cumulative hazard, or as a direct
estimate using the Aalen-Johansen (AJ) approach.
For single state data the AJ estimate reduces to the Kaplan-Meier and
the probability in state to the survival curve;
for competing risks data the AJ reduces to the cumulative incidence (CI)
estimator.
For backward compatibility the type argument can be used instead of
stype and ctype.
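The two ways of estimating a single state curve can be compared directly. This is a sketch using the aml data from the package; the object names km and fh are arbitrary. For ordinary survival, "exponential of the cumulative hazard" is taken here to mean exp(-H(t)).

```r
library(survival)

# Direct (Kaplan-Meier) estimate with a Nelson-Aalen cumulative hazard
km <- survfit(Surv(time, status) ~ 1, data = aml, stype = 1, ctype = 1)

# Survival computed as exp(-cumulative hazard), same Nelson-Aalen hazard
fh <- survfit(Surv(time, status) ~ 1, data = aml, stype = 2, ctype = 1)

# With stype = 2 the survival is the exponential of minus the hazard
all.equal(fh$surv, exp(-fh$cumhaz))

# Side by side: the exponential form is never below the direct estimate
head(cbind(time = km$time, direct = km$surv, exp.hazard = fh$surv))
```

The two curves are usually close; they differ most where few subjects remain at risk.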
When the data set includes left censored or interval censored data (or both),
then the EM approach of Turnbull is used to compute the overall curve.
Currently this algorithm is very slow, only applies to simple survival
(not multistate), and defaults to a robust variance. Other R
packages are available which implement the iterative convex minorant
(ICM) algorithm for
interval censored data, which is much faster than Turnbull's method.
Based on Sun (2001) the robust variance may be preferred, as the naive estimate
ignores the estimation of the weights. The standard estimate can be
obtained with robust = FALSE.
Without interval or left censored data (the usual case) the
underlying algorithm for the routine is the Aalen-Johansen
estimate, of which the Kaplan-Meier (for single outcome data) and the
cumulative incidence (CI) estimate (for competing risks) are each a special
case. For multistate, the estimate can be written as
p(t_0) H(t_1) H(t_2) \ldots
where p(t_0) is the prevalence vector across the states at starting
point t_0; t_1, t_2, \ldots are the times at which events (transitions
between states) occur; and H are square transition matrices with a row
for each state.
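The product form can be checked by hand on a small example. The sketch below assumes untied, sorted times and a made-up competing-risks data frame toy (hypothetical, not part of the package); it is meant as an illustration of the formula and not as the routine's actual implementation.

```r
library(survival)

# Hypothetical toy data with distinct, sorted times and no ties.
toy <- data.frame(
  time  = 1:8,
  event = factor(c("A", "censor", "B", "A", "censor", "B", "A", "censor"),
                 levels = c("censor", "A", "B"))
)
fit <- survfit(Surv(time, event) ~ 1, data = toy)

states <- c("(s0)", "A", "B")
p <- c(1, 0, 0)                       # p(t_0): everyone starts in "(s0)"
manual <- matrix(0, nrow(toy), 3, dimnames = list(NULL, states))
for (i in seq_len(nrow(toy))) {
  n_risk <- nrow(toy) - i + 1         # number at risk just before time i
  H <- diag(3)                        # no transition: identity matrix
  if (toy$event[i] != "censor") {
    j <- match(as.character(toy$event[i]), states)
    H[1, 1] <- 1 - 1 / n_risk         # stay in the initial state
    H[1, j] <- 1 / n_risk             # move to state A or B
  }
  p <- as.vector(p %*% H)             # p(t_0) H(t_1) ... H(t_i)
  manual[i, ] <- p
}

# Align by time and state name before comparing with survfit's result.
by_hand <- manual[match(fit$time, toy$time), ]
by_fit  <- fit$pstate[, match(states, fit$states)]
all.equal(by_hand, by_fit, check.attributes = FALSE)
```

Only the first row of each H differs from the identity here because A and B are absorbing states in a competing-risks setting.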
Starting point: When different subjects (id) start at different
time points, data using age as the time scale for instance,
deciding the default "time 0" can be complex. This value is the
starting point for the restricted mean estimate (area under the
curve), the initial prevalence p0, and the first
row of output if time0 = TRUE. The order of the decision is
1. For a 2 column response (simple survival or competing risks) use the minimum of 0 and the smallest time value (times can be negative).
2. If all subjects start in the same state, start at the same time, or if p0 is specified, use the minimum observed starting time. If there is no istate argument all observations are assumed to start in a state "(s0)".
3. Use the minimum observed event time, if the number at risk at that time is >0 for every curve that will be created.
4. Use the minimum event time for each curve, separately.
The last two above are a failsafe to prevent the routine from basing
the initial prevalence of the states on none or only a handful of
observations. That does not mean such curves
will be scientifically sensible: when using an age scale the user may wish
to specify an explicit starting time.
If time0 = TRUE the first row of output for each curve will be
at the starting time,
otherwise at the first event time (for each curve separately).
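A short sketch of the effect of time0 for a 2 column response, using the aml data. It assumes a version of the package whose survfit accepts the time0 argument shown in Usage; the object names are arbitrary.

```r
library(survival)

fit  <- survfit(Surv(time, status) ~ 1, data = aml)
fit0 <- survfit(Surv(time, status) ~ 1, data = aml, time0 = TRUE)

# Without time0 the output starts at the first observed time;
# with time0 = TRUE an extra first row at the starting point is added.
head(cbind(time = fit$time,  surv = fit$surv),  3)
head(cbind(time = fit0$time, surv = fit0$surv), 3)
```

For these data all times are positive, so rule 1 of the list above gives a starting point of 0.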
Robust variance:
If robust is TRUE, or for multistate curves, then the standard
errors of the results will be based on an infinitesimal jackknife (IJ)
estimate, otherwise the standard model based estimate will be used.
For single state curves, the default for robust will be TRUE
if one of the following holds: there is a cluster argument, there
are non-integer weights, or there is an id statement
and at least one of the id values has multiple events; it is FALSE otherwise.
The default represents our best guess about when one would most
often desire a robust variance.
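The sketch below requests the IJ variance explicitly for a single state curve. The column subject is a hypothetical identifier added for illustration (each row of aml is treated as its own subject); it is not part of the aml data.

```r
library(survival)

# Add a hypothetical subject identifier, one per row.
d <- transform(aml, subject = seq_len(nrow(aml)))

# Model based standard errors (the default for these data)
fit_model  <- survfit(Surv(time, status) ~ 1, data = d)

# Infinitesimal jackknife standard errors, grouped by subject
fit_robust <- survfit(Surv(time, status) ~ 1, data = d,
                      id = subject, robust = TRUE)

# The point estimate is unchanged; only the standard errors may differ.
all.equal(fit_model$surv, fit_robust$surv)
head(cbind(model = fit_model$std.err, robust = fit_robust$std.err))
```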
When there are non-integer case weights and (time1, time2) survival
data the routine is at an impasse: a robust variance likely is called
for, but requires either id or cluster information to be
done correctly; it will default to robust = FALSE if they are not present.
With the IJ estimate, the leverage values themselves can be returned
as an array using the influence argument.
Be forewarned that this array can be huge. Post fit influence using the
resid method is more flexible and would normally be preferred,
in particular to get influence at only a select set of time points.
The influence option is currently used mostly in the package's
validity checks.
Let U(t) be the matrix of IJ values at time t, which has
one row per observation, one column per state. The robust variance
computation uses the collapsed weighted matrix rowsum(wU, cluster),
where w is the vector of weights and cluster is the grouping (most often
the id). The result for each curve is an array with dimensions
(number of clusters, number of states, number of times), or a matrix
for single state data. When there are multiple curves, the
influence is a list with one element per curve.
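The collapsing step can be illustrated in base R. The values in U, w and cluster below are invented for illustration and do not come from any fit. The final line assumes that the variance is the column sum of squared collapsed values, which is the usual form of an infinitesimal jackknife variance.

```r
# Hypothetical IJ values at one time point: 4 observations, 2 states
U <- matrix(c( 0.10, -0.20,  0.05,  0.05,
              -0.10,  0.20, -0.05, -0.05), nrow = 4)
w       <- c(1, 1, 2, 1)             # case weights, one per observation
cluster <- c("a", "a", "b", "b")     # grouping, most often the id

# w * U scales row i by w[i]; rowsum() then adds rows within a cluster
collapsed <- rowsum(w * U, cluster)
collapsed

# Assumed form of the robust variance for each state at this time
colSums(collapsed^2)
```

Collapsing before squaring is what lets correlated rows from the same cluster contribute jointly to the variance.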
Value
an object of class "survfit".
See survfit.object for details. Some of the methods defined for
survfit objects are print, plot, lines, points and residuals.
References
Dorey, F. J. and Korn, E. L. (1987). Effective sample sizes for confidence intervals for survival probabilities. Statistics in Medicine 6, 679-87.
Fleming, T. H. and Harrington, D. P. (1984). Nonparametric estimation of the survival distribution in censored data. Comm. in Statistics 13, 2469-86.
Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.
Kyle, R. A. (1997). Monoclonal gammopathy of undetermined significance and solitary plasmacytoma. Implications for progression to overt multiple myeloma. Hematology/Oncology Clinics N. Amer. 11, 71-87.
Link, C. L. (1984). Confidence intervals for the survival function using Cox's proportional hazards model with covariates. Biometrics 40, 601-610.
Sun, J. (2001). Variance estimation of a survival function for interval-censored data. Stat Med 20, 1949-1957.
Turnbull, B. W. (1974). Nonparametric estimation of a survivorship function with doubly censored data. J Am Stat Assoc, 69, 169-173.
See Also
survfit.coxph for survival curves from Cox models,
survfit.object for a description of the components of a survfit object,
print.survfit, plot.survfit, lines.survfit, residuals.survfit, coxph, Surv.
Examples
# fit a Kaplan-Meier and plot it
fit <- survfit(Surv(time, status) ~ x, data = aml)
plot(fit, lty = 2:3)
legend(100, .8, c("Maintained", "Nonmaintained"), lty = 2:3)
# fit a Cox proportional hazards model and plot the
# predicted survival for a 60 year old
fit <- coxph(Surv(futime, fustat) ~ age, data = ovarian)
plot(survfit(fit, newdata=data.frame(age=60)),
     xscale=365.25, xlab = "Years", ylab="Survival")
# Here is the data set from Turnbull
#  There are no interval censored subjects, only left-censored (status=2),
#  right-censored (status 0) and observed events (status 1)
#
#                             Time
#                         1    2   3   4
# Type of observation
#           death        12    6   2   3
#          losses         3    2   0   3
#      late entry         2    4   2   5
#
tdata <- data.frame(time  =c(1,1,1,2,2,2,3,3,3,4,4,4),
                    status=rep(c(1,0,2),4),
                    n     =c(12,3,2,6,2,4,2,0,2,3,3,5))
fit <- survfit(Surv(time, time, status, type='interval') ~1,
               data=tdata, weights=n)
#
# Three curves for patients with monoclonal gammopathy.
# 1. KM of time to PCM, ignoring death (statistically incorrect)
# 2. Competing risk curves (also known as "cumulative incidence")
# 3. Multistate, showing Pr(in each state, at time t)
#
fitKM <- survfit(Surv(stop, event=='pcm') ~1, data=mgus1,
                 subset=(start==0))
fitCR <- survfit(Surv(stop, event) ~1,
                 data=mgus1, subset=(start==0))
fitMS <- survfit(Surv(start, stop, event) ~ 1, id=id, data=mgus1)
## Not run:
# CR curves show the competing risks
plot(fitCR, xscale=365.25, xmax=7300, mark.time=FALSE,
col=2:3, xlab="Years post diagnosis of MGUS",
ylab="P(state)")
lines(fitKM, fun='event', xmax=7300, mark.time=FALSE,
conf.int=FALSE)
text(3652, .4, "Competing risk: death", col=3)
text(5840, .15,"Competing risk: progression", col=2)
text(5480, .30,"KM:prog")
## End(Not run)