R: Checks of a survival data set

survcheck {survival}

R Documentation

Checks of a survival data set

Description

Perform a set of consistency checks on survival data

Usage

survcheck(formula, data, subset, na.action, id, istate, istate0="(s0)", 
timefix=TRUE,...)

Arguments

formula

a model formula with a Surv object as the response

data

data frame in which to find the id, istate and formula variables

subset

expression indicating which subset of the rows of data should be used in the fit. All observations are included by default.

na.action

a missing-data filter function. This is applied to the model.frame after any subset argument has been used. Default is options()$na.action.

id

an identifier that labels unique subjects

istate

an optional vector giving the current state at the start of each interval

istate0

default label for the initial state of each subject (at their first interval) when istate is missing

timefix

process times through the aeqSurv function to eliminate potential roundoff issues.

...

other arguments, which are ignored (but won't give an error if someone added weights for instance)

Details

This routine will examine a multi-state data set for consistency of the data. The basic rules are that if a subject is at risk they have to be somewhere, cannot be in two places at once, and should make sensible transitions from state to state. It reports the number of instances of the following conditions:

overlap: two observations for the same subject that overlap in time, e.g. intervals of (0, 100) and (90, 120). If y is simple (time, status) survival then intervals implicitly start at 0, so in that case any duplicate identifiers will generate an overlap.
gap: one or more gaps in a subject's timeline, where they are in the same state at their return as when they left.
jump: a hole in a subject's timeline, where they are in one state at the end of the prior interval, but in a new state at the start of the subsequent interval.
teleport: two adjacent intervals for a subject, with the first interval ending in one state and the subsequent interval starting in another. They have instantaneously changed states in 0 units of time.
duplicate: not currently used

The total number of occurrences of each is reported in the flags vector. Optional components give the location and identifiers of the flagged observations. The Surv function has already flagged any 0 length intervals as errors.
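As an illustration of the conditions listed above, the following is a minimal sketch on made-up data. The data frame toy, its column names (tstart, tstop, event, cur) and the state labels are all hypothetical; each of the four subjects is constructed so that it should trip one of the checks.

```r
library(survival)

# Hypothetical toy data (not from the package): one problem per subject.
#   id 1: (0,100] and (90,120] overlap in time
#   id 2: hole between 50 and 60, same state "a" on return
#   id 3: hole between 50 and 60, leaves in "b" but returns in "c"
#   id 4: adjacent intervals, ends in "b" but next row starts in "c"
toy <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3, 4, 4),
  tstart = c(0, 90, 0, 60, 0, 60, 0, 50),
  tstop  = c(100, 120, 50, 80, 50, 80, 50, 80),
  event  = factor(c("none", "b", "none", "b", "b", "none", "b", "none"),
                  levels = c("none", "b", "c")),  # first level = no event
  cur    = c("a", "a", "a", "a", "a", "c", "a", "c")  # state at interval start
)

fit <- survcheck(Surv(tstart, tstop, event) ~ 1, data = toy,
                 id = id, istate = cur)

fit$flags     # counts per check
names(fit)    # optional components appear only for checks that fired
fit$overlap   # rows and ids involved in overlaps, if any
```

The event column records the state entered at the end of each interval, with the first factor level meaning that nothing happened, while cur plays the role of istate.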

One important caveat is that survcheck does not deal with reuse of an id value, for instance a multi-institutional data set where the same subject identifier happens to have been used for two different subjects in two different institutions. The routine is likely to generate a "false positive" error in this case, but this is simply unavoidable. Since the routine is used internally by survfit, coxph, etc., the same errors will appear in other routines in the survival package.
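One common workaround, sketched here under the assumption that the data carry an institution label, is to build a composite identifier before calling the routine. The data frame multi and its columns site, id, time, status and uid are hypothetical.

```r
library(survival)

# Hypothetical two-site data in which each site numbers subjects from 1.
multi <- data.frame(
  site   = c("A", "A", "B", "B"),
  id     = c(1, 2, 1, 2),
  time   = c(5, 8, 3, 9),
  status = c(1, 0, 1, 1)
)

# With the raw id, rows from different sites look like one subject, so the
# implicit (0, time] intervals overlap.
raw <- survcheck(Surv(time, status) ~ 1, data = multi, id = id)
raw$flags

# A composite key keeps subjects from different sites apart.
multi$uid <- paste(multi$site, multi$id, sep = ":")
fixed <- survcheck(Surv(time, status) ~ 1, data = multi, id = uid)
fixed$flags
```

The same composite identifier would then be passed as id to survfit or coxph.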

Value

a list with components

states

the vector of possible states, a union of what appears in the Surv object and istate, with initial states first

transitions

a matrix giving the count of transitions from one state to another

statecount

table of the number of visits per state, e.g., 18 subjects had 2 visits to the "infection" state

flags

a vector giving the counts of each check

istate

a constructed istate that best satisfies all the checks

overlap

a list with the row number and id of overlaps (not present if there are no overlaps)

gaps

a list with the row number and id of gaps (not present if there are no gaps)

teleport

a list with the row number and id of inconsistent rows (not present if there are none)

jumps

a list with the row number and id of jumps (not present if there are no jumps)
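Because the detail components are only present when the corresponding check fires, code that inspects the result may want to test for NULL. The following sketch uses a small, hypothetical data set (clean) that is intended to be free of problems.

```r
library(survival)

# Hypothetical clean data: subject 1 moves to "b" and then "c",
# subject 2 is followed to time 25 with no event.
clean <- data.frame(
  id     = c(1, 1, 2),
  tstart = c(0, 10, 0),
  tstop  = c(10, 30, 25),
  event  = factor(c("b", "c", "none"), levels = c("none", "b", "c"))
)

chk <- survcheck(Surv(tstart, tstop, event) ~ 1, data = clean, id = id)

chk$states       # initial state label first, then the event states
chk$transitions  # counts of transitions from one state to another
chk$statecount   # number of visits per state
chk$flags        # counts per check

# Optional components are simply absent when nothing was flagged.
if (is.null(chk$overlap))  cat("no overlaps reported\n")
if (is.null(chk$teleport)) cat("no teleports reported\n")
```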

Note

For data sets with time-dependent covariates, a given subject will often have intermediate rows with a status of ‘no event at this time’, coded as the first level of the factor variable in the Surv() call. For instance, consider a subject who started in state 'a' at time 0, transitioned to state 'b' at time 10, had a covariate x change from 135 to 156 at time 20, and made a final transition to state 'c' at time 30. The response would be Surv(c(0, 10, 20), c(10, 20, 30), c(b, censor, c)), where the state variable is a factor with levels of censor, a, b, c. The state variable records changes in state, and there was no change at time 20. The istate variable would be (a, b, b); it contains the current state, and the value is unchanged when status = censor. (It behaves like a tdc variable from tmerge.)

The intermediate time above is not actually censoring, i.e., a point at which follow-up for the observation ceases. The 'censor' label is traditional, but 'none' may be a more accurate choice.

When there are intermediate observations istate is not simply a lagged version of the state, and may be more challenging to create. One approach is to let survcheck do the work: call it with an istate argument that is correct for the first row of each subject, or no istate argument at all, and then insert the returned value into a data frame.
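A minimal sketch of that approach, using the single subject described in this Note. The data frame d and its column names are hypothetical, and istate0 = "a" is used here to supply the known starting state in place of an istate vector.

```r
library(survival)

# The subject from the Note: a -> b at time 10, covariate change at 20
# (no state change), b -> c at time 30.
d <- data.frame(
  id     = c(1, 1, 1),
  tstart = c(0, 10, 20),
  tstop  = c(10, 20, 30),
  x      = c(135, 135, 156),
  event  = factor(c("b", "censor", "c"),
                  levels = c("censor", "a", "b", "c"))
)

# No istate argument: every subject is assumed to start in istate0.
chk <- survcheck(Surv(tstart, tstop, event) ~ 1, data = d,
                 id = id, istate0 = "a")

# Store the constructed current-state variable alongside the data.
d$istate <- chk$istate
d
```

The stored column can then be passed as the istate argument in later calls.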

[Package survival version 3.8-3 Index]