[R] How to simulate informative censoring in a Cox PH model?
Daniel Meddings
dpmeddings at gmail.com
Mon Aug 3 22:49:54 CEST 2015
Hi Greg
The copula concept seems an appealingly simple way of simulating event times
that are subject to informative censoring (in contrast to the double Cox
model approach I use). The correlation between the marginal uniform random
variables you describe reminded me that my approach should also induce this
correlation, just in a different way. Similarly, I should observe zero
correlation between the event times from my outcome model and the
independent censoring times. Unfortunately this was not the case - to cut a
long story short, I was inadvertently generating my independent censoring
times from a model that depended on covariates in the outcome model. This
now explains the mixed results I rather laboriously attempted to describe
previously.
Re-running some scenarios with my new error-free code, I can now clearly
observe the point you have been making: informative censoring only leads to
bias if the censoring model depends on covariates that are not in the
outcome model. Indeed, I can choose the common (to both models) treatment
effect to be vastly different (with all other effects the same) and see no
bias, yet small differences in the censoring model's Z effect (which is not
in the outcome model) lead to moderate biases.
I am still somewhat confused by the other approach to this problem that I
have seen in various journal articles, where authors assume an outcome
model for the censored subjects - i.e. an outcome model for the unobserved
event times. Under this approach the definition of informative censoring
appears to be that the observed and unobserved outcome models differ. This
approach also makes sense to me - censoring merely reduces the precision of
the parameter estimators, because fewer events are observed, but does not
lead to bias. However, the concept of correlated event and censoring times
does not even present itself here?
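Incidentally, for anyone following this thread, here is how I now picture
the copula suggestion in R. This is only a sketch under assumed values: the
dependence parameter rho and the Weibull shapes and scales are purely
illustrative, and I build a Gaussian copula by hand so no extra packages
are needed.

```r
## Sketch of the copula idea: a Gaussian copula built by hand.
## rho, the Weibull shapes and the scales are illustrative values only.
set.seed(42)
n   <- 10000
rho <- 0.6                                # dependence level of the copula

## Correlated standard normals, then map to correlated uniforms
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
u_event  <- pnorm(z1)
u_censor <- pnorm(z2)

## Push the uniforms through Weibull quantile (inverse CDF) functions
t_event  <- qweibull(u_event,  shape = 1.5, scale = 2)
t_censor <- qweibull(u_censor, shape = 1.5, scale = 3)

## Observed time and event indicator; the marginals are Weibull but the
## event and censoring times are dependent, i.e. censoring is informative
y     <- pmin(t_event, t_censor)
delta <- as.numeric(t_event <= t_censor)
cor(t_event, t_censor)
```

Repeating this over a grid of rho values, and fitting a Cox model to
(y, delta) each time, would show how the estimates react to increasing
dependence.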
Thanks
Dan
On Fri, Jul 31, 2015 at 5:06 PM, Greg Snow <538280 at gmail.com> wrote:
> Daniel,
>
> Basically just responding to your last paragraph (the others are
> interesting, but I think that you are learning as much as anyone and I
> don't currently have any other suggestions).
>
> I am not an expert on copulas, so this is a basic understanding; you
> should learn more about them if you choose to use them. The main idea
> of a copula is that it is a bivariate or multivariate distribution
> where all the variables have uniform marginal distributions but the
> variables are not independent of each other. How I would suggest
> using them: choose a copula and generate random points from a
> bivariate copula, then put those (uniform) values into the inverse
> CDF (quantile function) for the Weibull (or other distribution), one
> of which gives the event time, the other the censoring time. This
> will give you times that (marginally) come from the distributions of
> interest, but are not independent (so would be considered informative
> censoring). Repeat this with different levels of dependence in the
> copula to see how much difference it makes in your simulations.
>
> On Thu, Jul 30, 2015 at 2:02 PM, Daniel Meddings <dpmeddings at gmail.com> wrote:
> > Thanks Greg once more for taking the time to reply. I certainly agree
> > that this is not a simple set-up, although it is realistic I think. In
> > short you are correct about model mis-specification being the key to
> > producing more biased estimates under informative than under
> > non-informative censoring. After looking again at my code and trying
> > various things I realize that the key factor that leads to the
> > informative and non-informative censoring data giving rise to the same
> > biased estimates is how I generate my Z_i variable, and also the
> > magnitude of the Z_i coefficient in both the event and informative
> > censoring models.
> >
> > In the example I gave, I generated Z_i (I think of this as a "poor
> > prognosis" variable) from a beta distribution so that it ranged from 0
> > to 1. The biased estimates for "beta_t_1" (I think of this as the
> > effect of a treatment on survival) were approximately -1.56 when the
> > true value was -1. What I forgot to mention was that fitting a Cox
> > model with 1,000,000 subjects to the full data (i.e. no censoring at
> > all) arguably gives the best treatment effect estimate possible, given
> > that the effects of Z_i and Z_i*Treat_i are not in the model. This
> > "best possible" estimate turns out to be -1.55 - i.e. the example I
> > gave just so happens to be such that even with 25-27% censoring, the
> > estimates obtained are almost the best that can be attained.
> >
> > My guess is that the informative censoring does not bias the estimate
> > more than non-informative censoring because the only variable not
> > accounted for in the model is Z_i, which does not have a large enough
> > effect "beta_t_2" and/or "beta_c_2", or perhaps because Z_i only has a
> > narrow range which does not permit the current "beta_t_2" value to do
> > any damage?
> >
> > To investigate the "beta_t_2" and/or "beta_c_2" issue I changed
> > "beta_c_2" from 2 to 7, "beta_c_0" from 0.2 to -1.2, and "beta_d_0"
> > from -0.7 to -0.4 to keep the censoring %'s equal at about 30%. This
> > time the "best possible" estimate of "beta_t_1" was -1.53, which was
> > similar to that obtained previously. The informative censoring gave an
> > estimate for "beta_t_1" of -1.49 whereas the non-informative censoring
> > gave -1.53 - this time the non-informative censoring attains the "best
> > possible" but the informative censoring does not.
> >
> >
> >
> > I then instead changed "beta_t_2" from 1 to 7, "beta_c_0" from 0.2 to
> > 2, and "beta_d_0" from -0.7 to -1.9, again to keep the censoring %'s
> > equal at about 30%. This time the "best possible" estimate of
> > "beta_t_1" was -0.999, which is pretty much equal to the true value of
> > -1. The informative censoring gave an estimate for "beta_t_1" of -1.09
> > whereas the non-informative censoring gave -0.87 - surprisingly, this
> > time the informative censoring is slightly closer to the "best
> > possible" than the non-informative censoring.
> >
> >
> >
> > To investigate the Z_i issue I generated it from a normal distribution
> > with mean 1 and variance 1. I changed "beta_c_0" from 0.2 to -0.5 to
> > keep the censoring levels equal (this time about 30% for both). This
> > time the "best possible" estimate was -1.98, which was further from -1
> > than in the previous examples. The informative censoring gave an
> > estimate for "beta_t_1" of -1.81 whereas the non-informative censoring
> > gave -1.84. So this time the non-informative censoring gives an
> > estimate closer to the "best possible" than the informative censoring,
> > but neither attains the "best possible".
> >
> > In conclusion it is clear to me that a stronger Z_i effect in the
> > censoring model causes the informative censoring to be worse than the
> > non-informative one (as expected), but a stronger Z_i effect in the
> > event model does not have this effect, and even causes the independent
> > censoring to be worse - this may not hold in general but I nonetheless
> > see it here. I am wondering if this is because altering the Z_i effect
> > in the event model also affects the independent censoring process, and
> > so it "muddies the waters", whereas altering the Z_i effect in the
> > informative censoring model obviously confines the changes to just the
> > informative censoring process. For a fixed treatment effect size in
> > both the event and informative censoring models, the effect of Z_i
> > having a wider range than is possible under the beta distribution also
> > appears to produce informative censoring that is worse than the
> > non-informative one. This makes sense I think, because the
> > Z_i-response relationship must be more informative?
> >
> >
> >
> > Thanks for your suggestion of copulas - I have not come across these.
> > Is this similar to assuming an event model for the censored subjects
> > (which is unobserved) - i.e. if the event model is different
> > conditional on censoring, then if we could observe the events beyond
> > censoring, clearly the parameter estimates would be different compared
> > to those obtained when modelling only non-censored times?
> >
> >
> >
> >
> > On Wed, Jul 29, 2015 at 5:37 PM, Greg Snow <538280 at gmail.com> wrote:
> >>
> >> As models become more complex it becomes harder to distinguish
> >> different parts and their effects. Even for a straightforward linear
> >> regression model if X1 and X2 are correlated with each other then it
> >> becomes difficult to distinguish between the effects of X1^2, X2^2,
> >> and X1*X2. In your case the informative censoring and model
> >> misspecification are becoming hard to distinguish (and it could be
> >> argued that having informative censoring is really just a form of
> >> model misspecification). So I don't think so much that you are doing
> >> things wrong, just that you are learning that the models are complex.
> >>
> >> Another approach to simulation that you could try is to simulate the
> >> event time and censoring time using copulas (and therefore they can be
> >> correlated to give informative censoring, but without relying on a
> >> term that you could have included in the model) then consider the
> >> event censored if the censoring time is shorter.
> >>
> >> On Fri, Jul 24, 2015 at 10:14 AM, Daniel Meddings <dpmeddings at gmail.com> wrote:
> >> > Hi Greg
> >> >
> >> > Many thanks for taking the time to respond to my query. You are
> >> > right about pointing out the distinction between what variables are
> >> > and are not included in the event times process and in the censoring
> >> > process. I clearly forgot this important aspect. I amended my code
> >> > to do as you advise and now I am indeed getting biased estimates
> >> > when using the informatively censored responses. The problem is now
> >> > that the estimates from the independently censored responses are the
> >> > same - i.e. they are just as biased. Thus the bias seems to be due
> >> > entirely to model mis-specification and not the informative
> >> > censoring.
> >> >
> >> >
> >> > To give a concrete example, I simulate event times T_i and
> >> > censoring times C_i from the following models:
> >> >
> >> >
> >> > T_i ~ Weibull(lambda_t(x), v_t), lambda_t(x) = lambda_t*exp( beta_t_0 +
> >> > (beta_t_1*Treat_i) + (beta_t_2*Z_i) + (beta_t_3*Treat_i*Z_i) )
> >> >
> >> > C_i ~ Weibull(lambda_c(x), v_c), lambda_c(x) = lambda_c*exp( beta_c_0 +
> >> > (beta_c_1*Treat_i) + (beta_c_2*Z_i) + (beta_c_3*Treat_i*Z_i) )
> >> >
> >> > D_i ~ Weibull(lambda_d(x), v_d), lambda_d(x) = lambda_d*exp( beta_d_0 )
> >> >
> >> > where:
> >> >
> >> > beta_t_0 = 1, beta_t_1 = -1, beta_t_2 = 1, beta_t_3 = -2, lambda_t = 0.5
> >> >
> >> > beta_c_0 = 0.2, beta_c_1 = -2, beta_c_2 = 2, beta_c_3 = -2, lambda_c = 0.5
> >> >
> >> > beta_d_0 = -0.7, lambda_d = 0.1
> >> >
> >> > When I fit the Cox model to both the informatively censored
> >> > responses and the independently censored responses, I include only
> >> > the Treatment covariate in the model.
> >> >
> >> > I simulate Treatment from a Bernoulli distribution with p = 0.5 and
> >> > Z_i from a beta distribution so that Z ranges from 0 to 1 (I like to
> >> > think of Z as a "poor" prognosis probability, where Z_i = 1 means a
> >> > subject is 100% certain to have a poor prognosis and Z_i = 0 means
> >> > zero chance). These parameter choices give approximately 27% and 25%
> >> > censoring for the informatively censored responses (using C_i) and
> >> > the independently censored responses (using D_i) respectively. I use
> >> > N = 2000 subjects and 2000 simulation replications.
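The set-up described in this message can be sketched in R as follows. This
is a reconstruction, not the original code: the Weibull shape parameters
and the Beta parameters for Z_i are not stated above, so v = 1 for all
three models and Beta(2, 2) for Z_i are assumptions (which makes the
realised censoring percentages differ somewhat from the 25-27% quoted),
and the hazard lambda*exp(eta)*v*t^(v-1) is mapped to rweibull via
scale = (lambda*exp(eta))^(-1/v).

```r
## Sketch of the simulation described above. Shapes (v = 1) and the
## Beta(2, 2) draw for Z_i are assumptions; betas and lambdas are the
## values quoted in the text.
library(survival)
set.seed(1)
n     <- 2000
treat <- rbinom(n, 1, 0.5)              # treatment indicator
z     <- rbeta(n, 2, 2)                 # "poor prognosis" in (0, 1)

## Weibull times for hazard lambda*exp(eta)*v*t^(v-1):
## rweibull scale = (lambda * exp(eta))^(-1/v)
rweib_ph <- function(n, lambda, eta, v) {
  rweibull(n, shape = v, scale = (lambda * exp(eta))^(-1 / v))
}

eta_t <- 1   - 1 * treat + 1 * z - 2 * treat * z   # event model
eta_c <- 0.2 - 2 * treat + 2 * z - 2 * treat * z   # informative censoring
eta_d <- -0.7                                      # independent censoring

t_event <- rweib_ph(n, 0.5, eta_t, v = 1)
c_inf   <- rweib_ph(n, 0.5, eta_c, v = 1)
d_ninf  <- rweib_ph(n, 0.1, eta_d, v = 1)

## Observed data under each censoring mechanism
y_inf      <- pmin(t_event, c_inf)
delta_inf  <- as.numeric(t_event <= c_inf)
y_ninf     <- pmin(t_event, d_ninf)
delta_ninf <- as.numeric(t_event <= d_ninf)

## Mis-specified Cox fits: Treatment only, Z and Treat*Z omitted
fit_inf  <- coxph(Surv(y_inf,  delta_inf)  ~ treat)
fit_ninf <- coxph(Surv(y_ninf, delta_ninf) ~ treat)
coef(fit_inf); coef(fit_ninf)
```

Fitting the same Treatment-only model to the uncensored t_event then gives
the "best possible" benchmark discussed below.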
> >> >
> >> > From the above simulation I get estimates of beta_t_1 of -1.526 and
> >> > -1.537 for the informatively censored responses and the
> >> > independently censored responses respectively.
> >> >
> >> > Furthermore, when I fit a Cox model to the full responses (no
> >> > censoring at all) I get an estimate of beta_t_1 of -1.542. This
> >> > represents the best that can possibly be done given that Z and
> >> > Treat*Z are not in the model. Clearly censoring is not making much
> >> > of a difference here - model mis-specification dominates.
> >> >
> >> > I still must be doing something wrong, but I cannot figure this one
> >> > out.
> >> >
> >> > Thanks
> >> >
> >> > Dan
> >> >
> >> >
> >> >
> >> > On Thu, Jul 23, 2015 at 12:33 AM, Greg Snow <538280 at gmail.com> wrote:
> >> >>
> >> >> I think that the Cox model still works well when the only information
> >> >> in the censoring is conditional on variables in the model. What you
> >> >> describe could be called non-informative conditional on x.
> >> >>
> >> >> To really see the difference you need informative censoring that
> >> >> depends on something not included in the model. One option would be
> >> >> to use copulas to generate dependent data and then transform the
> >> >> values using your Weibull. Or you could generate your event times and
> >> >> censoring times based on x1 and x2, but then only include x1 in the
> >> >> model.
> >> >>
> >> >> On Wed, Jul 22, 2015 at 2:20 AM, Daniel Meddings <dpmeddings at gmail.com> wrote:
> >> >> > I wish to simulate event times where the censoring is
> >> >> > informative, and to compare parameter estimator quality from a
> >> >> > Cox PH model with estimates obtained from event times generated
> >> >> > with non-informative censoring. However, I am struggling to do
> >> >> > this, and I conclude that, rather than there being a technical
> >> >> > flaw in my code, I instead do not understand what is meant by
> >> >> > informative and non-informative censoring.
> >> >> >
> >> >> > My approach is to simulate an event time T, dependent on a vector
> >> >> > of covariates x, having hazard function
> >> >> > h(t|x) = lambda*exp(beta'*x)*v*t^{v-1}. This corresponds to
> >> >> > T ~ Weibull(lambda(x), v), where the scale parameter
> >> >> > lambda(x) = lambda*exp(beta'*x) depends on x and the shape
> >> >> > parameter v is fixed. I have N subjects where
> >> >> > T_{i} ~ Weibull(lambda(x_{i}), v_{T}),
> >> >> > lambda(x_{i}) = lambda_{T}*exp(beta_{T}'*x_{i}), for i = 1,...,N.
> >> >> > Here I assume the regression coefficients are p-dimensional.
> >> >> >
> >> >> > I generate informative censoring times
> >> >> > C_i ~ Weibull(lambda(x_i), v_C),
> >> >> > lambda(x_i) = lambda_C*exp(beta_C'*x_i), and compute
> >> >> > Y_inf_i = min(T_i, C_i) and a censoring flag delta_inf_i = 1 if
> >> >> > T_i <= C_i (an observed event), and delta_inf_i = 0 if T_i > C_i
> >> >> > (informatively censored: event not observed). I am convinced this
> >> >> > is informative censoring because, as long as beta_T ~= 0 and
> >> >> > beta_C ~= 0, the data generating processes for T and C both
> >> >> > depend on x for each subject.
> >> >> >
> >> >> > In contrast, I generate non-informative censoring times
> >> >> > D_i ~ Weibull(lambda_D*exp(beta_D), v_D), and compute
> >> >> > Y_ninf_i = min(T_i, D_i) and a censoring flag delta_ninf_i = 1 if
> >> >> > T_i <= D_i (an observed event), and delta_ninf_i = 0 if T_i > D_i
> >> >> > (non-informatively censored: event not observed). Here beta_D is
> >> >> > a scalar. I "scale" the simulation by choosing the lambda_T,
> >> >> > lambda_C and lambda_D parameters such that on average T_i < C_i
> >> >> > and T_i < D_i, to achieve X% of censored subjects for both
> >> >> > Y_inf_i and Y_ninf_i.
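The generation step described above can be done by inverse-transform
sampling: with cumulative hazard H(t) = lambda*exp(beta'x)*t^v, setting
T = (-log(U)/(lambda*exp(beta'x)))^(1/v) for U ~ Uniform(0,1) gives the
required distribution, which coincides with R's rweibull with shape v and
scale (lambda*exp(beta'x))^(-1/v). A sketch, with purely illustrative
parameter values:

```r
## Inverse-transform sampling for T with hazard
## h(t|x) = lambda * exp(beta'x) * v * t^(v-1), i.e. cumulative hazard
## H(t) = lambda * exp(beta'x) * t^v, so T = (-log(U)/rate)^(1/v).
## Parameter values here are illustrative only.
set.seed(123)
n      <- 50000
x      <- rnorm(n)
lambda <- 0.5
beta   <- 0.8
v      <- 1.5
rate   <- lambda * exp(beta * x)
t_inv  <- (-log(runif(n)) / rate)^(1 / v)

## Same distribution via rweibull: shape v, scale rate^(-1/v)
t_rw   <- rweibull(n, shape = v, scale = rate^(-1 / v))

## The two samples should agree in distribution, e.g. their medians
quantile(t_inv, 0.5); quantile(t_rw, 0.5)
```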
> >> >> >
> >> >> > The problem is that even for, say, 30% censoring (which I think
> >> >> > is high), the Cox PH parameter estimates using both Y_inf and
> >> >> > Y_ninf are unbiased, when I expected the estimates using Y_inf to
> >> >> > be biased, and I think I see why: however different beta_C is
> >> >> > from beta_T, a censored subject can presumably influence the
> >> >> > estimation of beta_T only by affecting the set of subjects at
> >> >> > risk at any time t, but this does not change the fact that every
> >> >> > single Y_inf_i with delta_inf_i = 1 will have been generated
> >> >> > using beta_T only. Thus I do not see how my simulation can
> >> >> > possibly produce biased estimates for beta_T using Y_inf.
> >> >> >
> >> >> > But then what is informative censoring, if not based on this
> >> >> > approach?
> >> >> >
> >> >> > Any help would be greatly appreciated.
> >> >> >
> >> >> > ______________________________________________
> >> >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> > PLEASE do read the posting guide
> >> >> > http://www.R-project.org/posting-guide.html
> >> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Gregory (Greg) L. Snow Ph.D.
> >> >> 538280 at gmail.com
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>
>