[R] Help with hazard plots

Marc Schwartz marc_schwartz at comcast.net
Fri Aug 1 06:09:32 CEST 2008


on 07/31/2008 04:29 PM Alan Cox wrote:
> Hello.  I am hoping someone will be willing to help me understand
> something about hazard plots created with muhaz(...).  I have some
> background in statistics (minor in grad school), but I haven't been
> able to figure out one thing about hazard plots.  I am using hazard
> plots to track customer cancellations.  I figure I can treat a
> cancellation as a "death", and if someone is still a customer today,
> they're right censored.  I know that a hazard plot shows the
> probability that someone will cancel in month n given that they're a
> customer in month n-1.
> 
> 
> If a customer signs up on January 1st and cancels on January 2nd,
> we've had what I thought was an intellectual but pointless debate
> about whether we count that as being a customer for 1 month or 0
> months.  I thought the two plots would be identical, except for a
> different X axis.
> 
> 
> However, when I create the two plots, they are very different ...
> very, very different.  I've posted the two plots to Flickr:
> 
> 
> http://flickr.com/photos/alancox/2720915878/in/photostream/ shows the
> plot where the lifetime of a customer who signs up on Jan 1 and
> cancels on Jan 2 is 0.
> 
> http://flickr.com/photos/alancox/2720915904/in/photostream/ shows the
> plot where the lifetime of a customer who signs up on Jan 1 and
> cancels on Jan 2 is 1.
> 
> My question is: Why are these two so different?  How do I know which
> is right?
> 
> The call that I'm making to produce the model is:
> 
> hazardV08 <- muhaz(nmc, s, max.time = max(nmc))


I suspect that there is more here than meets the eye.

Lacking your data and the actual code that you are using to generate the 
two different curves, this could be anything from the way in which you 
have coded/collapsed/truncated the event intervals, to the way in which 
muhaz() is fitting the smoothed curve to each of the two data sets.

The "correct" way to track the intervals would be to use a resolution of 
days, which could be transformed into months and fractions thereof (e.g., 
by dividing days by 30.44, the average length of a month) if you prefer. 
The day of sign-up would be Day 0 and each subsequent calendar day would 
increment the interval by one day.

So based upon your example above (sign-up on Jan 1, cancel on Jan 2), 
the customer would have an "event" on day 1, or 1 / 30.44 = 0.03285151 
months.
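
As a rough sketch of that conversion (the dates below are made up for
illustration and are not your data):

```r
# Hypothetical sign-up and cancellation dates (illustrative only)
signup <- as.Date(c("2008-01-01", "2008-01-15", "2008-03-10"))
cancel <- as.Date(c("2008-01-02", "2008-04-15", "2008-03-25"))

# Interval in days: the sign-up day is Day 0
days <- as.numeric(cancel - signup)

# Convert to months using an average month length of 30.44 days
months <- days / 30.44

data.frame(signup, cancel, days, months)
```

The first row is the Jan 1 / Jan 2 customer, who ends up with an
interval of 1 day rather than a whole month of 0 or 1.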

All of your censored observations (clients that have not yet canceled) 
should have their intervals measured from their own Time 0 (sign-up day) 
to whatever date you are using as your end point. I am guessing that you 
might have some form of paid membership, such that as long as the 
customer is paying, they are considered active, as opposed to a customer 
who simply stops doing business with you without your knowing it. If the 
customer is paying some type of monthly fee, for example, then you 
should really censor them at their last payment date, not at today's 
date, since the last payment date is the last point at which you know 
that they were still a paying client.
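
A minimal sketch of building the time and status variables that way
follows. The data frame and its column names are hypothetical, and I am
assuming that your 'nmc' is the time as a customer in months and 's' is
the event indicator (1 = canceled, 0 = censored):

```r
# Hypothetical customer records (illustrative only);
# cancel is NA for customers who have not canceled
customers <- data.frame(
  signup       = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01")),
  cancel       = as.Date(c("2008-01-02", NA, NA)),
  last_payment = as.Date(c("2008-01-01", "2008-07-01", "2008-06-15"))
)

# Event indicator: 1 = canceled, 0 = right censored
active <- is.na(customers$cancel)
customers$s <- as.integer(!active)

# End of follow-up: the cancellation date if observed,
# otherwise the last payment date (not today's date)
end_date <- customers$cancel
end_date[active] <- customers$last_payment[active]

# Time as a customer, in days and in months
customers$days <- as.numeric(end_date - customers$signup)
customers$nmc  <- customers$days / 30.44

customers
```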

This would be akin to a patient coming in for a follow-up contact, at 
which point you know they are still alive. Once they leave the office, 
you don't know if they are alive until the next actual contact, as they 
might be hit by a car walking to the parking lot.

Based upon your comments above, you appear to have information on a 
daily basis. If you are collapsing time into integer months, you are 
losing information. The kernel-based approach used by muhaz(), as I 
understand it, is highly sensitive to small datasets and to the 
granularity of the data, among other things.
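
To give a feel for how much resolution is lost, here is a small sketch
on simulated lifetimes (not your data; the exponential distribution and
its rate are arbitrary choices for illustration). It compares the
number of distinct event times under daily resolution and under the two
integer-month conventions you describe:

```r
set.seed(1)

# Simulated lifetimes in whole days, at least 1 day (illustrative only)
days <- ceiling(rexp(200, rate = 1 / 180))

months_exact   <- days / 30.44            # keeps daily resolution
months_floor   <- floor(months_exact)     # Jan 1 to Jan 2 counts as 0
months_ceiling <- ceiling(months_exact)   # Jan 1 to Jan 2 counts as 1

# Number of distinct event times available to a smoother
c(exact   = length(unique(months_exact)),
  floor   = length(unique(months_floor)),
  ceiling = length(unique(months_ceiling)))

# Number of customers recorded with a lifetime of exactly zero
# under the "0 months" convention
sum(months_floor == 0)
```

Under the "0 months" convention a block of customers sits exactly at
time zero, which is the lower boundary of the time axis. I would not be
surprised if a kernel smoother treats that case differently from the
"1 month" coding, but that is a guess without seeing your data.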

You might want to review the online complement to MASS4 by Venables and 
Ripley here:

   http://www.stats.ox.ac.uk/pub/MASS4/#Complements

and review the section on survival analysis, which covers smoothing 
functions for survival.

You might also want to simply consider using a standard Kaplan-Meier 
non-parametric estimator using survfit() in the survival package. The 
function calls for your data should be something like:

   library(survival)

   summary(survfit(Surv(nmc, s) ~ 1))

and

   plot(survfit(Surv(nmc, s) ~ 1))
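
If what you are after is the month-by-month cancellation rate rather
than the survival curve, a crude unsmoothed version can be read off the
survfit() object. The sketch below uses simulated data (not yours), and
the use of ceiling() to assign whole months is just one possible
convention:

```r
library(survival)

set.seed(1)

# Simulated data (illustrative only):
# nmc = months as a customer, s = 1 if canceled, 0 if censored
true_time <- rexp(200, rate = 1 / 6)
cens_time <- runif(200, min = 0, max = 12)
nmc <- pmin(true_time, cens_time)
s   <- as.integer(true_time <= cens_time)

# Collapse to whole months; any lifetime within the first month counts as 1
month <- ceiling(nmc)

# Kaplan-Meier fit; recent versions of survival expect a formula
fit_m <- survfit(Surv(month, s) ~ 1)

# Discrete hazard: cancellations in a month divided by the number
# still at risk at the start of that month
hazard <- fit_m$n.event / fit_m$n.risk

data.frame(month = fit_m$time, at_risk = fit_m$n.risk,
           canceled = fit_m$n.event, hazard = round(hazard, 3))
```

Something like plot(fit_m$time, hazard, type = "s") would then draw the
raw step function, which you could compare against the smoothed curve
from muhaz().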


HTH,

Marc Schwartz


