[R] Construction of Dataset for time varying COXPH analysis

yongchuan panyc at pacific.net.sg
Mon Oct 23 18:38:37 CEST 2006


Question: When survfit() is called on a coxph object, the 'n' it reports (n=6) is vastly smaller than the number of distinct loans in the dataset.

I am trying to estimate a Cox proportional hazards model for a set of loans (over 6000) using time-varying covariates. For these 6000+ loans, I have some 62,000 records, each representing one loan over one period of time. I did the following:

resultsOpt <- coxph(Surv(Start,Stop,PrepayDate)~ closingCoupon + loanPurposeId, data=latest)

which returned:

Call:
coxph(formula = Surv(Start, Stop, PrepayDate) ~ closingCoupon + 
    loanPurposeId, data = latest)


               coef exp(coef) se(coef)    z       p
closingCoupon 0.101      1.11   0.0271 3.73 1.9e-04
loanPurposeId 0.434      1.54   0.0624 6.96 3.3e-12

Likelihood ratio test=50.3  on 2 df, p=1.18e-11  n= 62297 


which seems fair.


However when I do:

> survfit(resultsOpt)
Call: survfit.coxph(object = resultsOpt)

      n  events  median 0.95LCL 0.95UCL 
      6     489     Inf     Inf     Inf 

Here n = 6, whereas the number of distinct loans in the dataset is more like 6554.

My dataset looks like the following when I call it from within R:

> latest[1:5, 1:5]
  Start Stop PrepayDate modBalance closingCoupon
1     6    7          0   811.2769          8.35
2     7    8          0   811.2769          8.35
3     8    9          1   811.2769          8.35
4     4    5          0  2226.0825          8.70
5     5    6          0  2226.0825          8.70


where the first 3 rows represent one loan and the next 2 rows a second one. Am I putting the data in an incorrect format, and if so, how should I correct it? Thanks much.
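
For concreteness, here is a minimal sketch of the layout described above. It rebuilds the five printed rows in base R and counts interval records, distinct loans and prepayment events separately. The loanId column and the name latest_demo are assumptions made for illustration; no loan identifier appears in the printout above, and the loan numbering simply follows the description (rows 1-3 are one loan, rows 4-5 another).

```r
# Rebuild the five rows printed above.
# ASSUMPTION: loanId is a hypothetical identifier column, added only to
# mark which rows belong to the same loan; it is not in the printout.
latest_demo <- data.frame(
  loanId        = c(1, 1, 1, 2, 2),
  Start         = c(6, 7, 8, 4, 5),
  Stop          = c(7, 8, 9, 5, 6),
  PrepayDate    = c(0, 0, 1, 0, 0),
  modBalance    = c(811.2769, 811.2769, 811.2769, 2226.0825, 2226.0825),
  closingCoupon = c(8.35, 8.35, 8.35, 8.70, 8.70)
)

n_rows   <- nrow(latest_demo)                   # interval records
n_loans  <- length(unique(latest_demo$loanId))  # distinct loans
n_events <- sum(latest_demo$PrepayDate)         # prepayment events

# Check that each loan's intervals are contiguous, i.e. every Start
# equals the Stop of the previous row for the same loan.
contiguous <- all(sapply(split(latest_demo, latest_demo$loanId), function(d) {
  d <- d[order(d$Start), ]
  all(d$Start[-1] == d$Stop[-nrow(d)])
}))
```

On the full dataset the same three counts would presumably correspond to the 62297 records reported by coxph(), the roughly 6554 distinct loans, and the 489 events reported by survfit().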

Pan


