[R] Construction of Dataset for time varying COXPH analysis
yongchuan
panyc at pacific.net.sg
Mon Oct 23 18:38:37 CEST 2006
Question: When survfit() is applied to a coxph object, the 'n' it returns (n=6) is vastly smaller than the number of distinct loans in the dataset used.
I am trying to estimate a Cox proportional hazards model for a set of loans (over 6000) using time-varying covariates. For these 6000+ loans, I have some 62,000 records, each representing a loan during one period of time. I did the following:
resultsOpt <- coxph(Surv(Start,Stop,PrepayDate)~ closingCoupon + loanPurposeId, data=latest)
which returned:
Call:
coxph(formula = Surv(Start, Stop, PrepayDate) ~ closingCoupon +
    loanPurposeId, data = latest)

               coef exp(coef) se(coef)    z       p
closingCoupon 0.101      1.11   0.0271 3.73 1.9e-04
loanPurposeId 0.434      1.54   0.0624 6.96 3.3e-12

Likelihood ratio test=50.3  on 2 df, p=1.18e-11  n= 62297
which seems fair.
However when I do:
> survfit(resultsOpt)
Call: survfit.coxph(object = resultsOpt)

    n  events  median 0.95LCL 0.95UCL
    6     489     Inf     Inf     Inf
Here n = 6, whereas the number of distinct loans in the dataset is more like 6554.
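To make the three quantities in play concrete (loan-period rows, distinct loans, and prepayment events), here is a minimal sketch on a toy stand-in for the data. The `loanId` column and the `latest_toy` name are assumptions for illustration only; the post does not show which column, if any, identifies a loan.

```r
# Toy stand-in for `latest`. `loanId` is a hypothetical identifier column
# that does not appear in the columns shown in the post.
latest_toy <- data.frame(
  loanId     = c(1, 1, 1, 2, 2),
  Start      = c(6, 7, 8, 4, 5),
  Stop       = c(7, 8, 9, 5, 6),
  PrepayDate = c(0, 0, 1, 0, 0)
)

n_rows   <- nrow(latest_toy)                   # loan-period records
n_loans  <- length(unique(latest_toy$loanId))  # distinct loans
n_events <- sum(latest_toy$PrepayDate)         # prepayment events

cat("rows:", n_rows, " loans:", n_loans, " events:", n_events, "\n")
```

On the real data, the first of these counts should correspond to the n= 62297 printed by coxph, which counts rows rather than loans.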
My dataset looks like the following when I call it from within R:
> latest[1:5, 1:5]
  Start Stop PrepayDate modBalance closingCoupon
1     6    7          0   811.2769          8.35
2     7    8          0   811.2769          8.35
3     8    9          1   811.2769          8.35
4     4    5          0  2226.0825          8.70
5     5    6          0  2226.0825          8.70
where the first 3 rows represent one loan, and the next 2 rows a new one. Am I putting the data in an incorrect format, and if so, how should I correct it? Thanks very much.
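One way to sanity-check the (Start, Stop] layout described above is sketched below, under the assumption that a loan identifier is available. The `loanId` column and the `check_loan` helper are hypothetical names introduced for illustration. The check covers three things per loan: every interval has positive length, consecutive intervals join up, and a prepayment can only appear on the loan's last row.

```r
# The five rows shown above, plus a hypothetical `loanId` column.
toy <- data.frame(
  loanId     = c(1, 1, 1, 2, 2),
  Start      = c(6, 7, 8, 4, 5),
  Stop       = c(7, 8, 9, 5, 6),
  PrepayDate = c(0, 0, 1, 0, 0)
)

# Hypothetical helper: checks the rows belonging to a single loan.
check_loan <- function(d) {
  d <- d[order(d$Start), ]
  k <- nrow(d)
  c(
    positive_length = all(d$Stop > d$Start),          # Start < Stop on every row
    contiguous      = all(d$Start[-1] == d$Stop[-k]), # no gaps or overlaps
    event_last_only = all(d$PrepayDate[-k] == 0)      # event only on the final row
  )
}

# One row of results per loan, one column per check.
checks <- t(sapply(split(toy, toy$loanId), check_loan))
print(checks)
```

If every entry is TRUE, the rows are at least internally consistent with the counting-process form that Surv(Start, Stop, PrepayDate) expects. This does not by itself explain the n reported by survfit().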
Pan