[R] survexp with large dataframes
Terry Therneau
therneau at mayo.edu
Mon Oct 3 16:05:39 CEST 2011
I've re-looked at survexp with the question of efficiency in mind. As it
stands, the code holds 3-4 (I think it's 4) active copies of the X
matrix at one point; this is likely the reason it takes so much memory
when you have a large data set.
Some of this is history: key parts of the code were written long
before I understood all the "tricks" for lower memory use in S (Splus or
R), and one copy is due to the loss of the COPY= argument in the move from
Splus to R.
I can see how to redo it and reduce this to one copy, but that involves 3 R
functions and 3 C routines. I'll add it to my list, but don't expect
quick results given the long list in front of it. It's been a good
summer, but as one of my colleagues put it "No vacation goes
unpunished."
As a mid-term suggestion, I would use a subsample of your data. With the
data set sizes you describe, a 20% subsample will give all the precision
that you need. Specifically:
1. Save the results of your current Cox model, call it fit1
2. Select a subset.
3. Fit a new Cox model on the subset, with the options
iter.max=0, init=fit1$coef
(in R's coxph, the iteration count is iter.max, which is passed on to
coxph.control)
This ensures that the subset has exactly the same coefficients as the
original.
4. Use survexp on the subset fit.
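The four steps above can be sketched as follows. This is a minimal
illustration using the survival package's built-in lung data as a
stand-in for a large data set; the model formula and the 20% sampling
fraction are placeholders for your own.

```r
library(survival)

## 1. Fit the Cox model on the full data and save it
fit1 <- coxph(Surv(time, status) ~ age + sex, data = lung)

## 2. Select a ~20% subset
set.seed(42)
keep <- sample(nrow(lung), size = floor(0.2 * nrow(lung)))
sub  <- lung[keep, ]

## 3. Refit on the subset, forcing the original coefficients:
##    init supplies the starting values and iter.max = 0 stops the
##    Newton-Raphson iteration before it can move away from them
fit2 <- coxph(Surv(time, status) ~ age + sex, data = sub,
              init = coef(fit1), iter.max = 0)

## The subset fit has exactly the same coefficients as the original
all.equal(coef(fit1), coef(fit2))

## 4. Expected survival, computed on the much smaller subset
es <- survexp(~ 1, data = sub, ratetable = fit2)
```

Because iter.max = 0 performs no iterations, coxph may warn that it
ran out of iterations; that warning is expected and harmless here.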
Terry Therneau