[R] survexp with large dataframes
Terry Therneau
therneau at mayo.edu
Mon Oct 3 16:05:39 CEST 2011
I've re-looked at survexp with the question of efficiency in mind. As it
stands, the code holds 3-4 (I think it's 4) active copies of the X
matrix at one point; this is likely the reason it takes so much memory
when you have a large data set.
Some of this is history: key parts of the code were written long
before I understood all the "tricks" for lower memory use in S (Splus or
R), and one copy is due to the loss of the COPY= argument in the move from
Splus to R.
I can see how to redo it and reduce this to one copy, but that involves 3 R
functions and 3 C routines. I'll add it to my list, but don't expect
quick results given the long list in front of it. It's been a good
summer, but as one of my colleagues put it "No vacation goes
unpunished."
As a mid-term suggestion, I would use a subsample of your data. With the
data set sizes you describe, a 20% subsample will give all the precision
that you need. Specifically:
1. Save the results of your current Cox model, call it fit1
2. Select a subset.
3. Fit a new Cox model on the subset, with the options
iter.max=0, init=fit1$coef
(in R's coxph, the iteration count is iter.max, which is passed on to
coxph.control)
This ensures that the subset has exactly the same coefficients as the
original.
4. Use survexp on the subset fit.
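The four steps above can be sketched as follows. This is a minimal
illustration using the survival package's built-in lung data as a
stand-in for a large data set; the model formula and the 20% sampling
fraction are placeholders for your own.

```r
library(survival)

## 1. Fit the Cox model on the full data and save it
fit1 <- coxph(Surv(time, status) ~ age + sex, data = lung)

## 2. Select a ~20% subset
set.seed(42)
keep <- sample(nrow(lung), size = floor(0.2 * nrow(lung)))
sub  <- lung[keep, ]

## 3. Refit on the subset, forcing the original coefficients:
##    init supplies the starting values and iter.max = 0 stops the
##    Newton-Raphson iteration before it can move away from them
fit2 <- coxph(Surv(time, status) ~ age + sex, data = sub,
              init = coef(fit1), iter.max = 0)

## The subset fit has exactly the same coefficients as the original
all.equal(coef(fit1), coef(fit2))

## 4. Expected survival, computed on the much smaller subset
es <- survexp(~ 1, data = sub, ratetable = fit2)
```

Because iter.max = 0 performs no iterations, coxph may warn that it
ran out of iterations; that warning is expected and harmless here.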
Terry Therneau