[R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

Wed Mar 25 14:03:06 CET 2009

Ravi Varadhan wrote:
> Fine detective work, David.  Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).    
> 
> Best,
> Ravi.

Yes, that is a big frustration for me too, even for projects on which I
was the principal statistician in 1990 and did a poor job of archiving
excellent medical datasets for future use.  This is a big
advertisement for the reproducible research movement.

David - fantastic job.  Based on what you found, the version on our web 
site looks as good as any.  Now if someone can explain to me why you see 
a spike near a serum prostate acid phosphatase (AP) value of 1 when you 
use a flexible regression model (e.g., restricted cubic spline) to 
relate AP to the log hazard of death in a survival model (see p. 518 in 
my book), that would be very helpful.

If you do with(prostate,plot(supsmu(log(ap),1*(status!='alive')))) you 
see a minimum at ap=2.37 after anti-logging.  If you do

dd <- datadist(prostate); options(datadist='dd')
f <- cph(Surv(dtime,status!='alive') ~ rcs(log(ap),6), data=prostate)
plot(f)

you see a sharp minimum at ap=1.43.  With 4 knots the min is at 1.18.
You have to go to 3 knots to get a monotonic fit in log(ap) but AIC is 
not as good.
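
A minimal sketch of the knot comparison described above, for anyone who wants to try the same procedure without the prostate data at hand. Everything here is an assumption made for illustration: the data are simulated (not the Byar and Green data), the true effect is monotone in log(ap) by construction, and splines::ns() with survival::coxph() stands in for rcs() with cph(). Both are restricted (natural) cubic splines, but ns() puts its boundary knots at the data range while rcs() puts all knots at quantiles, so the fits will not match rms exactly.

```r
library(survival)  # coxph, Surv
library(splines)   # ns: natural (restricted) cubic spline basis

set.seed(1)
n <- 500

# Simulated stand-in for the prostate data; NOT the real data set.
ap      <- exp(rnorm(n))                              # skewed, positive marker
t_event <- rexp(n, rate = 0.02 * exp(0.3 * log(ap)))  # hazard rises with log(ap)
t_cens  <- rexp(n, rate = 0.01)                       # independent censoring times
d <- data.frame(ap    = ap,
                dtime = pmin(t_event, t_cens),
                dead  = t_event <= t_cens)

# k knots in the restricted-cubic-spline sense correspond to k - 1 df.
fit_k <- function(k) {
  coxph(Surv(dtime, dead) ~ ns(log(ap), df = k - 1), data = d)
}

knots <- c(3, 4, 6)
fits  <- lapply(knots, fit_k)

# AIC for each number of knots (smaller is better).
aic <- sapply(fits, AIC)
names(aic) <- paste(knots, "knots")
print(aic)

# Value of ap (original scale) at which each fitted log relative hazard
# is smallest, evaluated on a grid spanning the observed range.
grid <- data.frame(
  ap = exp(seq(log(min(d$ap)), log(max(d$ap)), length.out = 200))
)
min_at <- sapply(fits, function(f) {
  lp <- predict(f, newdata = grid, type = "lp")
  grid$ap[which.min(lp)]
})
names(min_at) <- names(aic)
print(min_at)
```

If the fitted minimum sits at the lower edge of the grid, the fit is monotone increasing over the observed range; an interior minimum is the kind of dip discussed above.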

Frank

> 
> ____________________________________________________________________
> 
> Ravi Varadhan, Ph.D.
> Assistant Professor,
> Division of Geriatric Medicine and Gerontology
> School of Medicine
> Johns Hopkins University
> 
> Ph. (410) 502-2619
> email: rvaradhan at jhmi.edu
> 
> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> Date: Tuesday, March 24, 2009 10:54 pm
> Subject: Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
> To: Rolf Turner <r.turner at auckland.ac.nz>
> Cc: R-help Forum <r-help at r-project.org>, Ravi Varadhan <rvaradhan at jhmi.edu>
> 
> 
>>  On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:
>>  
>>  >
>>  > On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
>>  >
>>  > 	<snip>
>>  >
>>  >>> (2) Scrolling down to ``Byar and Green prostate cancer data''
>>  >>> appeared to get me to the right place.  But I couldn't see any
>>  >>> signs of any ``R binary files''.
>>  >>
>>  >> Please look again.  It's under the heading "R".  Unfortunately I used
>>  >> .sav suffix for save() files in the old days.
>>  >
>>  > 	Ah-ha.  Oh me of little faith.  I have been hanging around (in
>>  > 	my current work environment) with too many SPSS users, and the
>>  > 	*.sav extension seems to be the standard for SPSS data files.
>>  > 	Whence my corrupted thinking.
>>  >
>>  >> The .xls file opened with no problem in OpenOffice; has 506 rows.
>>  >
>>  > 	Hmmm.  When I opened it with Excel on the Mac I got a spread
>>  > 	sheet with 503 rows --- the first row being the column names,
>>  > 	so there were really 502 rows.
>>  
>>  The last "patnr" is "506" but there are only 502 lines of data. 471,
>>  473, 475 and 488 are missing.
>>  
>>  And the CMU Statlib version for 2002 looks the same.
>>  
>>  
>>  The version at this site is missing more than 25 cases:
>>  
>>  
>>  Here are two other copies of the dataset, the first of which appears
>>  to have those missing cases:
>>  This one has patient numbers:
>>  
>>  
>>  This one has a description of the fields and cites the one above but
>>  has not retained the patient numbers and has apparently only kept the
>>  475 cases with complete data.
>>  
>>  
>>  
>>  >
>>  
>>  David Winsemius, MD
>>  Heritage Laboratories
>>  West Hartford, CT
>>  
> 

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University