[R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Wed Mar 25 14:03:06 CET 2009
Ravi Varadhan wrote:
> Fine detective work, David. Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).
Yes, that is a big frustration for me too, even for projects on which I
was the principal statistician in 1990 and did a poor job of archiving
excellent medical datasets for future use. This is a big advertisement
for the reproducible research movement.
David - fantastic job. Based on what you found, the version on our web
site looks as good as any. Now if someone can explain to me why you see
a spike near a serum prostate acid phosphatase (AP) value of 1 when you
use a flexible regression model (e.g., restricted cubic spline) to
relate AP to the log hazard of death in a survival model (see p. 518 in
my book), that would be very helpful.
If you do

    with(prostate, plot(supsmu(log(ap), 1*(status != 'alive'))))

you see a minimum at ap=2.37 after anti-logging. If you do

    library(rms)  # datadist, cph, rcs (Design in older installations)
    dd <- datadist(prostate); options(datadist='dd')
    f <- cph(Surv(dtime, status != 'alive') ~ rcs(log(ap), 6), data=prostate)

you see a sharp minimum at ap=1.43. With 4 knots the min is at 1.18.
You have to go to 3 knots to get a monotonic fit in log(ap), but AIC is
not as good.
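A minimal sketch of the knot comparison described above, not the code that produced the numbers quoted. It uses coxph() from survival and ns() from splines as stand-ins for cph() and rcs(). ns() also fits a natural cubic spline, but it puts its outer knots at the extremes of log(ap) rather than at outer quantiles, so the minima and AIC values it reports can be expected to differ somewhat from those above. The helper names spline_minimum and compare_knots are hypothetical, and the data frame is assumed to have columns ap, dtime and status as in the calls above.

```r
library(survival)  # coxph(), Surv()
library(splines)   # ns(): natural cubic spline basis

# Fit a Cox model with a k-knot natural cubic spline in log(ap) and return
# the AIC together with the ap value where the fitted log hazard is lowest.
# Assumption: status == "alive" marks censored patients.
spline_minimum <- function(d, k) {
  d$lap   <- log(d$ap)
  d$death <- d$status != "alive"
  # df = k - 1 gives k - 2 interior knots plus 2 boundary knots (k in total)
  fit  <- coxph(Surv(dtime, death) ~ ns(lap, df = k - 1), data = d)
  rng  <- range(d$lap, na.rm = TRUE)
  grid <- data.frame(lap = seq(rng[1], rng[2], length.out = 400))
  lp   <- predict(fit, newdata = grid, type = "lp")
  data.frame(knots     = k,
             aic       = AIC(fit),
             ap_at_min = exp(grid$lap[which.min(lp)]))  # anti-log the location
}

# One row per knot count, so AIC and the location of the minimum can be
# compared side by side.
compare_knots <- function(d, ks = 3:6) {
  do.call(rbind, lapply(ks, spline_minimum, d = d))
}
```

With the data loaded as prostate, compare_knots(prostate) returns one row for each of 3 to 6 knots. A minimum that sits at the lowest ap value for 3 knots would be consistent with the monotonic fit mentioned above.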
> Ravi Varadhan, Ph.D.
> Assistant Professor,
> Division of Geriatric Medicine and Gerontology
> School of Medicine
> Johns Hopkins University
> Ph. (410) 502-2619
> email: rvaradhan at jhmi.edu
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> Date: Tuesday, March 24, 2009 10:54 pm
> Subject: Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
> To: Rolf Turner <r.turner at auckland.ac.nz>
> Cc: R-help Forum <r-help at r-project.org>, Ravi Varadhan <rvaradhan at jhmi.edu>
>> On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:
>> > On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
>> > <snip>
>> >>> (2) Scrolling down to ``Byar and Green prostate cancer data'' appeared
>> >>> to get me to the right place. But I couldn't see any signs of any
>> >>> ``R binary files''.
>> >> Please look again. It's under the heading "R". Unfortunately I used
>> >> .sav suffix for save() files in the old days.
>> > Ah-ha. Oh me of little faith. I have been hanging around (in
>> > my current work environment) with too many SPSS users, and the
>> > *.sav extension seems to be the standard for SPSS data files.
>> > Whence my corrupted thinking.
>> >> The .xls file opened with no problem in OpenOffice; has 506 rows.
>> > Hmmm. When I opened it with Excel on the Mac I got a spreadsheet
>> > with 503 rows --- the first row being the column names,
>> > so there were really 502 rows.
>> The last "patnr" is "506" but there are only 502 lines of data. 471,
>> 473, 475 and 488 are missing.
>> And the CMU Statlib version for 2002 looks the same.
>> The version at this site is missing more than 25 cases:
>> Here are two other copies of the dataset, the first of which appears
>> to have those missing cases:
>> This one has patient numbers:
>> This one has a description of the fields and cites the one above but
>> has not retained the patient numbers and has apparently only kept the
>> 475 cases with complete data.
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University