[R] Discretize continuous variables....
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Sun Jul 20 15:11:41 CEST 2008
Johannes Huesing wrote:
> Frank E Harrell Jr <f.harrell at vanderbilt.edu> [Sun, Jul 20, 2008 at 12:20:28AM CEST]:
>> Johannes Huesing wrote:
>>> Because regulatory bodies demand it?
> [...]
>> And how anyway does this
>> relate to predictors in a model?
>
> Not at all; you're correct. I was mixing the topic of this discussion
> up with another kind of silliness.
>
> I had a discussion with a biometrician in a pharmaceutical company
> though who stated that when you have only one df to spend it will be
> better to dichotomise it at a clinically meaningful point than to
> include it as a linear term. He kept the discussion on the ground of
> laboratory measurements like sodium, where a deviation from normal
> ranges is very significant (and unlike, say, cholesterol, where you
> have a gradual interpretation of the value). He has a point there, but
> in general the reason for sacrificing information is a mixture of
> laziness, the preference for presenting data in tables and to keep the
> modelling "consistent" with the tables (for instance to assign an odds
> ratio to each cell).
Nice points. I think the desire to be able to present things in tables
is a major reason.
The biometrician's idea that a piecewise-flat function with one jump will
fit a dataset better than a linear effect is quite a leap in logic. If
I only have one d.f. to spend I'll take linear any day, but it is better
to spend a little more and fit a smooth nonlinear relationship. A coherent
approach is to shrink the fit down to the effective number of parameters
the dataset will support estimating.
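The one-d.f. comparison is easy to check directly. A small simulation sketch (everything here is an illustrative assumption: hypothetical sodium-like values, an assumed smoothly increasing true log-odds, and a cutpoint of 135 standing in for a "clinically meaningful" threshold) shows that when the true effect is gradual, the single linear d.f. beats the single dichotomized d.f.:

```r
## Sketch: two models, one d.f. each, when the true log-odds are smooth.
## Data are simulated; the coefficients and the 135 cutpoint are
## illustrative assumptions, not clinical recommendations.
set.seed(1)
n <- 1000
sodium <- rnorm(n, mean = 140, sd = 5)      # hypothetical lab values
## True relationship: smoothly increasing risk, no jump anywhere
p <- plogis(-12 + 0.08 * sodium)
y <- rbinom(n, size = 1, prob = p)

fit_step <- glm(y ~ I(sodium < 135), family = binomial)  # one d.f., piecewise flat
fit_lin  <- glm(y ~ sodium,          family = binomial)  # one d.f., linear

AIC(fit_step, fit_lin)   # the linear fit should have the lower AIC here
```

Under a smooth truth the step function throws away the within-group gradient, so its deviance is worse for the same d.f. spent.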
There is no clinical laboratory measure that has a jump discontinuity in
its effect on mortality or other patient outcomes. The fact that
reference ranges exist (which are based only on supposedly normal
subjects and don't relate to the risk of an outcome) doesn't mean we
should use them in formulating independent or dependent variables.
It is common but distorted logic to want an odds ratio from a model to
be comparable to one in a table, where regression coefficients were
simply anti-logged (so that one-unit changes could be quoted). The
tabulated odds ratio is a kind of crude population-averaged odds ratio
that may not apply to any single subject in the study.
My book has many examples where laboratory measurements are related to
risk using restricted cubic splines.
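For readers who want to try this, here is a minimal sketch in base R using `splines::ns`. Natural cubic splines, which `ns()` produces, are restricted cubic splines: cubic between the interior knots and constrained to be linear beyond the boundary knots. The data, the U-shaped truth, and the choice of 4 d.f. are all illustrative assumptions for this sketch, not taken from the book:

```r
library(splines)

## Sketch: relate a lab value to risk with a restricted cubic spline.
## Simulated data; the U-shaped truth and df = 4 are assumptions.
set.seed(2)
n <- 1000
sodium <- rnorm(n, mean = 140, sd = 5)
## True log-odds: smooth U-shape, risk rising away from ~140 mmol/L
lp <- -2 + 0.05 * (sodium - 140)^2
y <- rbinom(n, size = 1, prob = plogis(lp))

## ns() builds a natural (restricted) cubic spline basis
fit <- glm(y ~ ns(sodium, df = 4), family = binomial)

## Predicted risk over a grid of sodium values
grid <- data.frame(sodium = seq(125, 155, by = 1))
grid$risk <- predict(fit, newdata = grid, type = "response")
```

The spline spends a few more d.f. than a linear term but recovers the nonmonotonic shape without imposing any cutpoint; plotting `grid$risk` against `grid$sodium` shows the estimated smooth relationship.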
Frank
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University