[R] logistic regression with 50 variables
Claudia Beleites
cbeleites at units.it
Mon Jun 14 16:36:02 CEST 2010
Dear all,
(I sent this first part of the email to John earlier today, but forgot to copy
the list)
Dear John,
> Hi, this is not an R technical question per se. I know there are many excellent
> statisticians on this list, so here are my questions: I have a dataset with ~1800
> observations and 50 independent variables, so there are about 35 samples per
> variable. Is it wise to build a stable multiple logistic model with 50
> independent variables? Any problem with this approach? Thanks
First: I'm not a statistician, but a spectroscopist.
But I do build logistic regression models with far fewer than 1800 samples and
far more variates (e.g. 75 patients / 256 spectral measurement channels), though
I have many measurements per sample: typically several hundred spectra.
Question: are the 1800 real, independent samples?
Model stability is something you can measure.
Do an honest validation of your model with truly _independent_ test data and
measure the stability according to your needs (e.g. do you need stable
parameters or stable predictions?).
(From here on reply to Joris)
> Marc's explanation is valid to a certain extent, but I don't agree with
> his conclusion. I'd like to point out "the curse of
> dimensionality"(Hughes effect) which starts to play rather quickly.
No doubt.
> The curse of dimensionality is easily demonstrated looking at the
> proximity between your datapoints. Say we scale the interval in one
> dimension to be 1 unit. If you have 20 evenly-spaced observations, the
> distance between the observations is 0.05 units. To have a proximity
> like that in a 2-dimensional space, you need 20^2=400 observations. in
> a 10 dimensional space this becomes 20^10 ~ 10^13 datapoints. The
> distance between your observations is important, as a sparse dataset
> will definitely make your model misbehave.
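The arithmetic in the quoted example is easy to verify:

```r
## points needed to keep a 1/20 grid spacing along every axis of the
## unit cube in d dimensions: 20^d
dims <- c(1, 2, 10)
setNames(20^dims, paste0(dims, "D"))
## 20, 400, and 1.024e13 respectively
```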
But won't the distance between groups grow as well?
No doubt that high-dimensional spaces are _very_ unintuitive.
However, the required sample size may grow substantially more slowly if the model
has appropriate restrictions. I remember the recommendation of "at least 5
samples per class and variate" for linear classification models. That is, not to
guarantee a good model, but to have a reasonable chance of getting a stable one.
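For the questioner's setting, that rule of thumb works out as follows (assuming
two classes):

```r
## 5 samples per class and variate, 2 classes, 50 variates
p <- 50; n_classes <- 2; per <- 5
needed <- per * n_classes * p
c(needed = needed, available = 1800, ratio = 1800 / needed)
## needed = 500, so the available 1800 samples exceed the rule of thumb
```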
> Even with about 35 samples per variable, using 50 independent
> variables will render a highly unstable model,
Am I wrong in thinking that there may be a substantial difference between the
stability of predictions and the stability of model parameters?
BTW: if the models are unstable, there is also aggregation (e.g. bagging).
At least for my spectra I can give toy examples, with a physical-chemical
explanation, that yield the same predictions with different parameters (of
course because of correlation between the variates).
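Such aggregation can be sketched as bagged logistic regression. A toy R example
(all data simulated) with two deliberately correlated predictors, where the
individual fits differ in their coefficients but the averaged prediction is
stable:

```r
set.seed(2)
## toy data: x2 is strongly correlated with x1
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
d <- data.frame(x1, x2, y = rbinom(n, 1, plogis(x1)))

B <- 50
pred <- replicate(B, {
  idx <- sample(n, replace = TRUE)                  # bootstrap resample
  fit <- glm(y ~ x1 + x2, data = d[idx, ], family = binomial)
  predict(fit, newdata = d, type = "response")
})
## bagged prediction: average the B predicted probabilities per observation
bagged <- rowMeans(pred)
```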
> as your dataspace is
> about as sparse as it can get. On top of that, interpreting a model
> with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on the meaning of the variables.
E.g. for spectra, a set of model parameters may be interpreted like a spectrum
or a difference spectrum. Of course this has to do with the fact that a parallel
coordinate plot is the more "natural" view of spectra, compared to a point in so
many dimensions.
> and then I didn't even start
> on interactions. No point in trying I'd say. If you really need all
> that information, you might want to take a look at some dimension
> reduction methods first.
Which brings to mind a question I have had for a long time:
I assume that all variables that I know beforehand to be without information are
already discarded.
The dimensionality is then further reduced in a data-driven way (e.g. by PCA or
PLS). The model is built in the reduced space.
How many fewer samples are actually needed, considering that the dimension
reduction is itself a model estimated from the data?
...which of course also means that an honest validation embraces the
data-driven dimensionality reduction as well...
Are there recommendations about that?
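One way to keep the validation honest is to refit the dimension reduction
inside every resampling fold. A sketch with PCA followed by logistic regression
(simulated data; the 5-fold split and the choice of 10 components are arbitrary
assumptions for illustration):

```r
set.seed(3)
## toy data: 300 observations, 50 variates, signal in the first two
n <- 300; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] + X[, 2]))

k <- 5
folds <- sample(rep(1:k, length.out = n))
ncomp <- 10
err <- sapply(1:k, function(f) {
  train <- folds != f
  ## PCA is fitted on the training part of the fold only
  pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)
  Ztr <- pca$x[, 1:ncomp]
  Zte <- predict(pca, X[!train, ])[, 1:ncomp]
  fit <- glm(y[train] ~ ., data = data.frame(Ztr), family = binomial)
  pr <- predict(fit, newdata = data.frame(Zte), type = "response")
  mean((pr > 0.5) != y[!train])                    # fold misclassification rate
})
mean(err)
```

Fitting the PCA on the full data before splitting would leak information from
the test folds into the reduction and make the estimate optimistic.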
The other curious question I have is:
I assume that it is impossible for him to obtain the 10^xy samples required for
comfortable model building.
So what is he to do?
Cheers,
Claudia
--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste
phone: +39 0 40 5 58-37 68
email: cbeleites at units.it