[R] logistic regression with 50 variables

Claudia Beleites cbeleites at units.it
Mon Jun 14 16:36:02 CEST 2010


Dear all,

(I sent this first part of the email to John earlier today, but forgot to send 
it to the list as well.)
Dear John,

 > Hi, this is not an R technical question per se. I know there are many excellent
 > statisticians on this list, so here is my question: I have a dataset with ~1800
 > observations and 50 independent variables, so there are about 35 samples per
 > variable. Is it wise to build a stable multiple logistic model with 50
 > independent variables? Any problem with this approach? Thanks

First: I'm not a statistician, but a spectroscopist.
But I do build logistic regression models with far fewer than 1800 samples and 
far more variates (e.g. 75 patients / 256 spectral measurement channels), though 
I have many measurements per sample: typically several hundred spectra each.

Question: are the 1800 real, independent samples?

Model stability is something you can measure.
Do an honest validation of your model with truly _independent_ test data, and 
measure stability according to what your needs are (e.g. stable parameters, or 
stable predictions?).
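
A minimal sketch of such a stability check in R, assuming a data frame 'train' 
with binary response y and the predictors, plus a truly independent 'test' set 
(all names here are made up):

set.seed(1)
B <- 200
fits <- lapply(seq_len(B), function(b) {
   i <- sample(nrow(train), replace = TRUE)   # bootstrap resample
   glm(y ~ ., data = train[i, ], family = binomial)
})
## parameter stability: spread of each coefficient across the refits
coefs <- sapply(fits, coef)
apply(coefs, 1, sd)
## prediction stability: spread of the predicted probabilities on the
## independent test set
preds <- sapply(fits, predict, newdata = test, type = "response")
apply(preds, 1, sd)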



(From here on reply to Joris)

 > Marc's explanation is valid to a certain extent, but I don't agree with
 > his conclusion. I'd like to point out "the curse of
 > dimensionality" (Hughes effect), which starts to play a role rather quickly.
No doubt.

 > The curse of dimensionality is easily demonstrated looking at the
 > proximity between your datapoints. Say we scale the interval in one
 > dimension to be 1 unit. If you have 20 evenly-spaced observations, the
 > distance between the observations is 0.05 units. To have a proximity
 > like that in a 2-dimensional space, you need 20^2 = 400 observations. In
 > a 10-dimensional space this becomes 20^10 ~ 10^13 datapoints. The
 > distance between your observations is important, as a sparse dataset
 > will definitely make your model misbehave.

But won't the distance between groups also grow?
No doubt, high-dimensional spaces are _very_ unintuitive.
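
Joris's numbers are quick to reproduce in R:

## points needed to keep a grid spacing of 0.05 as the dimension grows
d <- c(1, 2, 10)
20^d   # 20, 400, ~1e13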

However, the required sample size may grow substantially more slowly if the 
model has appropriate restrictions. I remember the recommendation of "at least 
5 samples per class and variate" for linear classification models; i.e. not 
enough to guarantee a good model, but to give a reasonable chance of getting a 
stable one.
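
For John's data that rule of thumb is easy to check (assuming two classes):

5 * 2 * 50        # rule of thumb: at least 500 samples needed
1800 / (2 * 50)   # 18 samples per class and variate available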

 > Even with about 35 samples per variable, using 50 independent
 > variables will render a highly unstable model,
Am I wrong in thinking that there may be a substantial difference between the 
stability of predictions and the stability of model parameters?

BTW: if the models are unstable, there's also aggregation (e.g. bagging).

At least for my spectra I can give toy examples with a physical-chemical 
explanation that yield the same prediction with different parameters (because 
of correlation, of course).
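
A generic toy version of this (not my spectra, just two almost collinear 
predictors) shows unstable coefficients next to stable predictions; the last 
line also illustrates the aggregation remark above:

set.seed(2)
n   <- 100
x1  <- rnorm(n)
x2  <- x1 + rnorm(n, sd = 0.05)         # almost collinear with x1
y   <- rbinom(n, 1, plogis(x1 + x2))
dat <- data.frame(y, x1, x2)
new <- data.frame(x1 = 0.5, x2 = 0.5)   # fixed query point

res <- replicate(200, {
   fit <- glm(y ~ x1 + x2, family = binomial,
              data = dat[sample(n, replace = TRUE), ])
   c(b1 = unname(coef(fit)["x1"]),
     b2 = unname(coef(fit)["x2"]),
     p  = unname(predict(fit, new, type = "response")))
})
apply(res, 1, sd)   # b1 and b2 vary a lot, p hardly at all
mean(res["p", ])    # averaging the refits = a simple aggregated prediction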

 > as your data space is
 > about as sparse as it can get. On top of that, interpreting a model
 > with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on the meaning of the variables. 
E.g. for the spectra, a set of model parameters may be interpreted like spectra 
or difference spectra. Of course this has to do with the fact that a parallel 
coordinate plot is a more "natural" view of spectra than a point in so many 
dimensions.

 > and I haven't even started
 > on interactions. No point in trying, I'd say. If you really need all
 > that information, you might want to take a look at some dimension
 > reduction methods first.

Which brings to mind a question I've had for a long time:
I assume that all variables I know beforehand to be without information have 
already been discarded.
The dimensionality is then further reduced in a data-driven way (e.g. by PCA or 
PLS), and the model is built in the reduced space.

How many fewer samples are actually needed, considering that the dimension 
reduction is itself a model estimated from the data?
...which of course also means that an honest validation embraces the 
data-driven dimensionality reduction as well...

Are there recommendations about that?
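
Just to make concrete what I mean by "embracing" the reduction, a sketch in R 
with the PCA re-estimated inside each fold (X, y coded 0/1, the number of 
components k, and the 10 folds are all hypothetical choices):

cv_pca_logreg <- function(X, y, k = 10, folds = 10) {
   id  <- sample(rep(seq_len(folds), length.out = nrow(X)))
   err <- numeric(folds)
   for (f in seq_len(folds)) {
      tr  <- id != f
      ## dimension reduction estimated on the training fold only
      pca <- prcomp(X[tr, ], center = TRUE, scale. = TRUE)
      fit <- glm(y[tr] ~ ., family = binomial,
                 data = as.data.frame(pca$x[, 1:k]))
      ## the test fold is projected with the training-fold PCA
      Zte <- as.data.frame(predict(pca, X[!tr, ])[, 1:k])
      p   <- predict(fit, newdata = Zte, type = "response")
      err[f] <- mean((p > 0.5) != y[!tr])
   }
   mean(err)  # misclassification rate, including the PCA step
}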


The other question I'm curious about:
I assume that it is impossible for him to obtain the 10^xy samples required for 
comfortable model building.
So what should he do?


Cheers,

Claudia



-- 
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites at units.it


