[R] logistic regression with 50 variables
Robert A LaBudde
ral at lcfltd.com
Mon Jun 14 17:04:16 CEST 2010
I think the real issue is why the fit is being
done. If it is solely to interpolate and condense
the dataset, the number of variables is not an important issue.
If the goal is to develop a model that captures
causality, it is hard to believe that this can
be accomplished with 50+ variables. With this
many, some kind of variable hunt would have to be done,
and the resulting model would not be very stable.
It would perhaps be better to first reduce the
variable set by, say, principal components
analysis, so that a reasonably sized set results.
If a stable and meaningful model is the goal,
each term in the final model should be plausibly causal.
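
A minimal sketch of that idea in R (the object names X and y below are
made up for illustration): reduce the predictors with prcomp() and fit
the logistic model on the scores of the first few components.

  ## X: matrix of the ~50 predictors, y: 0/1 response (hypothetical names)
  pc  <- prcomp(X, center = TRUE, scale. = TRUE)  # principal components
  k   <- 5                                        # keep a handful of components
  dat <- data.frame(y = y, pc$x[, 1:k])
  fit <- glm(y ~ ., data = dat, family = binomial)
  summary(fit)

Whether each retained component is then itself plausibly causal is of
course a separate question.
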
At 10:36 AM 6/14/2010, Claudia Beleites wrote:
>Dear all,
>
>(this first part of the email I sent to John
>earlier today, but forgot to put it to the list as well)
>Dear John,
>
> > Hi, this is not an R technical question per se. I know there are many
> > excellent statisticians on this list, so here are my questions: I have
> > a dataset with ~1800 observations and 50 independent variables, so
> > there are about 35 samples per variable. Is it wise to build a stable
> > multiple logistic model with 50 independent variables? Any problem
> > with this approach? Thanks
>
>First: I'm not a statistician, but a spectroscopist.
>But I do build logistic regression models with
>far fewer than 1800 samples and far more variates
>(e.g. 75 patients / 256 spectral measurement
>channels). I do, however, have many measurements per
>sample: typically several hundred spectra per sample.
>
>Question: are the 1800 real, independent samples?
>
>Model stability is something you can measure.
>Do an honest validation of your model with really
>_independent_ test data, and measure the
>stability according to what your stability needs
>are (e.g. stable parameters or stable predictions?).
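>
>A rough sketch of what I mean (object and column names here are just
>placeholders): refit the model on resampled data and look at how much
>the coefficients and the predictions scatter.
>
>  ## dat: data.frame with a 0/1 response y and the ~50 predictors (placeholder)
>  B <- 100
>  coefs <- preds <- vector("list", B)
>  for (b in seq_len(B)) {
>      idx        <- sample(nrow(dat), replace = TRUE)  # bootstrap resample
>      fit        <- glm(y ~ ., data = dat[idx, ], family = binomial)
>      coefs[[b]] <- coef(fit)                          # parameter stability
>      preds[[b]] <- predict(fit, newdata = dat, type = "response")
>  }
>  apply(do.call(rbind, coefs), 2, sd)  # spread of the coefficients
>  apply(do.call(rbind, preds), 2, sd)  # spread of the predicted probabilities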
>
>
>
>(From here on reply to Joris)
>
> > Marc's explanation is valid to a certain extent, but I don't agree with
> > his conclusion. I'd like to point out "the curse of
> > dimensionality" (Hughes effect), which starts to play a role rather quickly.
>No doubt.
>
> > The curse of dimensionality is easily demonstrated looking at the
> > proximity between your datapoints. Say we scale the interval in one
> > dimension to be 1 unit. If you have 20 evenly-spaced observations, the
> > distance between the observations is 0.05 units. To have a proximity
> > like that in a 2-dimensional space, you need 20^2 = 400 observations. In
> > a 10-dimensional space this becomes 20^10 ~ 10^13 datapoints. The
> > distance between your observations is important, as a sparse dataset
> > will definitely make your model misbehave.
>
>But won't the distance between groups also grow?
>No doubt high-dimensional spaces are _very_ unintuitive.
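>
>A quick (admittedly crude) simulation of both effects: nearest-neighbour
>distances grow with the dimension, but so does the distance between the
>centres of two shifted groups.
>
>  set.seed(1)
>  for (d in c(1, 2, 10, 50)) {
>      a  <- matrix(runif(100 * d), ncol = d)        # group A, uniform in [0,1]^d
>      b  <- matrix(runif(100 * d), ncol = d) + 0.5  # group B, shifted per dimension
>      nn <- mean(apply(as.matrix(dist(a)) + diag(Inf, nrow(a)), 1, min))
>      cat(d, "dims:  mean NN distance", round(nn, 2),
>          "  distance between group centres",
>          round(sqrt(sum((colMeans(a) - colMeans(b))^2)), 2), "\n")
>  }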
>
>However, the required sample size may grow
>substantially more slowly if the model has
>appropriate restrictions. I remember the
>recommendation of "at least 5 samples per class
>and variate" for linear classification models.
>I.e. not to guarantee a good model, but to have a
>reasonable chance of getting a stable model.
>
> > Even with about 35 samples per variable, using 50 independent
> > variables will render a highly unstable model,
>Am I wrong in thinking that there may be a
>substantial difference between stability of
>predictions and stability of model parameters?
>
>BTW: if the models are unstable, there's also aggregation.
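>
>A rough sketch of what I mean by aggregation (bagging), reusing the same
>placeholder data.frame dat with response y as in the sketch above:
>average the predicted probabilities of many models fitted to bootstrap
>resamples.
>
>  B <- 50
>  pred_mat <- replicate(B, {
>      idx <- sample(nrow(dat), replace = TRUE)
>      fit <- glm(y ~ ., data = dat[idx, ], family = binomial)
>      predict(fit, newdata = dat, type = "response")
>  })
>  bagged <- rowMeans(pred_mat)  # aggregated probability per observation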
>
>At least for my spectra I can give toy examples
>with a physical-chemical explanation that yield
>the same prediction with different parameters
>(of course because of correlation).
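>
>A stripped-down (non-spectroscopic, purely illustrative) version of such
>a toy example: two almost collinear predictors; the individual
>coefficients jump around between resamples, while the predicted
>probabilities for the same points hardly change.
>
>  set.seed(2)
>  n  <- 200
>  x1 <- rnorm(n)
>  x2 <- x1 + rnorm(n, sd = 0.05)     # nearly collinear with x1
>  y  <- rbinom(n, 1, plogis(x1 + x2))
>  d  <- data.frame(y, x1, x2)
>  for (i in 1:3) {
>      fit <- glm(y ~ x1 + x2, family = binomial,
>                 data = d[sample(n, replace = TRUE), ])
>      print(round(coef(fit), 2))     # parameters vary considerably ...
>      print(round(predict(fit, newdata = d, type = "response")[1:5], 2))
>                                     # ... predictions much less
>  }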
>
> > as your dataspace is
> > about as sparse as it can get. On top of that, interpreting a model
> > with 50 variables is close to impossible,
>No, not necessarily. IMHO it depends very much on
>the meaning of the variables. E.g. for the
>spectra, a set of model parameters may be
>interpreted like spectra or difference spectra.
>Of course this has to do with the fact that a
>parallel coordinate plot is the more "natural"
>view of spectra, compared to a point in so many dimensions.
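>
>(For instance, plotting the coefficient vector against a hypothetical
>wavelength axis, here with random numbers standing in for the fitted
>coefficients:)
>
>  wavelength <- seq(600, 1800, length.out = 256)  # hypothetical wavelength axis
>  coefs      <- rnorm(256)                        # stand-in for coef(fit)[-1]
>  plot(wavelength, coefs, type = "l",
>       xlab = "wavelength", ylab = "logistic regression coefficient")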
>
> > and then I didn't even start
> > on interactions. No point in trying I'd say. If you really need all
> > that information, you might want to take a look at some dimension
> > reduction methods first.
>
>Which brings to my mind a question I've had for a long time:
>I assume that all variables that I know
>beforehand to be without information are already discarded.
>The dimensionality is then further reduced in a
>data-driven way (e.g. by PCA or PLS). The model is built in the reduced space.
>
>How many fewer samples are actually needed,
>considering the fact that the dimension
>reduction is a model estimated on the data?
>...which of course also means that the honest
>validation must embrace the data-driven dimensionality reduction as well...
>
>Are there recommendations about that?
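>
>(What I mean by the validation embracing the reduction is roughly the
>following sketch, with made-up names X, y and an arbitrary number of
>components k: redo the PCA inside every resampling iteration, never on
>the whole data set.)
>
>  k   <- 5
>  B   <- 100
>  err <- numeric(B)
>  for (b in seq_len(B)) {
>      itrain <- sample(nrow(X), floor(0.8 * nrow(X)))  # split first ...
>      pc     <- prcomp(X[itrain, ], center = TRUE, scale. = TRUE)
>                                          # ... then PCA on the training part only
>      train  <- data.frame(y = y[itrain], pc$x[, 1:k])
>      fit    <- glm(y ~ ., data = train, family = binomial)
>      test   <- data.frame(predict(pc, newdata = X[-itrain, ])[, 1:k])
>      p      <- predict(fit, newdata = test, type = "response")
>      err[b] <- mean((p > 0.5) != y[-itrain])          # honest error estimate
>  }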
>
>
>The other curious question I have is:
>I assume that it is impossible for him to obtain
>the 10^xy samples required for comfortable model building.
>So what is he to do?
>
>
>Cheers,
>
>Claudia
>
>
>
>--
>Claudia Beleites
>Dipartimento dei Materiali e delle Risorse Naturali
>Università degli Studi di Trieste
>Via Alfonso Valerio 6/a
>I-34127 Trieste
>
>phone: +39 0 40 5 58-37 68
>email: cbeleites at units.it
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd. URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239 Fax: 757-467-2947
"Vere scire est per causas scire"
================================================================