[R] Logistic regression problem
Pedro.Rodriguez at sungard.com
Wed Oct 1 19:12:06 CEST 2008
Hi Bernardo,
Do you have to use logistic regression? If not, try Random Forests... It has worked for me in past situations where I had to analyze huge data sets.
Some want to understand the data-generating process (DGP) with a simple linear equation; others want high generalization power. It is your call... See, e.g., www.cis.upenn.edu/group/datamining/ReadingGroup/papers/breiman2001.pdf.
Maybe you are also interested in AD-HOC, an algorithm for feature selection, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.9130
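For example, something along these lines (a minimal sketch on simulated data; the names `dat`, `y`, and the 50 noise predictors are hypothetical stand-ins for your real data):

library(randomForest)

## Simulated stand-in: 50 mostly-noise predictors and a binary response
set.seed(1)
n <- 300; p <- 50
X <- matrix(rnorm(n * p), n, p)
dat <- data.frame(y = factor(rbinom(n, 1, plogis(X[, 1] - X[, 2]))), X)

rf <- randomForest(y ~ ., data = dat, importance = TRUE)
print(rf)        # out-of-bag (OOB) error estimate
varImpPlot(rf)   # variable importance: a rough first cut for screening

The OOB error gives you an honest performance estimate without a separate holdout set, which is convenient when you are still screening variables.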
Regards,
Pedro
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Liaw, Andy
Sent: Wednesday, October 01, 2008 12:01 PM
To: Frank E Harrell Jr; tura at centroin.com.br
Cc: r-help at r-project.org
Subject: Re: [R] Logistic regression problem
From: Frank E Harrell Jr
>
> Bernardo Rangel Tura wrote:
> > On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> >> Bernardo Rangel Tura wrote:
> >>> On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >>>> I have a huge data set with thousands of variables and one binary
> >>>> variable. I know that most of the variables are correlated and are
> >>>> not good predictors... but...
> >>>>
> >>>> It is very hard to start modeling with such a huge dataset. What
> >>>> would be your suggestion? How to make a first cut... how to
> >>>> eliminate most of the variables without ignoring potential
> >>>> interactions... for example, maybe variable A is not a good
> >>>> predictor and variable B is not a good predictor either, but maybe
> >>>> A and B together are a good predictor...
> >>>>
> >>>> Any suggestion is welcome
> >>>
> >>> milicic.marko
> >>>
> >>> I think you could start with rpart("binary variable" ~ .).
> >>> This shows you a set of variables with which to start a model, and
> >>> starting cutoff values for the continuous variables.
> >> I cannot imagine a worse way to formulate a regression model.
> >> Reasons include
> >>
> >> 1. Results of recursive partitioning are not trustworthy unless the
> >> sample size exceeds 50,000 or the signal-to-noise ratio is extremely
> >> high.
> >>
> >> 2. The type I error of tests from the final regression model will be
> >> extraordinarily inflated.
> >>
> >> 3. False interactions will appear in the model.
> >>
> >> 4. The cutoffs so chosen will not replicate and in effect assume
> >> that covariate effects are discontinuous and piecewise flat. The use
> >> of cutoffs results in a huge loss of information and power and makes
> >> the analysis arbitrary and impossible to interpret (e.g., a high
> >> covariate value:low covariate value odds ratio or mean difference is
> >> a complex function of all the covariate values in the sample).
> >>
> >> 5. The model will not validate in new data.
> >
> > Professor Frank,
> >
> > Thank you for your explanation.
> >
> > Well, if my first idea is wrong, what is your opinion of the
> > following approach?
> >
> > 1- Run PCA on the data, excluding the binary variable
> > 2- Put the principal components in a logistic model
> > 3- Afterwards, map the principal components back to the original
> > variables (only if that is interesting for milicic.marko)
> >
> > If this approach is wrong too, what is your approach?
>
>
> Hi Bernardo,
>
> If there is a large number of potential predictors and no previous
> knowledge to guide the modeling, principal components (PC) is often an
> excellent way to proceed. The first few PCs can be put into the model.
> The result is not always very interpretable, but you can "decode" the
> PCs by using stepwise regression or recursive partitioning (which are
> safer in this context because the stepwise methods are not exposed to
> the Y variable). You can also add PCs in a stepwise fashion in the
> pre-specified order of variance explained.
>
> There are many variations on this theme, including nonlinear principal
> components (e.g., the transcan function in the Hmisc package), which
> may explain more variance of the predictors.
While I agree with much of what Frank has said, I'd like to add some points.
Variable selection is a treacherous business whether one is interested in
prediction or inference. If the goal is inference, Frank's book is a
must-read, IMHO. (It's great for predictive model building, too.)
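For concreteness, the PC-then-logistic recipe Frank describes above takes only a few lines of R (a minimal sketch on simulated data; the predictors, the response, and the choice of 5 PCs are all hypothetical):

## Simulated stand-in: predictors X and a binary response y
set.seed(1)
n <- 300; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

pc <- prcomp(X, scale. = TRUE)     # PCs of the predictors only
npc <- 5                           # arbitrary: keep the first 5 PCs
fit <- glm(y ~ pc$x[, 1:npc], family = binomial)
summary(fit)

Note that the PCs are computed without looking at y, which is why this is safer than selecting raw predictors against the response.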
If interactions are of high interest, principal components are not going
to give you those.
Regarding cutpoint selection: the machine learners have found that the
`optimal' split points for a continuous predictor in tree algorithms
are extremely variable, so interpreting them would be risky at best.
Breiman essentially gave up on interpretation of a single tree when he
went to random forests.
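A quick way to see this for yourself (a minimal sketch with simulated data; nothing here comes from the actual problem in this thread):

library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- factor(rbinom(n, 1, plogis(dat$x)))

## Refit on bootstrap resamples and record the root-node split point
first.split <- replicate(100, {
    fit <- rpart(y ~ x, data = dat[sample(n, replace = TRUE), ])
    if (!is.null(fit$splits)) fit$splits[1, "index"] else NA
})
summary(first.split)   # the `optimal' cutpoint moves around a lot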
Best,
Andy
> Frank
> --
> Frank E Harrell Jr, Professor and Chair, Department of Biostatistics,
> School of Medicine, Vanderbilt University
>