[R] How to improve, at all, a simple GLM code

Fri Mar 30 00:11:13 CEST 2012

Hi Ben,

Thank you for all your help so far! I appreciate it.

Yes, I want to find a good predictive model. It's part of a project, so if I have time after finding the model I may look for patterns, but that's not a priority. For now I just want the model (above all, I need the coefficients).

It's all categorical data; I categorised any continuous variables before I started trying to fit the glm.

I was unsure how to get the CSV file to you; however, I have uploaded it, and it should be available for download from here:
http://www.filedropper.com/prepareddata

If not, let me know and I can attach it.

Hopefully this explains a bit more of what I am aiming to do.

Thanks again,

AJC

On 29 Mar, 2012, at 10:19 PM, Ben Bolker <bbolker at gmail.com> wrote:

> Abigail Clifton <abigailclifton <at> me.com> writes:
>
>
> > I am trying to fit a logit model to some data in a CSV file in R.
>
> It would be helpful to link back to your previous question:
>
> http://thread.gmane.org/gmane.comp.lang.r.general/259353
>
> > Here is my code:
> >
> > Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> > Prepared_Data
> > attach(Prepared_Data)
> > lrfit<-glm(C3~A1*B2*D4*E5,family = binomial)
> > anova(lrfit, test="Chisq")
> > write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
> > shell.exec("CWModelA.csv")
>
> This is still not a reproducible example, although
> it's a little closer. Did you read the "recommended reading"
> in my previous answer???
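
One way to make the quoted code reproducible is to replace the private CSV with a small simulated data frame and to pass it through `data =` rather than `attach()`. The sketch below is illustrative only: the column names are borrowed from the post, but the values, level labels, and sample size are made up.

```r
# Simulated stand-in for Prepared_Data.csv (values are made up for illustration)
set.seed(42)
n <- 200
Prepared_Data <- data.frame(
  A1 = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  B2 = factor(sample(c("no", "yes"), n, replace = TRUE)),
  C3 = rbinom(n, 1, 0.3)
)

# Pass the data frame explicitly instead of using attach()
lrfit <- glm(C3 ~ A1 * B2, family = binomial, data = Prepared_Data)
anova(lrfit, test = "Chisq")

# dput() prints a text representation that can be pasted into an email
dput(head(Prepared_Data))
```

Anyone on the list can paste this into a fresh R session and see the same model structure, which is the point of a reproducible example.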
>
>
> > I am unsure as to how many methods there are of choosing a suitable model,
>
> Lots, and it depends very much on why you are doing the analysis in
> the first place. Are you (1) trying to find a good predictive model?
> (2) Looking for interesting patterns in the data? (3) Trying to test
> hypotheses about which predictors have a significant effect on the
> outcome? (4) Partition the variance explained by different predictors?
>
> > however, I was hoping to fit the
> > full/saturated model and choose the significant terms only as
> > my final model.
>
> In general this is a poor choice for goal #1 above, not necessarily
> bad for #2, absolutely terrible for #3, irrelevant for #4. I'm
> guessing you are interested in the best predictive model, since you
> mentioned something in your previous message about working out the
> probability of default on loan applications. I would say your best
> bet is to use penalized approaches (see the glmnet package, and
> library("sos"); findFn("lasso")).
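
As a rough illustration of the penalized approach suggested above, here is a minimal sketch of a lasso-penalized logistic regression with glmnet. It assumes the glmnet package is installed; the data are simulated, and the column names, level counts, and the choice of two-way interactions are assumptions, not taken from the real file.

```r
library(glmnet)  # assumes the glmnet package is installed

# Simulated stand-in for the real data: names and sizes are made up
set.seed(1)
n <- 500
dat <- data.frame(
  A1 = factor(sample(letters[1:3], n, replace = TRUE)),
  B2 = factor(sample(letters[1:4], n, replace = TRUE)),
  D4 = factor(sample(letters[1:2], n, replace = TRUE))
)
dat$C3 <- rbinom(n, 1, plogis(-0.5 + (dat$A1 == "b") - (dat$D4 == "b")))

# Dummy-coded main effects and two-way interactions; drop the intercept
# column because glmnet fits its own intercept
x <- model.matrix(C3 ~ .^2, data = dat)[, -1]
y <- dat$C3

# Lasso (alpha = 1) logistic regression; lambda chosen by cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the cross-validated lambda; many may be exactly zero
coef(cvfit, s = "lambda.min")
```

The penalty does the term selection, so there is no need to fit the saturated model first and then prune it by significance.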
>
> > My first question therefore: is there a better way to fit a model to
> > some data? Is there a function or way of getting R to print the
> > optimum model?
>
>
> > My CSV file, when opened in Excel, contains approximately 3500 rows
> > x 27 columns. I can only seem to run 'anova()' on the saturated/full
> > model including the first four columns/factors. If I take any more
> > into consideration (e.g. if I did C3~A1*B2*D4*E5*F6*G7), R stops
> > responding/I have to force quit. Why is this? How can I get around
> > it as I need to include all 27 columns?
>
> For continuous predictors, the number of parameters of the
> saturated model grows as 2^n; 2^27 is >134 million, so you
> probably don't want to do that. It's potentially even worse
> for categorical predictors (the product of the numbers of levels
> of the factors, so e.g. 3^27 > 7*10^12 for three-level predictors).
>
> It's still not sufficiently clear why you're having a problem
> because you haven't given enough information: in the example I
> gave in my previous answer, I used 7 continuous variables for
> 128 parameters without too much difficulty, but if you had (say)
> 5 levels for each of 7 predictors then you would be trying
> to estimate 78125 parameters ...
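
To make the counts above concrete, here is a minimal R sketch that computes the number of coefficients in a full-interaction model and checks the rule against `model.matrix()` on a toy design. The helper name `n_saturated_params` and the toy factors are hypothetical, introduced only for illustration.

```r
# Number of coefficients in a model with all interactions among factors:
# the product of the numbers of levels (a continuous predictor counts as 2)
n_saturated_params <- function(n_levels) prod(n_levels)

n_saturated_params(rep(2, 27))  # 2^27: 27 continuous or two-level predictors
n_saturated_params(rep(3, 27))  # 3^27: 27 three-level factors
n_saturated_params(rep(5, 7))   # 5^7: 7 five-level factors

# Check on a small made-up design: factors with 2, 3 and 4 levels
# should give 2 * 3 * 4 = 24 design-matrix columns
toy <- expand.grid(a = factor(1:2), b = factor(1:3), c = factor(1:4))
ncol(model.matrix(~ a * b * c, data = toy))
```

With about 3500 rows, any of these counts beyond a few thousand already exceeds the number of observations.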
>
> Bottom line, it may simply not be reasonable to fit the
> saturated model. Hard-core machine learning approaches (and
> *maybe* the penalized regression approaches) might be able
> to handle a few thousand predictors for n=3500, but a model
> with tens of thousands of parameters (or more) feels somewhat crazy.
> (Someone else is welcome to tell me how this could be done.)
>
> Ben Bolker