[R] Re: Buying more computer for GLM

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Sep 1 15:07:14 CEST 2006


On Fri, 1 Sep 2006, g.russell at eos-finance.com wrote:

> Prof Brian Ripley wrote

> > 3) As I recall, you were doing model selection via AIC on 20,000 
> >    observations.  You might want to think hard about that, since AIC is 
> >    designed for good prediction.  I would do model exploration on a much 
> >    smaller representative subset, and if I had 20,000 observations and 
> >    30 parameters and was interested in prediction, not do subset 
> >    selection at all.
> 
> One problem is that some of the parameters in the learning set can be 
> very highly correlated (I have no control over the observations), and 
> I'm worried that if I don't prune away parameters which don't improve 
> the log likelihood, my predictions will be busted by inputs which do not 
> exhibit the same linear relationships as those of most of the learning 
> set.  Of course in such a case you'd have to worry about the accuracy of 
> the predictions anyway, but in my job we just have to make the best 
> predictions we can, even if they aren't perfect.

In that case I would probably not use AIC as my criterion.  Suppose this 
were logistic regression.  Then if I was doing very well in my 
predictions, the AIC would be around 5,000, and I can only reduce it by 60 
by dropping parameters. So variables will be dropped only if they are 
almost completely useless.  I don't think it is a statistical decision as 
to which of two very similar predictors to keep, and in your size of 
problem AIC is quite likely to keep both.
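
A minimal sketch of that arithmetic in R, assuming the usual definition AIC = -2*logLik + 2*k, the 30 parameters mentioned earlier in the thread, and the rough AIC of 5,000 quoted above (illustrative figures, not fitted values; the helper drop_improves_aic is hypothetical):

```r
n_params <- 30      # parameters in the full model (from the thread)
aic_full <- 5000    # rough AIC of a well-predicting model (figure quoted above)

# Each dropped parameter removes 2 from the penalty term, so dropping
# all of them can lower the AIC by at most:
max_penalty_saving <- 2 * n_params               # 60

# That is a small share of the total:
penalty_share <- max_penalty_saving / aic_full   # 0.012

# A single term is dropped only if removing it raises the deviance
# (-2 * logLik) by less than 2.
drop_improves_aic <- function(deviance_increase) deviance_increase < 2
drop_improves_aic(c(0.5, 3))                     # TRUE FALSE
```

With 20,000 observations even a weak predictor can easily change the deviance by more than 2, which is why few terms would be dropped.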

> > 4) glm() allows you to specify starting parameters, which you could 
> >    find from a subsample.  Very likely only 1 or 2 iterations would be 
> >    needed.
> 
> This sounds like a good idea, but what in fact I do now is build a model 
> using simple linear regression (lm), which is very fast, in the hope that 
> that will pick out the important parameters, which I can then feed to glm.

I would not have expected glm to be more than, say, 5x slower than lm if CPU 
cycles and not memory were the limiting factor.  If it is much slower than 
that, memory is probably the bottleneck, and more RAM might be all you need.
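
The subsample starting values from point 4 and the lm/glm speed comparison could be tried along these lines. This is only a sketch on simulated data: the variable names, the logistic model, the coefficients and the subsample size of 2,000 are all assumptions for illustration, not taken from the thread.

```r
set.seed(1)

# Simulated stand-in for the real data (assumed: 20,000 rows, 2 predictors)
n   <- 20000
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 - 0.7 * x2))
dat <- data.frame(y, x1, x2)

# Fit on a random subsample to get cheap starting values
sub     <- dat[sample(n, 2000), ]
fit_sub <- glm(y ~ x1 + x2, family = binomial, data = sub)

# Full-data fits: default start versus start taken from the subsample fit
fit_cold <- glm(y ~ x1 + x2, family = binomial, data = dat)
fit_warm <- glm(y ~ x1 + x2, family = binomial, data = dat,
                start = coef(fit_sub))

# Number of IRLS iterations each fit used
c(cold = fit_cold$iter, warm = fit_warm$iter)

# Both fits should reach essentially the same estimates
all.equal(coef(fit_cold), coef(fit_warm), tolerance = 1e-4)

# Rough timing of lm against glm on the same data (elapsed seconds)
system.time(lm(y ~ x1 + x2, data = dat))["elapsed"]
system.time(glm(y ~ x1 + x2, family = binomial, data = dat))["elapsed"]
```

If the glm timing is far more than a few times the lm timing on the real data, that would point to memory rather than CPU as the constraint.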

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


