[R] Guidance on step() with large dataset (750K) solicited...

Fri Apr 14 09:41:54 CEST 2006

On Thu, 13 Apr 2006, roger koenker wrote:

> Jeff,
>
> I don't know whether this is likely to be feasible, but if you could
> replace calls to lm() with calls to a sparse matrix version of lm()
> either slm() in SparseM or something similar in Matrix, then I
> would think that you should be safe from memory problems.  Adapting step
> might be more than you really bargained for though, I don't
> know the code....

It's a simple wrapper that has been used for many model-fitting classes.
All you need is an extractAIC method.
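
As a minimal sketch of what such a method might look like (the fitter
toyfit() and its class are hypothetical stand-ins for a sparse fitter, not
part of SparseM or Matrix): the object keeps only the residual sum of
squares, the sample size and the rank, and the method returns the same pair
that extractAIC() gives for an lm fit, namely the equivalent degrees of
freedom and n*log(RSS/n) + k*edf. Passing k = log(n) gives the BIC-type
penalty used in the code quoted below.

```r
## Hypothetical fitter: the name and class are illustrative only.
## It keeps just the summaries needed for the criterion, not the model frame.
toyfit <- function(formula, data) {
  mf  <- model.frame(formula, data)
  X   <- model.matrix(attr(mf, "terms"), mf)
  y   <- model.response(mf)
  qrx <- qr(X)
  res <- qr.resid(qrx, y)
  structure(list(rss = sum(res^2), n = length(y), rank = qrx$rank),
            class = "toyfit")
}

## extractAIC method: returns c(edf, criterion), mirroring the lm convention.
extractAIC.toyfit <- function(fit, scale = 0, k = 2, ...) {
  edf <- fit$rank
  dev <- if (scale > 0) fit$rss / scale - fit$n
         else fit$n * log(fit$rss / fit$n)
  c(edf, dev + k * edf)
}

## Compare with the built-in lm method on the 'cars' data, BIC penalty.
bic_toy <- extractAIC(toyfit(dist ~ speed, cars), k = log(nrow(cars)))
bic_lm  <- extractAIC(lm(dist ~ speed, cars),     k = log(nrow(cars)))
all.equal(bic_toy, bic_lm)
```

For step() to drive such a class, update() and the stored call and formula
would also have to work for it, which this sketch does not attempt.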

>
> Roger
>
> url:    www.econ.uiuc.edu/~roger            Roger Koenker
> email    rkoenker at uiuc.edu            Department of Economics
> vox:     217-333-4558                University of Illinois
> fax:       217-244-6678                Champaign, IL 61820
>
>
> On Apr 13, 2006, at 2:41 PM, Jeffrey Racine wrote:
>
>> Hi.
>>
>> Background - I am working with a dataset involving around 750K
>> observations, where many of the variables (8/11) are unordered
>> factors.
>>
>> The typical model used to model this relationship in the literature has
>> been a simple linear additive model, but this is rejected out of hand by
>> the data. I was asked to model this via kernel methods, but first wanted
>> to play with the parametric specification out of curiosity.
>>
>> I thought it would be interesting to see what type of model stepwise BIC
>> would yield, and have been playing with the step() function (on R-beta
>> due to the factor.scope() problem that has been fixed in the patched and
>> beta version).
>>
>> I am running this on a 64bit box with 32GB of RAM and tons of swap, but
>> am hitting the memory wall as occasionally memory needs grow to ungodly
>> proportions (in the early iterations the program starts out around 8GB
>> but quickly grows to 15GB, then grows from there). This is not due to my
>> using the beta version, as this also arises under R-2.2.1 for what that
>> is worth.
>>
>> My question is whether or not there is some simple way to substantially
>> reduce the memory footprint for this procedure. I took a look at
>> previous posts for step() and memory issues, but still wonder whether
>> there might be a switch or possibly better way of constructing my model
>> that would overcome the memory issues.
>>
>> I include the code below, and any comments or suggestions would be most
>> welcome (besides `what type of idiot lets information criteria determine
>> their model ;-)')
>>
>> Thanks ever so much in advance.
>>
>> -- Jeff
>>
>> ---- Begin ----
>>
>> ## Read in the full data set (n=745466 observations)
>>
>> data <- read.table("../data_header.dat",header=TRUE)
>>
>> ## Create a data frame with all categorical variables declared as
>> ## unordered factors
>>
>> data <- data.frame(logrprice=data$logrprice,
>>                    cgt=factor(data$cgt),
>>                    cag=factor(data$cag),
>>                    gstann=factor(data$gstann),
>>                    fhogann=factor(data$fhogann),
>>                    gstfhog=factor(data$gstfhog),
>>                    luc=factor(data$luc),
>>                    municipality=factor(data$municipality),
>>                    time=factor(data$time),
>>                    distance=data$distance,
>>                    logr=data$logr,
>>                    loginc=data$loginc)
>>
>> ## Estimate a simple linear model (used repeatedly in the literature,
>> ## fails the most simple of model specification tests e.g.,
>> ## resettest())
>>
>> model.linear <- lm(logrprice~.,data=data)
>>
>> ## Now conduct stepwise (BIC) regression using the step() function in
>> ## the stats package. The lower model is the unconditional mean of y,
>> ## the upper having polynomials of up to order 6 in the three
>> ## continuous covariates, with interaction among all variables of
>> ## order 2.
>>
>> n <- nrow(data)
>>
>> model.bic <- step(model.linear,
>>                   scope=list(
>>                     lower=~ 1,
>>                     upper=~ (.
>>                              +I(logr^2)
>>                              +I(logr^3)
>>                              +I(logr^4)
>>                              +I(logr^5)
>>                              +I(logr^6)
>>                              +I(distance^2)
>>                              +I(distance^3)
>>                              +I(distance^4)
>>                              +I(distance^5)
>>                              +I(distance^6)
>>                              +I(loginc^2)
>>                              +I(loginc^3)
>>                              +I(loginc^4)
>>                              +I(loginc^5)
>>                              +I(loginc^6))
>>                     ^2),
>>                   trace=TRUE,
>>                   k=log(n)
>>                   )
>>
>> summary(model.bic)
>>
>> ---- End ----
>> --
>> Professor J. S. Racine         Phone:  (905) 525 9140 x 23825
>> Department of Economics        FAX:    (905) 521-8232
>> McMaster University            e-mail: racinej at mcmaster.ca
>> 1280 Main St. W.,Hamilton,     URL:
>> http://www.economics.mcmaster.ca/racine/
>> Ontario, Canada. L8S 4M4
>>
>> `The generation of random numbers is too important to be left to chance.'
>>
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide!
>> http://www.R-project.org/posting-guide.html
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595