[R] Guidance on step() with large dataset (750K) solicited...

Thu Apr 13 21:57:00 CEST 2006

Jeff,

I don't know whether this is likely to be feasible, but if you could
replace the calls to lm() with calls to a sparse-matrix version of lm(),
either slm() in SparseM or something similar in Matrix, then I
would think that you should be safe from memory problems.  Adapting
step() might be more than you really bargained for, though; I don't
know the code...
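
To make the sparse idea concrete, here is a minimal sketch, not a
drop-in replacement for lm() or step(). It assumes a current version of
the Matrix package (for sparse.model.matrix()) and swaps in a plain
normal-equations solve rather than slm() from SparseM. The data frame
`toy` and its columns f1, f2, x, y are made up for illustration; they
are not the variables from the original post.

```r
library(Matrix)  # recommended package shipped with R

set.seed(42)
n <- 5000

# Hypothetical toy data: two unordered factors and one continuous covariate.
toy <- data.frame(
  f1 = factor(sample(letters[1:20], n, replace = TRUE)),
  f2 = factor(sample(LETTERS[1:10], n, replace = TRUE)),
  x  = rnorm(n)
)
toy$y <- 1 + 0.5 * toy$x + as.integer(toy$f1) / 10 + rnorm(n)

form <- y ~ f1 * f2 + x

# Same design matrix, stored densely and sparsely. Dummy columns from
# factors and their interactions are mostly zeros, so the sparse copy
# is much smaller.
X_dense  <- model.matrix(form, data = toy)
X_sparse <- sparse.model.matrix(form, data = toy)
size_dense  <- as.numeric(object.size(X_dense))
size_sparse <- as.numeric(object.size(X_sparse))

# Least squares through the sparse normal equations X'X b = X'y.
XtX  <- crossprod(X_sparse)
Xty  <- crossprod(X_sparse, toy$y)
beta <- as.vector(solve(XtX, Xty))

# Fitted values, RSS, and a BIC-type criterion on the same scale as
# extractAIC(fit, k = log(n)), which is what step() compares.
fitted_sparse <- as.vector(X_sparse %*% beta)
rss           <- sum((toy$y - fitted_sparse)^2)
bic_sparse    <- n * log(rss / n) + log(n) * ncol(X_sparse)

# Dense reference fit, only feasible here because the toy data is small.
fit_dense    <- lm(form, data = toy)
max_fit_diff <- max(abs(fitted_sparse - fitted(fit_dense)))
bic_dense    <- extractAIC(fit_dense, k = log(n))[2]

print(c(dense_bytes = size_dense, sparse_bytes = size_sparse))
print(c(bic_sparse = bic_sparse, bic_dense = bic_dense))
```

The normal equations are less numerically stable than the QR
decomposition lm() uses, and this sketch assumes a full-rank design, so
it is best read as an illustration of the memory saving rather than as
a recommended estimator. A stepwise search would still need its own
loop that refits candidate models and compares the criterion.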

Roger

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    rkoenker at uiuc.edu            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820

On Apr 13, 2006, at 2:41 PM, Jeffrey Racine wrote:

> Hi.
>
> Background - I am working with a dataset involving around 750K
> observations, where many of the variables (8/11) are unordered factors.
>
> The typical model used to model this relationship in the literature has
> been a simple linear additive model, but this is rejected out of hand
> by the data. I was asked to model this via kernel methods, but first
> wanted to play with the parametric specification out of curiosity.
>
> I thought it would be interesting to see what type of model stepwise
> BIC would yield, and have been playing with the step() function (on
> R-beta due to the factor.scope() problem that has been fixed in the
> patched and beta versions).
>
> I am running this on a 64-bit box with 32GB of RAM and tons of swap,
> but am hitting the memory wall as occasionally memory needs grow to
> ungodly proportions (in the early iterations the program starts out
> around 8GB but quickly grows to 15GB, then grows from there). This is
> not due to my using the beta version, as this also arises under
> R-2.2.1 for what that is worth.
>
> My question is whether or not there is some simple way to
> substantially reduce the memory footprint for this procedure. I took a
> look at previous posts for step() and memory issues, but still wonder
> whether there might be a switch or possibly better way of constructing
> my model that would overcome the memory issues.
>
> I include the code below, and any comments or suggestions would be
> most welcome (besides `what type of idiot lets information criteria
> determine their model ;-)')
>
> Thanks ever so much in advance.
>
> -- Jeff
>
> ---- Begin ----
>
> ## Read in the full data set (n=745466 observations)
>
> data <- read.table("../data_header.dat",header=TRUE)
>
> ## Create a data frame with all categorical variables declared as
> ## unordered factors
>
> data <- data.frame(logrprice=data$logrprice,
>                    cgt=factor(data$cgt),
>                    cag=factor(data$cag),
>                    gstann=factor(data$gstann),
>                    fhogann=factor(data$fhogann),
>                    gstfhog=factor(data$gstfhog),
>                    luc=factor(data$luc),
>                    municipality=factor(data$municipality),
>                    time=factor(data$time),
>                    distance=data$distance,
>                    logr=data$logr,
>                    loginc=data$loginc)
>
> ## Estimate a simple linear model (used repeatedly in the literature,
> ## fails the most simple of model specification tests e.g.,
> ## resettest())
>
> model.linear <- lm(logrprice~.,data=data)
>
> ## Now conduct stepwise (BIC) regression using the step() function in
> ## the stats package. The lower model is the unconditional mean of y,
> ## the upper having polynomials of up to order 6 in the three
> ## continuous covariates, with interaction among all variables of
> ## order 2.
>
> n <- nrow(data)
>
> model.bic <- step(model.linear,
>                   scope=list(
>                     lower=~ 1,
>                     upper=~ (.
>                              +I(logr^2)
>                              +I(logr^3)
>                              +I(logr^4)
>                              +I(logr^5)
>                              +I(logr^6)
>                              +I(distance^2)
>                              +I(distance^3)
>                              +I(distance^4)
>                              +I(distance^5)
>                              +I(distance^6)
>                              +I(loginc^2)
>                              +I(loginc^3)
>                              +I(loginc^4)
>                              +I(loginc^5)
>                              +I(loginc^6))
>                     ^2),
>                   trace=TRUE,
>                   k=log(n)
>                   )
>
> summary(model.bic)
>
> ---- End ----
> -- 
> Professor J. S. Racine         Phone:  (905) 525 9140 x 23825
> Department of Economics        FAX:    (905) 521-8232
> McMaster University            e-mail: racinej at mcmaster.ca
> 1280 Main St. W., Hamilton,    URL: http://www.economics.mcmaster.ca/racine/
> Ontario, Canada. L8S 4M4
>
> `The generation of random numbers is too important to be left to
> chance.'