[R] Guidance on step() with large dataset (750K) solicited...
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Apr 14 09:41:54 CEST 2006
On Thu, 13 Apr 2006, roger koenker wrote:
> I don't know whether this is likely to be feasible, but if you could
> replace calls to lm() with calls to a sparse matrix version of lm()
> either slm() in SparseM or something similar in Matrix, then I
> would think that you should be safe from memory problems. Adapting step
> might be more than you really bargained for though, I don't
> know the code....
step() is a simple wrapper that has been used with many model-fitting classes.
All you need is an extractAIC method.
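
As a minimal sketch of what such a method might look like: the class name
"tinyfit" and the fitting function tinyfit() below are made up for
illustration and are not part of any package. The method returns the
equivalent degrees of freedom and the criterion value, using the same
Gaussian formula as the lm method (n * log(RSS/n) + k * edf).

```r
## Hypothetical fitter: least squares via lm.fit(), returning a bare-bones
## object of the made-up class "tinyfit" (keeps only what extractAIC needs).
tinyfit <- function(X, y) {
  fit <- lm.fit(X, y)
  structure(list(rss = sum(fit$residuals^2),
                 n   = length(y),
                 edf = fit$rank),
            class = "tinyfit")
}

## extractAIC method: a length-2 numeric vector, c(edf, criterion).
## k = 2 gives AIC; k = log(n) gives BIC.
extractAIC.tinyfit <- function(fit, scale = 0, k = 2, ...) {
  dev <- fit$n * log(fit$rss / fit$n)
  c(fit$edf, dev + k * fit$edf)
}

## Small simulated example (data invented for illustration only).
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- tinyfit(cbind(1, x), y)

aic <- extractAIC(fit)               # AIC-type penalty
bic <- extractAIC(fit, k = log(50))  # BIC-type penalty
```

Note that step() also calls update() and terms() on the fitted object, so a
real class would need those to work as well; the sketch covers only the
extractAIC part.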
> url: www.econ.uiuc.edu/~roger Roger Koenker
> email rkoenker at uiuc.edu Department of Economics
> vox: 217-333-4558 University of Illinois
> fax: 217-244-6678 Champaign, IL 61820
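
To illustrate the sparse-matrix idea quoted above, here is a rough sketch
using sparse.model.matrix() from the Matrix package and the normal
equations. It is not slm() from SparseM, and it is not a drop-in replacement
for lm(); the simulated data and variable names are assumptions made purely
for illustration. With many unordered factors the design matrix is mostly
zeros, which is where a sparse representation saves memory.

```r
library(Matrix)

## Simulated data: one unordered factor and one continuous covariate.
set.seed(1)
n <- 1000
d <- data.frame(f = factor(sample(letters[1:5], n, replace = TRUE)),
                x = rnorm(n))
d$y <- 1 + 0.5 * d$x + as.integer(d$f) + rnorm(n)

## Sparse design matrix (dummy columns for f are stored sparsely).
X <- sparse.model.matrix(~ f + x, data = d)

## Least squares via the normal equations: (X'X) beta = X'y.
XtX  <- crossprod(X)
Xty  <- crossprod(X, d$y)
beta <- unname(drop(as.matrix(solve(XtX, Xty))))

## Dense reference fit for comparison.
beta.lm <- unname(coef(lm(y ~ f + x, data = d)))
```

Solving the normal equations is less numerically stable than the QR
decomposition used by lm(), so this is only meant to show the memory idea.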
> On Apr 13, 2006, at 2:41 PM, Jeffrey Racine wrote:
>> Background - I am working with a dataset involving around 750K
>> observations, where many of the variables (8/11) are unordered factors.
>> The typical model used to model this relationship in the literature has
>> been a simple linear additive model, but this is rejected out of
>> hand by the data. I was asked to model this via kernel methods, but
>> first wanted to play with the parametric specification out of curiosity.
>> I thought it would be interesting to see what type of model
>> stepwise BIC
>> would yield, and have been playing with the step() function (on R-beta
>> due to the factor.scope() problem that has been fixed in the
>> patched and
>> beta version).
>> I am running this on a 64bit box with 32GB of RAM and tons of swap, but
>> am hitting the memory wall as memory needs occasionally grow very large
>> (in the early iterations the program starts out around 8GB
>> but quickly grows to 15GB, then grows from there). This is not due
>> to my using the beta version, as this also arises under R-2.2.1, for
>> what it is worth.
>> My question is whether or not there is some simple way to
>> reduce the memory footprint for this procedure. I took a look at
>> previous posts for step() and memory issues, but still wonder whether
>> there might be a switch or possibly a better way of constructing my
>> model that would overcome the memory issues.
>> I include the code below, and any comments or suggestions would be
>> welcome (besides `what type of idiot lets information criteria
>> select their model ;-)')
>> Thanks ever so much in advance.
>> -- Jeff
>> ---- Begin ----
>> ## Read in the full data set (n=745466 observations)
>> data <- read.table("../data_header.dat",header=TRUE)
>> ## Create a data frame with all categorical variables declared as
>> ## unordered factors
>> data <- data.frame(logrprice=data$logrprice,
>> ## Estimate a simple linear model (used repeatedly in the literature,
>> ## fails the most simple of model specification tests e.g.,
>> ## resettest())
>> model.linear <- lm(logrprice~.,data=data)
>> ## Now conduct stepwise (BIC) regression using the step() function in
>> ## the stats library. The lower model is the unconditional mean of y,
>> ## the upper having polynomials of up to order 6 in the three
>> ## continuous covariates, with interaction among all variables of
>> ## order 2.
>> n <- nrow(data)
>> model.bic <- step(model.linear,
>> lower=~ 1,
>> upper=~ (.
>> ---- End ----
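
For readers following along, a small self-contained sketch of stepwise
selection under a BIC penalty with step(): the data are simulated and the
variable names are invented, so this is not a reconstruction of the code
above. The BIC penalty is obtained by passing k = log(n), and the search
range is given through the scope argument.

```r
## Simulated data: one unordered factor, two continuous covariates.
set.seed(42)
n <- 500
d <- data.frame(g  = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
                x1 = rnorm(n),
                x2 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + (d$g == "b") + rnorm(n)   # x2 is pure noise

## Starting model: linear and additive in all covariates.
base <- lm(y ~ ., data = d)

## Stepwise search between the intercept-only model and the model with all
## two-way interactions; k = log(n) makes the criterion BIC instead of AIC.
fit.bic <- step(base,
                scope = list(lower = ~ 1, upper = ~ .^2),
                k = log(n), trace = 0)

selected <- attr(terms(fit.bic), "term.labels")
```

Each step refits one model per candidate term via update(), so on a very
large dataset the memory use depends on how many fitted models and model
frames are alive at once.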
>> Professor J. S. Racine Phone: (905) 525 9140 x 23825
>> Department of Economics FAX: (905) 521-8232
>> McMaster University e-mail: racinej at mcmaster.ca
>> 1280 Main St. W.,Hamilton, URL:
>> Ontario, Canada. L8S 4M4
>> `The generation of random numbers is too important to be left to chance.'
>> R-help at stat.math.ethz.ch mailing list
>> PLEASE do read the posting guide! http://www.R-project.org/posting-
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595