[R] Guidance on step() with large dataset (750K) solicited...
roger koenker
rkoenker at uiuc.edu
Thu Apr 13 21:57:00 CEST 2006
Jeff,
I don't know whether this is likely to be feasible, but if you could
replace calls to lm() with calls to a sparse matrix version of lm(),
either slm() in SparseM or something similar in Matrix, then I
would think that you should be safe from memory problems. Adapting step()
might be more than you really bargained for though; I don't
know the code....
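
As a rough illustration of the sparse idea only (not a drop-in
replacement for step(), and not the slm() interface), the sketch below
builds a sparse design matrix with sparse.model.matrix() from the Matrix
package and solves the normal equations directly. The data frame d and
its variables f1, f2, x, y are simulated stand-ins, not your data, and
the BIC-type score assumes the same n*log(RSS/n) + k*edf form that
extractAIC() reports for lm fits.

```r
## Sketch only: least squares from a sparse design matrix (Matrix package).
library(Matrix)

set.seed(1)
n <- 2000

## Hypothetical stand-in data: two unordered factors, one continuous covariate
d <- data.frame(f1 = factor(sample(letters[1:10], n, replace = TRUE)),
                f2 = factor(sample(LETTERS[1:8], n, replace = TRUE)),
                x  = rnorm(n))
d$y <- 1 + 0.5 * d$x + as.integer(d$f1) / 10 + rnorm(n)

## Factor dummies are mostly zeros, so the design matrix is stored sparsely
X <- sparse.model.matrix(~ f1 + f2 + x, data = d)

## Solve the normal equations X'X b = X'y with sparse linear algebra
XtX <- crossprod(X)
Xty <- crossprod(X, d$y)
beta.sparse <- drop(as.matrix(solve(XtX, Xty)))

## BIC-type score for this candidate model, with k = log(n)
fitted.sparse <- drop(as.matrix(X %*% beta.sparse))
rss <- sum((d$y - fitted.sparse)^2)
bic.sparse <- n * log(rss / n) + log(n) * ncol(X)

## Compare with the dense lm() fit on the same data
fit.dense <- lm(y ~ f1 + f2 + x, data = d)
max.diff  <- max(abs(beta.sparse - coef(fit.dense)))
bic.dense <- extractAIC(fit.dense, k = log(n))[2]
```

On a toy problem like this the two fits should agree to rounding error;
whether the sparse route actually saves memory for you depends on how
sparse the interaction terms in your upper model turn out to be.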
Roger
On Apr 13, 2006, at 2:41 PM, Jeffrey Racine wrote:
> Hi.
>
> Background - I am working with a dataset involving around 750K
> observations, where many of the variables (8/11) are unordered
> factors.
>
> The typical model used to model this relationship in the literature has
> been a simple linear additive model, but this is rejected out of hand by
> the data. I was asked to model this via kernel methods, but first wanted
> to play with the parametric specification out of curiosity.
>
> I thought it would be interesting to see what type of model stepwise BIC
> would yield, and have been playing with the step() function (on R-beta
> due to the factor.scope() problem that has been fixed in the patched and
> beta version).
>
> I am running this on a 64bit box with 32GB of RAM and tons of swap, but
> am hitting the memory wall as occasionally memory needs grow to ungodly
> proportions (in the early iterations the program starts out around 8GB
> but quickly grows to 15GB, then grows from there). This is not due to my
> using the beta version, as this also arises under R-2.2.1 for what that
> is worth.
>
> My question is whether or not there is some simple way to substantially
> reduce the memory footprint for this procedure. I took a look at
> previous posts for step() and memory issues, but still wonder whether
> there might be a switch or possibly better way of constructing my model
> that would overcome the memory issues.
>
> I include the code below, and any comments or suggestions would be most
> welcome (besides `what type of idiot lets information criteria determine
> their model ;-)')
>
> Thanks ever so much in advance.
>
> -- Jeff
>
> ---- Begin ----
>
> ## Read in the full data set (n=745466 observations)
>
> data <- read.table("../data_header.dat",header=TRUE)
>
> ## Create a data frame with all categorical variables declared as
> ## unordered factors
>
> data <- data.frame(logrprice=data$logrprice,
>                    cgt=factor(data$cgt),
>                    cag=factor(data$cag),
>                    gstann=factor(data$gstann),
>                    fhogann=factor(data$fhogann),
>                    gstfhog=factor(data$gstfhog),
>                    luc=factor(data$luc),
>                    municipality=factor(data$municipality),
>                    time=factor(data$time),
>                    distance=data$distance,
>                    logr=data$logr,
>                    loginc=data$loginc)
>
> ## Estimate a simple linear model (used repeatedly in the literature,
> ## fails the most simple of model specification tests e.g.,
> ## resettest())
>
> model.linear <- lm(logrprice~.,data=data)
>
> ## Now conduct stepwise (BIC) regression using the step() function in
> ## the stats library. The lower model is the unconditional mean of y,
> ## the upper having polynomials of up to order 6 in the three
> ## continuous covariates, with interaction among all variables of
> ## order 2.
>
> n <- nrow(data)
>
> model.bic <- step(model.linear,
>                   scope=list(
>                     lower=~ 1,
>                     upper=~ (.
>                              +I(logr^2)
>                              +I(logr^3)
>                              +I(logr^4)
>                              +I(logr^5)
>                              +I(logr^6)
>                              +I(distance^2)
>                              +I(distance^3)
>                              +I(distance^4)
>                              +I(distance^5)
>                              +I(distance^6)
>                              +I(loginc^2)
>                              +I(loginc^3)
>                              +I(loginc^4)
>                              +I(loginc^5)
>                              +I(loginc^6))
>                     ^2),
>                   trace=TRUE,
>                   k=log(n)
>                   )
>
> summary(model.bic)
>
> ---- End ----
> --
>
> `The generation of random numbers is too important to be left to
> chance.'
