[R] Vectorizing a for-loop for cross-validation in R
Berry, Charles
ccberry @end|ng |rom uc@d@edu
Wed Jan 23 19:34:14 CET 2019
See inline.
> On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <aleksandre.gavashelishvili using iliauni.edu.ge> wrote:
>
> I'm trying to speed up a script that otherwise takes days to handle larger
> data sets. So, is there a way to completely vectorize or paralellize the
> following script:
>
> *# k-fold cross validation*
>
> df <- trees # a data frame 'trees' from R.
> df <- df[sample(nrow(df)), ] # randomly shuffles the data.
> k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
> validation.
> folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
> unique numbers for k equally size folds.
> df$ID <- folds # adds fold IDs.
> df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1"
> "pred2" "pred3" to speed up the following loop.
>
> library(mgcv)
>
Rprof()
replicate(100, {
> for(i in 1:k) {
> # looping for different models:
> m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
> m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
> m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
>
> # looping for predictions:
> df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
> df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
> df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
> }
>
})
Rprof(NULL)
summaryRprof()
## read ?Rprof to get a sense of what it does
## read the summary to determine where time is being spent.
## the result was surprising to me. YMMV.
## there may be redundancies that you can eliminate by
## - doing the setup within gam() one time and saving it
## - calling the worker functions by modifying the setup
## in a loop or function and saving the results
> # calculating residuals:
> df$res1 <- with(df, Volume - pred1)
> df$res2 <- with(df, Volume - pred2)
> df$res3 <- with(df, Volume - pred3)
>
> Model <- paste("m", 1:3, sep="") # creates a vector of model names.
>
> # creating a vector of mean-square errors (MSE):
> MSE <- with(df, c(
> sum(res1^2) / nrow(df),
> sum(res2^2) / nrow(df),
> sum(res3^2) / nrow(df)
> ))
>
> model.mse <- data.frame(Model, MSE) # creates a data frame of model names
> and mean-square errors.
> model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
> data frame in order of increasing mean-square errors.
>
> I'd appreciate any help. This code takes several days if run on >=30,000
> different GAM models and 3 predictors. Could you please help with
> re-writing the script into sapply() or foreach()/doParallel format?
>
This is something you should learn to do. It is pretty standard practice. Use the body of your for loop as the body of a function, add arguments, and create a suitable return value. The something like
lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
should work. If it does, then parallel::mclapply(...) should also work.
HTH,
Chuck
More information about the R-help
mailing list