[R] Fast function for centering and standardizing variables
Greg Snow
Greg.Snow at imail.org
Mon Jun 1 18:58:12 CEST 2009
?scale
?ave
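
A minimal sketch of how those two help pages might be combined, offered as an illustration rather than a tested replacement for the function quoted below. The helper name zscore_ave and its suffix argument are made up for this example; ave() applies a function within each combination of the grouping factors and returns the results in the original row order, and scale() centers and divides by the standard deviation, ignoring NAs. Rows with an NA in any 'by' column are set to NA explicitly, because ave() would otherwise leave their raw values untouched.

```r
library(MASS)  # provides the UScereal data frame used in the original post

# Hypothetical helper (not from the thread): z-score columns within groups.
zscore_ave <- function(data, columns, by, suffix = "CMS") {
  # TRUE for rows where no grouping variable is NA
  noNA <- rowSums(is.na(as.data.frame(by))) == 0
  for (col in columns) {
    # ave(x, g1, g2, ..., FUN = f): apply f to x within each group
    z <- do.call(
      ave,
      c(list(data[[col]]), unname(by),
        list(FUN = function(x) as.vector(scale(x))))
    )
    z[!noNA] <- NA  # ave() leaves rows with an NA group unchanged
    data[[paste0(col, suffix)]] <- z
  }
  data
}

out <- zscore_ave(
  data    = UScereal,
  columns = c("calories", "protein", "sugars"),
  by      = list(mfr = UScereal$mfr, vitamins = UScereal$vitamins)
)
head(out[, c("mfr", "vitamins", "calories", "caloriesCMS")])
```

This version has no explicit loop over groups; the only remaining loop is over the columns, so its cost should depend far less on the number of groups. Groups with a single observation have no usable standard deviation, so their standardized values are not meaningful.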
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Krzysztof Sakrejda-Leavitt
> Sent: Monday, June 01, 2009 7:12 AM
> To: r-help at r-project.org
> Subject: [R] Fast function for centering and standardizing variables
>
> Hi,
>
> I wrote a function to center variables I use in regression and
> standardize them by the standard deviation (below) within certain
> groupings (much like the aggregate function can apply a function to
> groups). This runs fast enough when I have about 50 groups and 50k
> records, but sometimes I end up with 1000 groups or so and it slows
> down considerably. The problem is probably the 'for' loops at the
> group level, but I am having a hard time seeing if there is a good way
> to vectorize that step. Alternatively, is there a fast function
> already implemented for this sort of thing?
>
> If you want to run the function on a test data frame (from package
> MASS), here's the syntax:
>
> library(MASS)
> zscore(data = UScereal, columns = c("calories", "protein", "sugars"),
>        by = list(mfr = UScereal$mfr, vitamins = UScereal$vitamins))
>
> It returns a data frame with new columns appended.
>
> ------------------
> zscore <- function(data, columns, by) {
>   means <- aggregate(x = data[, columns], by = by, FUN = mean, na.rm = TRUE)
>   sdevs <- aggregate(x = data[, columns], by = by, FUN = sd, na.rm = TRUE)
>   # Efficient (?) index for 'na' in any 'by' column. NA => FALSE
>   noNA <- (rowSums(is.na(as.data.frame(by))) == 0)
>
>   for (col in columns) {
>     # Final name for the new column.
>     column <- paste(col, "CMS", sep = "")
>     for (i in seq_len(nrow(means))) {
>       # Allocate objects for indexing on 'by' terms.
>       byTFmean <- by
>       byTFsd <- by
>       for (j in names(by)) {
>         # Construct index for each 'by' term
>         byTFmean[[j]] <- !(data[[j]] == means[[j]][[i]])
>         byTFsd[[j]] <- !(data[[j]] == sdevs[[j]][[i]])
>       }
>       # collapse indexes for 'by' using '&'
>       byTFmean <- (rowSums(as.data.frame(byTFmean)) == 0)
>       byTFsd <- (rowSums(as.data.frame(byTFsd)) == 0)
>       data[[column]][noNA & byTFmean & byTFsd] <-
>         (data[[col]][noNA & byTFmean & byTFsd] - means[[col]][i]) /
>         sdevs[[col]][i]
>     }
>   }
>   return(data)
> }
> ------------------------
>
> Any suggestions are welcome and I'm happy to post back the final code.
>
> Best,
>
> Krzysztof
>
>
> -----------------------------------------------
> Krzysztof Sakrejda-Leavitt
>
> Organismic and Evolutionary Biology
> University of Massachusetts, Amherst
> 319 Morrill Science Center South
> 611 N. Pleasant Street
> Amherst, MA 01003
>
> work #: 413-325-6555
> email: sakrejda at nsm.umass.edu