[R] Fast function for centering and standardizing variables

Mon Jun 1 18:58:12 CEST 2009

?scale
?ave
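
For what it's worth, here is a minimal sketch of how those two could be combined to replace the loops. The helper name zscore_ave and its by_cols argument are made up for illustration and are not from the original post; it assumes the grouping variables are columns of the data frame. Note that ave() leaves rows with a missing group untouched, so the sketch sets them to NA by hand to mimic the original function.

```r
library(MASS)  # provides the UScereal data frame used in the question

# Hypothetical helper (not from the thread): z-score `columns` within the
# groups defined by the data-frame columns named in `by_cols`.
zscore_ave <- function(data, columns, by_cols) {
  # One grouping factor for all 'by' columns; drop = TRUE skips empty cells.
  # Rows with NA in any 'by' column get an NA group.
  grp <- interaction(data[by_cols], drop = TRUE)
  for (col in columns) {
    new_col <- paste0(col, "CMS")
    # ave() applies FUN within each group and returns a vector aligned with
    # the input; scale() centers and divides by the group sd, ignoring NAs.
    data[[new_col]] <- ave(data[[col]], grp,
                           FUN = function(x) as.numeric(scale(x)))
    # ave() leaves rows with an NA group unchanged, so blank them out.
    data[[new_col]][is.na(grp)] <- NA
  }
  data
}

out <- zscore_ave(UScereal, c("calories", "protein", "sugars"),
                  c("mfr", "vitamins"))
```

The work per column is a single split/apply pass rather than one pass per group, so the cost should grow much more slowly with the number of groups. Groups with a single record come out as NA, the same as dividing by an NA standard deviation.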

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Krzysztof Sakrejda-Leavitt
> Sent: Monday, June 01, 2009 7:12 AM
> To: r-help at r-project.org
> Subject: [R] Fast function for centering and standardizing variables
> 
> Hi,
> 
> I wrote a function to center variables I use in regression and
> standardize them by the standard deviation (below) within certain
> groupings (much like the aggregate function can apply a function to
> groups).  This runs fast enough when I have about 50 groups and 50k
> records, but sometimes I end up with 1000 groups or so and it slows
> down considerably.  The problem is probably the 'for' loops at the group
> level but I am having a hard time seeing if there is a good way to
> vectorize that step.  Alternatively, is there a fast function already
> implemented for this sort of thing?
> 
> If you want to run the function on a test data frame (from package
> MASS), here's the syntax:
> 
> library(MASS)
> zscore(data = UScereal, columns = c("calories","protein","sugars"),
>        by = list(mfr = UScereal$mfr, vitamins = UScereal$vitamins))
> 
> It returns a data frame with new columns appended.
> 
> ------------------
> zscore <- function(data, columns, by) {
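>   # drop = FALSE keeps a data frame (and the column name) for a single column.
>   means <- aggregate(x = data[, columns, drop = FALSE], by = by, FUN = mean, na.rm = TRUE)
>   sdevs <- aggregate(x = data[, columns, drop = FALSE], by = by, FUN = sd, na.rm = TRUE)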
>   # Efficient (?) index for 'na' in any 'by' column. NA => FALSE
>   noNA <- (rowSums(is.na(as.data.frame(by))) == 0)
> 
>   for (col in columns) {
>     # Final name for the new column.
>     column <- paste(col,"CMS",sep="")
>     for (i in seq_len(nrow(means))) {
>       # Allocate objects for indexing on 'by' terms.
>       byTFmean <- by
>       byTFsd <- by
>       for (j in names(by)) {
>           # Construct index for each 'by' term
>           byTFmean[[j]] <- !(data[[j]] == means[[j]][[i]])
>           byTFsd[[j]] <- !(data[[j]] == sdevs[[j]][[i]])
>       }
>       # collapse indexes for 'by' using '&'
>       byTFmean <- (rowSums(as.data.frame(byTFmean)) == 0)
>       byTFsd <- (rowSums(as.data.frame(byTFsd)) == 0)
>       rows <- noNA & byTFmean & byTFsd
>       data[[column]][rows] <- (data[[col]][rows] - means[[col]][i]) / sdevs[[col]][i]
>     }
>   }
>   return(data)
> }
> ------------------------
> 
> Any suggestions are welcome and I'm happy to post back the final code.
> 
> Best,
> 
> Krzysztof
> 
> 
> -----------------------------------------------
> Krzysztof Sakrejda-Leavitt
> 
> Organismic and Evolutionary Biology
> University of Massachusetts, Amherst
> 319 Morrill Science Center South
> 611 N. Pleasant Street
> Amherst, MA 01003
> 
> work #: 413-325-6555
> email: sakrejda at nsm.umass.edu