[R] Fast function for centering and standardizing variables
Krzysztof Sakrejda-Leavitt
krzysztof.sakrejda at gmail.com
Mon Jun 1 15:11:59 CEST 2009
I wrote a function to center variables I use in regression and
standardize them by the standard deviation (below) within certain
groupings (much like the aggregate function can apply a function to
groups). This runs fast enough when I have about 50 groups and 50k
records, but sometimes I end up with 1000 groups or so and it slows down
considerably. The problem is probably the 'for' loops at the group
level, but I am having a hard time seeing whether there is a good way to
vectorize that step. Alternatively, is there a fast function already
implemented for this sort of thing?
If you want to run the function on a test data frame (UScereal, from
package MASS), here's the syntax:
    library(MASS)
    zscore(data = UScereal, columns = c("calories", "protein", "sugars"),
           by = list(mfr = UScereal$mfr, vitamins = UScereal$vitamins))
It returns the data frame with one new column appended per variable,
named by adding the suffix "CMS" (for example, caloriesCMS).
zscore <- function(data, columns, by) {
    # Group means and standard deviations; drop = FALSE keeps a data
    # frame (and the column names) even when only one column is given.
    means <- aggregate(x = data[, columns, drop = FALSE], by = by,
                       FUN = mean, na.rm = TRUE)
    sdevs <- aggregate(x = data[, columns, drop = FALSE], by = by,
                       FUN = sd, na.rm = TRUE)
    # Efficient (?) index for 'na' in any 'by' column. NA => FALSE
    noNA <- (rowSums(is.na(as.data.frame(by))) == 0)
    for (col in columns) {
        # Final name for the new column.
        column <- paste(col, "CMS", sep = "")
        for (i in 1:nrow(means)) {
            # Allocate objects for indexing on 'by' terms.
            byTFmean <- by
            byTFsd <- by
            for (j in names(by)) {
                # Construct index for each 'by' term
                byTFmean[[j]] <- !(data[[j]] == means[[j]][[i]])
                byTFsd[[j]] <- !(data[[j]] == sdevs[[j]][[i]])
            }
            # collapse indexes for 'by' using '&'
            byTFmean <- (rowSums(as.data.frame(byTFmean)) == 0)
            byTFsd <- (rowSums(as.data.frame(byTFsd)) == 0)
            # Rows belonging to group i
            rows <- noNA & byTFmean & byTFsd
            data[[column]][rows] <- (data[[col]][rows] - means[[col]][i]) /
                sdevs[[col]][i]
        }
    }
    # Return the data frame with the new columns appended.
    data
}
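
For comparison with the loop above, here is a minimal sketch of one way the group-level 'for' loops might be avoided, using base R's ave() to compute each row's group mean and standard deviation in a single pass. The name zscore_ave is hypothetical and not part of the original function. The sketch assumes that 'by' is a list of vectors with one element per row of 'data', and it uses those vectors directly rather than looking the grouping columns up by name in 'data'. It has not been timed on the data sizes mentioned above.

```r
# Hypothetical vectorized variant (zscore_ave is an illustrative name).
zscore_ave <- function(data, columns, by) {
    # One factor identifying each combination of the 'by' terms.
    # Rows with NA in any 'by' term get an NA group.
    group <- interaction(by, drop = TRUE)
    ok <- !is.na(group)
    for (col in columns) {
        x <- data[[col]]
        # Rows without a group stay NA in the new column.
        z <- rep(NA_real_, length(x))
        # ave() returns, for every row, the statistic of that row's group.
        mu <- ave(x[ok], group[ok],
                  FUN = function(v) mean(v, na.rm = TRUE))
        sigma <- ave(x[ok], group[ok],
                     FUN = function(v) sd(v, na.rm = TRUE))
        z[ok] <- (x[ok] - mu) / sigma
        # Same naming scheme as the loop version: "CMS" suffix.
        data[[paste(col, "CMS", sep = "")]] <- z
    }
    data
}
```

It takes the same arguments as zscore, so the UScereal example should work by swapping in the function name. Groups with a single non-missing value have an undefined standard deviation and therefore yield NA.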
Any suggestions are welcome and I'm happy to post back the final code.
Krzysztof Sakrejda-Leavitt
Organismic and Evolutionary Biology
University of Massachusetts, Amherst
319 Morrill Science Center South
611 N. Pleasant Street
Amherst, MA 01003
work #: 413-325-6555
email: sakrejda at nsm.umass.edu