[R] Fast function for centering and standardizing variables

Krzysztof Sakrejda-Leavitt krzysztof.sakrejda at gmail.com
Mon Jun 1 15:11:59 CEST 2009


Hi,

I wrote a function (below) that centers the variables I use in
regression and scales them by their standard deviation within certain
groupings (much like the aggregate function can apply a function to
groups).  It runs fast enough when I have about 50 groups and 50k
records, but sometimes I end up with 1000 groups or so and it slows down
considerably.  The problem is probably the 'for' loop over the groups,
but I am having a hard time seeing a good way to vectorize that step.
Alternatively, is there a fast function already implemented for this
sort of thing?

If you want to run the function on a test data frame (from package
MASS), here's the syntax:

library(MASS)
zscore(data = UScereal,
       columns = c("calories", "protein", "sugars"),
       by = list(mfr = UScereal$mfr, vitamins = UScereal$vitamins))

It returns a data frame with new columns appended.

------------------
zscore <- function(data, columns, by) {
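  # drop = FALSE keeps a data frame (and the column name) even for one column.
  means <- aggregate(x = data[, columns, drop = FALSE], by = by, FUN = mean, na.rm = TRUE)
  sdevs <- aggregate(x = data[, columns, drop = FALSE], by = by, FUN = sd, na.rm = TRUE)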
  # Efficient (?) index for 'na' in any 'by' column. NA => FALSE
  noNA <- (rowSums(is.na(as.data.frame(by))) == 0)

  for (col in columns) {
    # Final name for the new column.
    column <- paste(col,"CMS",sep="")
    for (i in seq_len(nrow(means))) {
      # Allocate objects for indexing on 'by' terms.
      byTFmean <- by
      byTFsd <- by
      for (j in names(by)) {
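          # Construct index for each 'by' term (TRUE = row is NOT in group i).
          # Compare against 'by' itself, so its names need not be columns of 'data'.
          byTFmean[[j]] <- !(by[[j]] == means[[j]][[i]])
          byTFsd[[j]] <- !(by[[j]] == sdevs[[j]][[i]])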
      }
      # collapse indexes for 'by' using '&'
      byTFmean <- (rowSums(as.data.frame(byTFmean)) == 0)
      byTFsd <- (rowSums(as.data.frame(byTFsd)) == 0)
      data[[column]][noNA & byTFmean & byTFsd] <-
        (data[[col]][noNA & byTFmean & byTFsd] - means[[col]][i]) / sdevs[[col]][i]
    }
  }
  return(data)
}
------------------------
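
For comparison, here is a rough sketch of how the loop over groups might be avoided with base R's ave(), which applies a function within each level of a grouping factor and returns a vector of the original length. The name zscore_ave is made up for this illustration, and the sketch assumes the same arguments and the same "CMS" column suffix as zscore() above. It has not been timed against the original, so treat any speed gain as something to check rather than a given.

```r
# Hypothetical vectorized variant of zscore(); not from the original post.
zscore_ave <- function(data, columns, by) {
  by <- as.data.frame(by)
  # Rows with an NA in any grouping variable stay NA in the new columns.
  ok <- rowSums(is.na(by)) == 0
  # One label per row, combining all grouping variables into a single factor.
  grp <- interaction(by, drop = TRUE)
  for (col in columns) {
    x <- data[[col]]
    z <- rep(NA_real_, length(x))
    # ave() centers and scales each group in one call, with no loop over groups.
    z[ok] <- ave(x[ok], grp[ok], FUN = function(v) {
      (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
    })
    data[[paste(col, "CMS", sep = "")]] <- z
  }
  data
}
```

It would be called the same way as the original, e.g. zscore_ave(UScereal, c("calories", "protein", "sugars"), list(mfr = UScereal$mfr, vitamins = UScereal$vitamins)). Groups with a single record get NA, since sd() of one value is NA.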

Any suggestions are welcome and I'm happy to post back the final code.

Best,

Krzysztof


-----------------------------------------------
Krzysztof Sakrejda-Leavitt

Organismic and Evolutionary Biology
University of Massachusetts, Amherst
319 Morrill Science Center South
611 N. Pleasant Street
Amherst, MA 01003

work #: 413-325-6555
email: sakrejda at nsm.umass.edu



