[R] Which is more efficient?

Fri Aug 5 09:07:38 CEST 2011

Hi:

Your question about efficiency does not seem well-posed to me.
Efficient relative to what criterion?
Rather than address your question directly, I'll show how several
situations that could arise in the general context of your problem
can be handled.

One of the first rules in R programming is to learn the concepts of
vectorization and indexing. This saves a lot of code down the line. R
is not C(++) or Java, and it shouldn't be programmed as though it
were. As a result, iterative approaches to problem solving in R are
usually, but not always, inefficient. R has many vectorized functions
which should be used whenever possible. Usually, the apply family of
functions or one of the summarization packages (notably data.table,
doBy and plyr, although there are others) can be exploited to
repeatedly apply a function to different subsets of data. Consider
three different situations below in which one might want to apply a
t-test. Only one uses iteration. I'm using the plyr package because it
is the most flexible in terms of the types of input and output objects
it can process.
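
As a small illustration of the vectorization point, here is a sketch
with made-up data (the names x, loop_means and vec_means are invented
for this example and have nothing to do with your files). It computes
column means first with an explicit loop and then with the vectorized
colMeans():

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- matrix(rnorm(50), nrow = 10)    # made-up 10 x 5 matrix

# Iterative version: one pass of the loop per column
loop_means <- numeric(ncol(x))
for (j in seq_len(ncol(x))) {
  loop_means[j] <- mean(x[, j])
}

# Vectorized version: a single call handles every column
vec_means <- colMeans(x)

# Compare the two results (up to floating-point tolerance)
all.equal(loop_means, vec_means)
```

The vectorized call is shorter and leaves less room for indexing
mistakes, which is usually the more important gain.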

Let's start by manufacturing some matrix data:

## load plyr, which provides adply() and mdply() used below
library(plyr)

## function to generate a matrix
mgen <- function() matrix(rnorm(50), nrow = 10)
## use replicate() to generate an array
marr <- replicate(4, mgen())   # a 10 x 5 x 4 array
marr

# A matrix of column indices to use in t.test()
tcols <- matrix(c(1, 2, 1, 3, 1, 4, 1, 5), ncol = 2, byrow = TRUE)
colnames(tcols) <- c('i', 'j')
tcols

# ------------------------
# Situation 1: multiple matrices, test the same pair
#              of columns in each, in this case 2 and 4.

# The input argument m is a matrix. A data frame is
# returned because that's what the adply() function in
# the plyr package expects as output (a = array input,
# d = data frame output)
tfun1 <- function(m) {
  v <- t.test(m[, 2], m[, 4], var.equal = TRUE)
  data.frame(tstat = v$statistic, pval = v$p.value)
}

# adply takes the input array marr, iterates over the third index
# and applies tfun1 to each marginal matrix
res1 <- adply(marr, 3, tfun1)
res1

# ------------------------
# Situation 2: one matrix, test multiple pairs of columns

mat <- mgen()    # generate a single matrix
tfun2 <- function(i, j) {
  v <- t.test(mat[, i], mat[, j], var.equal = TRUE)
  data.frame(tstat = v$statistic, pval = v$p.value)
}

# mdply() takes the matrix of column indices as its first
# argument. Notice that tfun2 was written so that its
# arguments are i and j, the column names of tcols.
# This is required: for each row of tcols, mdply() passes
# the row's values to tfun2 as arguments named after the
# columns, and tfun2 then tests those two columns of mat.
res2 <- mdply(tcols, tfun2)
res2

# -------------------
# Situation 3: n matrices, different pairs of columns
#              tested in each

# The idea is to perform a t-test on different pairs of
# columns in each submatrix of marr.

# The simplest thing to do in this situation is to
# iterate, although there is probably some clever way to
# do this using nested apply family calls. The reason for
# iteration is that we want to operate on the same
# relevant index of *both* marr and tcols. It's possible to
# use mapply() for this task, but that would take more
# explanation and this is long-winded enough.

outmat <- matrix(NA, nrow = nrow(tcols), ncol = 4)
for (k in seq_len(nrow(tcols))) {
  mat <- marr[, , k]      # take k-th submatrix of marr
  cols <- tcols[k, ]      # take k-th row of tcols
  v <- t.test(mat[, cols[1]], mat[, cols[2]], var.equal = TRUE)
  outmat[k, ] <- c(cols[1], cols[2], v$statistic, v$p.value)
}
colnames(outmat) <- c('col1', 'col2', 'tstat', 'pval')
outmat
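
For completeness, here is a minimal sketch of the mapply() alternative
mentioned in the comments for Situation 3. It uses base R only and is
self-contained, so it rebuilds marr and tcols in the same way as
before; tfun3 and res3 are names made up for this illustration, and
the seed is arbitrary.

```r
set.seed(123)    # arbitrary seed, only so the sketch is reproducible
marr  <- replicate(4, matrix(rnorm(50), nrow = 10))   # 10 x 5 x 4 array
tcols <- matrix(c(1, 2, 1, 3, 1, 4, 1, 5), ncol = 2, byrow = TRUE)

# k indexes the submatrix of marr; i and j are the columns to compare in it
tfun3 <- function(k, i, j) {
  v <- t.test(marr[, i, k], marr[, j, k], var.equal = TRUE)
  c(col1 = i, col2 = j, tstat = unname(v$statistic), pval = v$p.value)
}

# mapply() steps through k, i and j in parallel; each call returns one
# column of the result, so transpose to get one row per test
res3 <- t(mapply(tfun3, seq_len(nrow(tcols)), tcols[, 1], tcols[, 2]))
res3
```

The result has the same layout as outmat: one row per submatrix, with
the two column indices followed by the t statistic and the p-value.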

Notice that the type of input matters, so the way in which the data
are arranged has much to do with the way you program in R, especially
with the apply family of functions and their offshoots in different
packages. The basic programming strategy is to write a utility
function that works for a generic subset of the input data, and then
use one of the **ply() functions or functions in the apply family to
map the function to different data subsets.
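
To make that strategy concrete for a problem shaped like yours, here
is a sketch in base R. It rests on assumptions: the data frame temp is
simulated as a stand-in for one of your files, and pair_test and
col_pairs are hypothetical names. combn() enumerates every pair of
columns from the second one onward, and the utility function is then
applied to each pair.

```r
set.seed(42)     # arbitrary seed; temp is a simulated stand-in for one file
temp <- data.frame(id = 1:10, a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Utility function: t-test p-value for one pair of column indices
pair_test <- function(idx) {
  t.test(temp[, idx[1]], temp[, idx[2]], var.equal = TRUE)$p.value
}

col_pairs <- combn(2:ncol(temp), 2)        # each column holds one pair
pvals <- apply(col_pairs, 2, pair_test)    # apply the utility to each pair

# Collect one row per pair of columns
pair_res <- data.frame(col1 = names(temp)[col_pairs[1, ]],
                       col2 = names(temp)[col_pairs[2, ]],
                       pval = pvals)
pair_res
```

Note that in the nested loop quoted below, tt_pvalue[i] is overwritten
on every pass of the inner loops, so only the last p-value for each
file survives. Collecting one row per pair, as above, avoids that.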

HTH,
Dennis

On Thu, Aug 4, 2011 at 8:19 PM, Matt Curcio <matt.curcio.ri at gmail.com> wrote:
> Greetings all,
> I am curious to know if either of these two sets of code is more efficient?
>
> Example1:
>  ## t-test ##
> colA <- temp [ , j ]
> colB <- temp [ , k ]
> ttr <- t.test ( colA, colB, var.equal=TRUE)
> tt_pvalue [ i ] <- ttr$p.value
>
> or
> Example2:
> tt_pvalue [ i ] <- t.test ( temp[ , j ], temp[ , k ], var.equal=TRUE)
> -------------
> I have three loops, i, j, k.
> One to test the all of <i> files in a directory.  One to tease out
> column <j> and compare it by means of t-test to column <k> in each of
> the files.
> ---------------
> for ( i in 1:num_files ) {
>   temp <- read.table ( files_to_test [ i ], header=TRUE, sep="\t")
>   num_cols <- ncol ( temp )
>   ## Define Columns To Compare ##
>   for ( j in 2 : num_cols ) {
>      for ( k in 3 : num_cols ) {
>          ## t-test ##
>          colA <- temp [ , j ]
>          colB <- temp [ , k ]
>          ttr <- t.test ( colA, colB, var.equal=TRUE)
>          tt_pvalue [ i ] <- ttr$p.value
>      }
>   }
> }
> --------------------------------
> I am a novice writer of code and am interested to hear if there are
> any (dis)advantages to one way or the other.
> M
>
>
> Matt Curcio
> M: 401-316-5358
> E: matt.curcio.ri at gmail.com