[R] splitting a factor column into binary columns for each factor

Tue Jan 26 21:12:59 CET 2010

Yesterday I posted the following question (my apologies for not putting a subject line):

=================question======================
Hello -- I would like to know of a more efficient way of writing the following piece of code. Thanks. 

options(stringsAsFactors=FALSE) 
orig <- c(rep('11111111', 100000), rep('22222222', 200000), rep('33333333', 300000), rep('44444444', 400000))
orig.unique <- unique(orig) 
system.time(df <- as.data.frame(sapply(orig.unique,  function(x) ifelse(orig==x, 1, 0))))
============================================

I received a response via e-mail which was **extremely** useful.

=================answer======================
Using sapply() instead of lapply() here is wasteful.  sapply() calls lapply(), which returns a list, and sapply() then turns that list into a matrix by making each list element a column of the matrix.  Converting the matrix to a data.frame then makes a list again from the columns of the matrix.

The one thing that sapply gives you and lapply doesn't is column names.  If you attach names to orig.unique then lapply's output will have them. 
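
A minimal sketch of the naming point above, using a short stand-in vector (the values "a", "b", "c" and the names unnamed and named are made up for illustration):

```r
# Short stand-in for the vector in the question (illustrative values only)
orig <- c("a", "a", "b", "c")
orig.unique <- unique(orig)

# Without names on the input, lapply() returns an unnamed list
unnamed <- lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))
names(unnamed)

# With names attached to the input, lapply() carries them to its result
names(orig.unique) <- orig.unique
named <- lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))
names(named)
```

The named list can then be passed straight to data.frame(), which uses the list names as column names.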

Also, ifelse(orig==x,1,0) is slower than the equivalent as.numeric(orig==x).  I wrote functions g0 (containing your code), g1 (using lapply), and g2 (ifelse->as.numeric).  I parameterized them by the number of '11111111' elements, and they each return the data.frame created and the time it took to create it:

> g0
function(n = 1e+05) { 
  orig <- c(rep("11111111", n), rep("22222222", 2*n), rep("33333333", 3*n), rep("44444444", 4*n)) 
  orig.unique <- unique(orig) 
  time <- system.time(df <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))) 
  list(time = time, df = df) 
} 

> g1
function (n = 1e+05)  { 
  orig <- c(rep("11111111", n), rep("22222222", 2*n), rep("33333333", 3*n), rep("44444444", 4*n)) 
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique 
  time <- system.time(df <- data.frame(check.names=FALSE, lapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))) 
  list(time = time, df = df) 
}

> g2 
function (n = 1e+05) { 
  orig <- c(rep("11111111", n), rep("22222222", 2*n), rep("33333333", 3*n), rep("44444444", 4*n)) 
  orig.unique <- unique(orig) 
  names(orig.unique) <- orig.unique 
  time <- system.time(df <- data.frame(check.names=FALSE, lapply(orig.unique, function(x) as.numeric(orig == x)))) 
  list(time = time, df = df) 
} 

For n=10^5 the times were 
> g0(1e5)$time 
   user  system elapsed 
  20.65    0.41   20.64 
> g1(1e5)$time 
   user  system elapsed 
   2.35    0.05    2.36 
> g2(1e5)$time 
   user  system elapsed 
   0.73    0.10    0.77 
and the data.frames each produced were identical. 
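
One way to check this kind of claim on a small input is sketched below.  The names df0, df1 and df2 and the small counts are chosen here for illustration; all.equal() is used as the comparison, and identical() would be the stricter test.

```r
# Small version of the problem: 10, 20, 30 and 40 copies of each value
orig <- c(rep("11111111", 10), rep("22222222", 20),
          rep("33333333", 30), rep("44444444", 40))
orig.unique <- unique(orig)

# g0-style: sapply() builds a matrix, which is converted to a data.frame
df0 <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))

# g1- and g2-style: a named list from lapply() goes straight to data.frame()
names(orig.unique) <- orig.unique
df1 <- data.frame(check.names = FALSE,
                  lapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))
df2 <- data.frame(check.names = FALSE,
                  lapply(orig.unique, function(x) as.numeric(orig == x)))

# Compare the three results
all.equal(df0, df1)
all.equal(df1, df2)
```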

Another approach is to use outer() to make a matrix that gets passed to data.frame().  It seems slightly slower than g2, but small changes might make it faster.

> g3 
function (n = 1e+05) { 
    orig <- c(rep("11111111", n), rep("22222222", 2 * n), rep("33333333", 3 * n), rep("44444444", 4 * n)) 
    orig.unique <- unique(orig) 
    names(orig.unique) <- orig.unique 
    time <- system.time(df <- data.frame(check.names=FALSE, outer(orig, orig.unique, function(x, y) as.numeric(x==y)))) 
    list(time = time, df = df) 
}

> g3(1e5)$time 
   user  system elapsed 
   1.02    0.00    0.97 

When you want to optimize code, it is often handy to write functions like these to do the timing for various problem sizes.  You can quickly experiment with small versions of the problem to make sure the results are correct and the times look reasonable, and later check whether the times scale up as hoped to your desired problem size.
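
As a rough sketch of that workflow, the g2 approach can be wrapped in a helper and timed over a few sizes.  The helper name time_indicator and the sizes chosen are assumptions made for illustration, and the timings will depend on the machine.

```r
# Hypothetical helper: build the indicator columns as in g2 and
# return the elapsed time in seconds for a given n
time_indicator <- function(n) {
  orig <- c(rep("11111111", n), rep("22222222", 2 * n),
            rep("33333333", 3 * n), rep("44444444", 4 * n))
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique
  tm <- system.time(
    df <- data.frame(check.names = FALSE,
                     lapply(orig.unique, function(x) as.numeric(orig == x)))
  )
  unname(tm["elapsed"])
}

# Start small, then grow toward the real problem size
sizes <- c(1e3, 1e4, 1e5)
elapsed <- sapply(sizes, time_indicator)
data.frame(n = sizes, elapsed = elapsed)
```

If the elapsed times grow much faster than n, that is a sign to look again at the approach before running the full-size problem.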