[R] splitting a factor column into binary columns for each factor
Chuck White
chuckwhite8 at charter.net
Tue Jan 26 21:12:59 CET 2010
Yesterday I posted the following question (my apologies for not putting a subject line):
=================question======================
Hello -- I would like to know of a more efficient way of writing the following piece of code. Thanks.
options(stringsAsFactors=FALSE)
orig <- c(rep('11111111',100000),rep('22222222',200000),rep('33333333' ,300000),rep('44444444',400000))
orig.unique <- unique(orig)
system.time(df <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig==x, 1, 0))))
============================================
I received a response via e-mail which was **extremely** useful.
=================answer======================
Using sapply instead of lapply here is a waste. sapply() calls lapply(), which returns a list that sapply() turns into a matrix by making each list element a column of the matrix. as.data.frame(matrix) then makes a list again from the columns of the matrix.
The one thing that sapply gives you and lapply doesn't is column names. If you attach names to orig.unique then lapply's output will have them.
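A minimal sketch of the naming point above, on a small made-up input (the four-element vector is only for illustration and is not from the original post):

```r
# Small hypothetical input; the values mirror the ones in the question.
orig <- c("11111111", "22222222", "22222222", "33333333")
orig.unique <- unique(orig)

# Without names on the input, lapply() returns an unnamed list.
unnamed <- lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))
names(unnamed)

# With names attached, lapply() carries them through to its result.
names(orig.unique) <- orig.unique
named <- lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))
names(named)

# check.names = FALSE keeps data.frame() from rewriting the digit-only names.
df <- data.frame(check.names = FALSE, named)
names(df)
```

The list elements become the columns directly, so no intermediate matrix is built.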
Also, ifelse(orig==x,1,0) is slower than the equivalent as.numeric(orig==x). I wrote functions g0 (containing your code), g1 (using lapply), and g2 (ifelse->as.numeric). I parameterized them by the number of '11111111' elements, and each returns the data.frame it created and the time it took to create it:
> g0
function(n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2*n), rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    time <- system.time(df <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig == x, 1, 0))))
    list(time = time, df = df)
}
> g1
function (n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2*n), rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE, lapply(orig.unique, function(x) ifelse(orig == x, 1, 0))))
    list(time = time, df = df)
}
> g2
function (n = 1e+05) {
    orig <- c(rep("11111111", n), rep("22222222", 2*n), rep("33333333", 3*n), rep("44444444", 4*n))
    orig.unique <- unique(orig)
    names(orig.unique) <- orig.unique
    time <- system.time(df <- data.frame(check.names=FALSE, lapply(orig.unique, function(x) as.numeric(orig == x))))
    list(time = time, df = df)
}
For n=10^5 the times were
> g0(1e5)$time
   user  system elapsed
  20.65    0.41   20.64
> g1(1e5)$time
   user  system elapsed
   2.35    0.05    2.36
> g2(1e5)$time
   user  system elapsed
   0.73    0.10    0.77
and the data.frames each produced were identical.
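As a rough illustration of that check, the three constructions can be compared on a small made-up vector. This is only a sketch: the input is hypothetical, and all.equal() is used here as a tolerant comparison rather than as the test actually run for the timings above.

```r
# Small hypothetical input, not the vector used for the timings.
orig <- c("11111111", "22222222", "22222222", "33333333")
orig.unique <- unique(orig)
names(orig.unique) <- orig.unique

# g0 style: sapply() builds a matrix, as.data.frame() splits it into columns.
df0 <- as.data.frame(sapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))

# g1 style: lapply() returns the columns directly as a named list.
df1 <- data.frame(check.names = FALSE,
                  lapply(orig.unique, function(x) ifelse(orig == x, 1, 0)))

# g2 style: as.numeric() on the logical comparison replaces ifelse().
df2 <- data.frame(check.names = FALSE,
                  lapply(orig.unique, function(x) as.numeric(orig == x)))

# Each comparison should report TRUE if the results agree.
same01 <- isTRUE(all.equal(df0, df1))
same12 <- isTRUE(all.equal(df1, df2))
c(same01, same12)
```

as.numeric() maps TRUE to 1 and FALSE to 0, which is why it can stand in for ifelse(orig == x, 1, 0) here.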
Another approach is to use outer() to make a matrix that gets passed to data.frame(). It seems slightly slower than g2, but small changes might make it faster.
> g3
function (n = 1e+05) {
orig <- c(rep("11111111", n), rep("22222222", 2 * n), rep("33333333", 3 * n), rep("44444444", 4 * n))
orig.unique <- unique(orig)
names(orig.unique) <- orig.unique
time <- system.time(df <- data.frame(check.names=FALSE, outer(orig, orig.unique, function(x, y) as.numeric(x==y))))
list(time = time, df = df)
}
> g3(1e5)$time
   user  system elapsed
   1.02    0.00    0.97
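To make the outer() approach easier to follow, here is a small sketch on a made-up four-element vector (the input is hypothetical; the call itself is the one used in g3):

```r
# Small hypothetical input, not the vector used for the timings.
orig <- c("11111111", "22222222", "22222222", "33333333")
orig.unique <- unique(orig)
names(orig.unique) <- orig.unique

# outer() compares every element of orig with every unique value and
# returns a length(orig) x length(orig.unique) matrix of 0/1 values.
# The column names come from the names attached to orig.unique.
m <- outer(orig, orig.unique, function(x, y) as.numeric(x == y))
dim(m)
colnames(m)

# data.frame() then turns each matrix column into a data frame column.
df <- data.frame(check.names = FALSE, m)
names(df)
```

outer() replicates both inputs to the full length of the result before calling the function, which might account for some of the overhead relative to g2; that is a guess, not something measured here.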
When you want to optimize code, it is often handy to write functions like this to do the timing for various problem sizes. You can quickly experiment with small versions of the problem to make sure the results are correct and the time looks reasonable, and later see whether the times scale up as hoped to your desired problem size.
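One way to run such an experiment over several sizes is sketched below. The helper name time_g2 and the chosen sizes are assumptions made for illustration; the body is the g2 approach from above.

```r
# Hypothetical helper (not from the original post): builds the indicator
# data frame with the g2 approach and returns only the elapsed time.
time_g2 <- function(n) {
  orig <- c(rep("11111111", n), rep("22222222", 2 * n),
            rep("33333333", 3 * n), rep("44444444", 4 * n))
  orig.unique <- unique(orig)
  names(orig.unique) <- orig.unique
  system.time(
    df <- data.frame(check.names = FALSE,
                     lapply(orig.unique, function(x) as.numeric(orig == x)))
  )[["elapsed"]]
}

# Start small, then grow the problem to see how the time scales.
sizes <- c(1e2, 1e3, 1e4)
elapsed <- sapply(sizes, time_g2)
data.frame(n = sizes, elapsed = elapsed)
```

Timings on small inputs can be too short to measure reliably, so the smallest sizes are mainly useful for checking correctness.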