[R] multi-column factor
Rui Barradas
ruipbarradas at sapo.pt
Sun Sep 16 19:26:47 CEST 2012
Hello,
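The obvious simplification is to call union() only once; with 10M rows
that should save time.
I then wondered whether unique() on the concatenated columns would be
faster still. Below, f1 is your original approach, f2 calls union() once,
and f3 uses unique(c(...)).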
# f1: original approach, union() is computed twice
f1 <- function(x){
    x[[1]] <- factor(x[[1]], levels = union(x[[1]], x[[2]]))
    x[[2]] <- factor(x[[2]], levels = union(x[[1]], x[[2]]))
    x
}
# f2: compute the level set once with union() and reuse it
f2 <- function(x){
    levels <- union(x[[1]], x[[2]])
    x[[1]] <- factor(x[[1]], levels = levels)
    x[[2]] <- factor(x[[2]], levels = levels)
    x
}
# f3: same as f2, but build the level set with unique(c(...))
f3 <- function(x){
    levels <- unique(c(x[[1]], x[[2]]))
    x[[1]] <- factor(x[[1]], levels = levels)
    x[[2]] <- factor(x[[2]], levels = levels)
    x
}
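
A note on why f3 can stand in for f2: for plain character vectors,
union(x, y) returns the same values in the same order as unique(c(x, y)).
A minimal check with made-up vectors (a and b below are illustrative, not
the poster's data). It assumes the columns are still character, since c()
on factors returned integer codes before R 4.1.0.

```r
# Small character vectors standing in for the two columns (made-up values).
a <- c("a", "b", "c", "a")
b <- c("b", "c", "d", "d")

lv_union  <- union(a, b)       # distinct values of a, then new ones from b
lv_unique <- unique(c(a, b))   # concatenate first, then drop duplicates

# Both give the same level set, in order of first appearance.
identical(lv_union, lv_unique)  # TRUE
```
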
set.seed(5467)
n <- 1e7
z <- data.frame(a = sample(letters[1:3], n, TRUE),
                b = sample(letters[2:4], n, TRUE),
                stringsAsFactors = FALSE)
t1 <- system.time(z1 <- f1(z))
t2 <- system.time(z2 <- f2(z))
t3 <- system.time(z3 <- f3(z))
identical(z1, z2) #[1] TRUE
identical(z1, z3) #[1] TRUE
rbind(t1, t2, t3)
   user.self sys.self elapsed user.child sys.child
t1      2.55     0.47    3.01         NA        NA
t2      1.57     0.29    1.87         NA        NA
t3      1.51     0.26    1.78         NA        NA
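
If more than two columns draw on the same universe, the same idea can be
written once for any set of columns. This is only a sketch: share_levels
and its cols argument are hypothetical names, not something from base R,
and it assumes the chosen columns are all character.

```r
# Hypothetical helper (not from base R): give several character columns
# one shared level set.
share_levels <- function(x, cols = names(x)) {
    # Collect the distinct values across all chosen columns, once.
    lv <- unique(unlist(x[cols], use.names = FALSE))
    # Convert each chosen column to a factor with that same level set.
    x[cols] <- lapply(x[cols], factor, levels = lv)
    x
}

# Usage on the small data frame from the original question.
d <- data.frame(a = c("a", "b", "c"),
                b = c("b", "c", "d"),
                stringsAsFactors = FALSE)
d <- share_levels(d, c("a", "b"))
str(d)
```

The level set is computed once, as in f3, so the cost of finding the
distinct values does not grow with the number of columns converted.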
Hope this helps,
Rui Barradas
On 16-09-2012 17:46, Sam Steingold wrote:
> I have a data frame with columns which draw on the same underlying
> universe, so I want them to be factors with the same level set:
>
> --8<---------------cut here---------------start------------->8---
>> z <- data.frame(a=c("a","b","c"),b=c("b","c","d"),stringsAsFactors=FALSE)
>> str(z)
> 'data.frame': 3 obs. of 2 variables:
> $ a: chr "a" "b" "c"
> $ b: chr "b" "c" "d"
>> z$a <- factor(z$a,levels=union(z$a,z$b))
>> z$b <- factor(z$b,levels=union(z$a,z$b))
>> str(z)
> 'data.frame': 3 obs. of 2 variables:
> $ a: Factor w/ 4 levels "a","b","c","d": 1 2 3
> $ b: Factor w/ 4 levels "a","b","c","d": 2 3 4
> --8<---------------cut here---------------end--------------->8---
> is factor(z$a,levels=union(z$a,z$b)) the right way to handle this?
> maybe there is a better way to extract levels than union()?
> (bear in mind that I have ~10M rows and ~1M levels, so performance is an
> issue).
>
> Thanks!
>