[R] More efficient way to use ifelse()? - A follow up
Ian Dworkin
idworkin at msu.edu
Wed May 26 21:04:17 CEST 2010
# Thanks again to everyone who provided suggestions.
# I was curious about which approaches would be the fastest, so I did a
# little benchmarking.
# My approach was by far the worst :)
# The approach suggested by Duncan Murdoch and Peter Langfelder, based
# on indexing, was by far the fastest (~66 times faster than using
# nested ifelse()). All the details can be found below for those who
# are interested. I found it interesting that the variant by Peter
# Langfelder was somewhat slower, given that the only difference was
# explicitly defining the class in the index. What is the speed cost for
# this: O(n) or O(1)?
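# A hedged note on the O(n) versus O(1) question, as a sketch rather than
# a measurement: factor() rebuilds the factor from its labels, which touches
# every element, so its cost should grow with the length of the vector.
# Population is already a factor, so its integer codes can be read directly
# with as.integer(). The names pop, via.factor and via.codes are illustrative.

```r
# Small stand-in for Population, with the same level order as the benchmark
pop <- gl(n = 6, k = 5, length = 60,
          labels = c("Ga", "CO", "CN", "KO", "Ng", "Mw"))
values <- c(500, 2169, 1121, 2500, 300, 625)  # in the order of levels(pop)

# Variant from the thread: rebuild the factor, then convert to numeric codes
via.factor <- values[as.numeric(factor(pop))]

# Assumed alternative: use the existing integer codes directly
via.codes <- values[as.integer(pop)]

# Both routes give the same elevations when no levels are unused
stopifnot(identical(via.factor, via.codes))
```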
# I have one additional question. I would have guessed that
# initializing an empty vector of the right size would have sped up the
# subsequent operation, filling that vector, but it does not seem to
# have much of an effect. Any thoughts?
# i.e. using
N <- 6000000 # number of observations
# elevation <- rep(NA, length(Population)) # This does not really speed things up much.
# (Population is only created below, so this line is run there.)
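# One possible explanation, offered as a sketch: preallocation helps when a
# vector is filled element by element, but an assignment such as
# elevation <- ifelse(...) builds a complete new vector and rebinds the
# name, so whatever was preallocated is simply discarded. The size n and the
# names x, y and z below are illustrative.

```r
n <- 1e4

# Whole-vector assignment: the preallocated vector is replaced, not filled
x <- rep(NA, n)
x <- sqrt(seq_len(n))

# Element-wise filling: here the preallocated vector is actually used
y <- numeric(n)
for (i in seq_len(n)) y[i] <- sqrt(i)

# Growing a vector one element at a time is the case preallocation avoids
z <- numeric(0)
for (i in seq_len(n)) z[i] <- sqrt(i)

# All three routes produce the same values
stopifnot(isTRUE(all.equal(x, y)), isTRUE(all.equal(y, z)))
```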
#####
Population <- gl(n=6, k=5, length=N,
                 labels=c("Ga", "CO", "CN", "KO", "Ng", "Mw"))
# You would like to assign a particular value to each level of
# population (in this case the elevation at which they were collected),
# in a vectorized approach (for speed... pretend this was a big data set).
elevation <- rep(NA, length(Population)) # Just to make a vector of
# the right size, to speed up filling it. In practice it does not seem
# to speed things up.
# My original approach
system.time(
elevation <- ifelse(Population=="CO", 2169,
             ifelse(Population=="CN", 1121,
             ifelse(Population=="Ga", 500,
             ifelse(Population=="KO", 2500,
             ifelse(Population=="Mw", 625,
             ifelse(Population=="Ng", 300, NA))))))
)
# elapsed ~ 12s... by far the slowest approach!
# Suggestions
#Peter Langfelder
values = c(500, 2169, 1121, 2500, 300, 625)
system.time( elevation.PL <- values[as.numeric(factor(Population))] ) # ~ 0.85s
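# Values need to be in the same order as the levels of the factor.
# That is not always the order in which the labels first appear: for a
# character vector such as
# Pop2 <- rep(c("Ga", "CO", "CN", "Ng", "KO", "Mw"), 10)
# levels(factor(Pop2)) is sorted alphabetically, so the values above would not line up.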
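# A small illustration of the ordering issue, as a sketch. The names labs,
# f.chr, f.gl, ok and bad exist only for this example.

```r
labs <- c("Ga", "CO", "CN", "Ng", "KO", "Mw")

# Character vector: factor() sorts the levels alphabetically
Pop2 <- rep(labs, 10)
f.chr <- factor(Pop2)
levels(f.chr)

# gl() keeps the levels in the order the labels were given
f.gl <- gl(n = 6, k = 1, length = 60, labels = labs)
levels(f.gl)

# A values vector ordered as in labs matches f.gl but not f.chr
values <- c(500, 2169, 1121, 300, 2500, 625)
ok  <- values[as.integer(f.gl)]
bad <- values[as.integer(f.chr)]

# The first observation is "Ga" (500 m) in both vectors
stopifnot(ok[1] == 500, bad[1] != 500)
```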
#or
codeToElev = data.frame(codes = c("CO", "CN", "Ga", "KO", "Mw", "Ng"),
                        elev = c(2169, 1121, 500, 2500, 625, 300))
system.time(
elevation.PL.2 <- codeToElev$elev[match(Population, codeToElev$codes)]
)
# ~ 0.5s elapsed
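# The same lookup also works with two plain vectors instead of a data frame.
# A minimal sketch with illustrative names (codes, elev, pop, idx): match()
# compares labels, so the order of the factor levels does not matter, and a
# label that is missing from the table comes back as NA.

```r
codes <- c("CO", "CN", "Ga", "KO", "Mw", "Ng")
elev  <- c(2169, 1121, 500, 2500, 625, 300)

pop <- factor(c("Ga", "Ng", "CO", "XX"))  # "XX" is not in the table

# match() gives the position of each label in codes (NA if absent)
idx <- match(pop, codes)
elev[idx]
```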
# Duncan Murdoch suggested:
# In a case like this, indexing is often clearer than ifelse(). For example,
results <- c(CN=1121, CO=2169, Ga=500, KO=2500, Mw=625, Ng=300)
system.time (
elevation.DM <- results[Population]
)
# 0.181s elapsed
# Duncan's follow-up: don't do this if Population is a factor. A factor
# index is treated as its integer codes, not its labels, so the answer is
# right only when the names of "results" are in the same order as
# levels(Population). That is not the case here (the levels are Ga, CO, CN,
# KO, Ng, Mw), so the timed line above returns the wrong elevations.
# results[as.character(Population)] looks up by label instead (not timed here).
# Generally, vector indexing of atomic vectors and matrices is very
# fast; indexing of data frames is much slower, so if speed is an issue,
# avoid them.
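# A small check of the factor-versus-label point, as a sketch; f, by.code
# and by.label are illustrative names.

```r
results <- c(CN = 1121, CO = 2169, Ga = 500, KO = 2500, Mw = 625, Ng = 300)
f <- factor(c("Ga", "CO", "Ng"),
            levels = c("Ga", "CO", "CN", "KO", "Ng", "Mw"))

by.code  <- results[f]                # indexes by the integer codes 1, 2, 5
by.label <- results[as.character(f)]  # indexes by the labels

names(by.code)    # "CN" "CO" "Mw"
unname(by.label)  # 500 2169 300
```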
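# Jorge Ivan Velez suggests looking at recode() in the car package.
# (as.factor.result is the argument name in the car version used here;
# later releases may name it differently, so check ?recode.)
require(car)
system.time(
elevation.JIV <- recode(Population,
    " 'CN'=1121; 'CO'=2169; 'Ga'=500; 'KO'=2500; 'Mw'=625; 'Ng'=300 ",
    as.factor.result=FALSE)
)
# ~ 3.5s elapsed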
# David Winsemius suggests
system.time(
elevation.DW <- (Population=="CO")* 2169+
(Population=="CN")* 1121+
(Population=="Ga")* 500+
(Population=="KO")* 2500+
(Population=="Mw")* 625+
(Population=="Ng")* 300
)
# ~ 3.2s elapsed
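# One difference from the lookup approaches, shown as a sketch: with the sum
# of logical terms, a label that matches none of the comparisons gets 0
# rather than NA. The names pop, lookup, elev.sum and elev.idx are illustrative.

```r
pop <- factor(c("CO", "Ga", "XX"))  # "XX" has no elevation assigned

# Sum of logical terms: every comparison is FALSE for "XX", so it sums to 0
elev.sum <- (pop == "CO") * 2169 +
            (pop == "Ga") * 500

# Named lookup by label: an unknown label gives NA
lookup <- c(CO = 2169, Ga = 500)
elev.idx <- unname(lookup[as.character(pop)])

elev.sum  # 2169 500 0
elev.idx  # 2169 500 NA
```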
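# Jeff Newmiller suggested using merge(); not implemented here.
# Dennis Murphy suggested switch(). The direct call switch(Population, ...)
# fails because switch() expects a single value, not a vector. Applying it
# element by element, e.g. with sapply(), should work, but it is not timed
# here and is likely to be slow for large N.
elevation.switch <- sapply(as.character(Population), switch,
    "CO" = 2169, "CN" = 1121, "Ga" = 500, "KO" = 2500, "Mw" = 625, "Ng" = 300,
    USE.NAMES = FALSE)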
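# Timings mean little if the answers differ, so here is a sketch of a
# cross-check on a small vector, using base R only. The names pop, lookup
# and the by.* vectors are illustrative, and the sapply() form of switch()
# is an assumption about how that suggestion could be made to work.

```r
labs <- c("Ga", "CO", "CN", "KO", "Ng", "Mw")
pop  <- gl(n = 6, k = 5, length = 120, labels = labs)
lookup <- c(CN = 1121, CO = 2169, Ga = 500, KO = 2500, Mw = 625, Ng = 300)

# Nested ifelse(), as in the original approach
by.ifelse <- ifelse(pop == "CO", 2169,
             ifelse(pop == "CN", 1121,
             ifelse(pop == "Ga", 500,
             ifelse(pop == "KO", 2500,
             ifelse(pop == "Mw", 625,
             ifelse(pop == "Ng", 300, NA))))))

# Label-based lookups
by.match <- unname(lookup)[match(pop, names(lookup))]
by.name  <- unname(lookup[as.character(pop)])

# Code-based lookup, with the values first reordered to match levels(pop)
by.codes <- unname(lookup[levels(pop)])[as.integer(pop)]

# switch() applied one element at a time
by.switch <- sapply(as.character(pop), switch,
                    CO = 2169, CN = 1121, Ga = 500,
                    KO = 2500, Mw = 625, Ng = 300,
                    USE.NAMES = FALSE)

stopifnot(isTRUE(all.equal(by.ifelse, by.match)),
          isTRUE(all.equal(by.match, by.name)),
          isTRUE(all.equal(by.name, by.codes)),
          isTRUE(all.equal(by.codes, by.switch)))
```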
On 26 May 2010 01:25, Ian Dworkin <idworkin at msu.edu> wrote:
> # This is more about trying to find a more efficient way to code some
> simple vectorized computations using ifelse().
>
> # Say you have some vector representing a factor with a number of
> levels (6 in this case), representing the location that samples were
> collected.
>
> Population <- gl( n=6, k=5,length=120, labels =c("CO", "CN","Ga","KO",
> "Mw", "Ng"))
>
>
> # You would like to assign a particular value to each level of
> population (in this case the elevation at which they were collected).
> In a vectorized approach (for speed... pretend this was a big data
> set..)
>
> elevation <- ifelse(Population=="CO", 2169,
> ifelse(Population=="CN", 1121,
> ifelse(Population=="Ga", 500,
> ifelse(Population=="KO", 2500,
> ifelse(Population=="Mw", 625,
> ifelse(Population=="Ng", 300, NA ))))))
>
> # Which is fine, but is a pain to write...
>
> # So I was trying to think about how to vectorize directly. i.e use
> vectors within the test, and for return values for T and F
>
> elevation.take.2 <- ifelse(Population==c("CO", "CN", "Ga", "KO",
> "Mw", "Ng"), c(2169, 1121, 500, 2500, 625, 300), c(NA, NA, NA, NA, NA,
> NA))
>
> # It makes sense to me why this does not work (elevation.take.2), but
> I am not sure how to get it to work. Any suggestions? I suspect it
> involves a trick using "any" or "||" or something, but I can't seem to
> work it out.
>
>
> # Thanks in advance!
>
> # Ian Dworkin
> # idworkin at msu.edu
>
--
Ian Dworkin
Assistant Professor
Department of Zoology
Program in Ecology, Evolutionary Biology & Behaviour
Program in Genetics
Michigan State University
office (517) 432-6733
lab (517) 432-6730
idworkin at msu.edu
https://www.msu.edu/~idworkin/