[R] More efficient option to append()?

Paul Hiemstra paul.hiemstra at knmi.nl
Fri Aug 19 15:50:53 CEST 2011


On 08/17/2011 10:53 PM, Alex Ruiz Euler wrote:
> Dear R community,
>
> I have a 2 million by 2 matrix that looks like this:
>
> x<-sample(1:15,2000000, replace=T)
> y<-sample(1:10*1000, 2000000, replace=T)
>       x     y
> [1,] 10  4000
> [2,]  3  1000
> [3,]  3  4000
> [4,]  8  6000
> [5,]  2  9000
> [6,]  3  8000
> [7,]  2 10000
> (...)
>
>
> The first column is a population expansion factor for the number in the
> second column (household income). I want to expand the second column
> with the first so that I end up with a vector beginning with 10
> observations of 4000, then 3 observations of 1000 and so on. In my mind
> the natural approach would be to create a NULL vector and append the
> expansions:
>
> myvar<-NULL
> myvar<-append(myvar, replicate(x[1],y[1]), 1)
>
> for (i in 2:length(x)) {
> myvar<-append(myvar,replicate(x[i],y[i]),sum(x[1:i])+1)
> }
>
> to end with a vector of sum(x), which in my real database corresponds
> to 22 million observations.
>
> This works fine --if I only run it for the first, say, 1000
> observations. If I try to perform this on all 2 million observations
> it takes long, way too long for this to be useful (I left it running
> 11 hours yesterday to no avail).
>
>
> I know R performs well with operations on relatively large vectors. Why
> is this so inefficient? And what would be the smart way to do this?

Hi Alex,

The other reply already gave you the R way of doing this while avoiding
the for loop. However, there is a more general reason why your for loop
is terribly inefficient. A small set of examples:

largeVector = runif(10e4)
outputVector = NULL
system.time(for(i in 1:length(largeVector)) {
    outputVector = append(outputVector, largeVector[i] + 1)
})
#   user  system elapsed
#  6.591   0.168   6.786

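The problem in this code is that outputVector grows by one element on
every iteration. Each call to append() makes R allocate a new, slightly
longer vector and copy all existing elements into it, so the total work
grows roughly with the square of the final length. That makes the loop
really slow. Several (much) faster alternatives exist: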

# Pre-allocating the outputVector
outputVector = rep(0,length(largeVector))
system.time(for(i in 1:length(largeVector)) {
    outputVector[i] = largeVector[i] + 1
})
#   user  system elapsed
# 0.178   0.000   0.178
# A speed-up of about 37 times; the gap only widens as
# largeVector gets longer
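
Applied to the expansion problem from the original post, pre-allocation
could look roughly like the sketch below. It runs on a small made-up
sample rather than the real data, the names starts, ends and myvar are
only illustrative, and it assumes every expansion factor in x is at
least 1.

```r
# Small made-up sample in the same shape as the original data
x <- c(10, 3, 3, 8, 2)                 # expansion factors
y <- c(4000, 1000, 4000, 6000, 9000)   # household incomes

ends   <- cumsum(x)       # last output position filled by each row
starts <- ends - x + 1    # first output position filled by each row

myvar <- numeric(sum(x))  # allocate the full result once, up front
for (i in seq_along(x)) {
    # Write into the existing vector instead of growing it
    myvar[starts[i]:ends[i]] <- y[i]
}
```

The loop is still there, but each iteration only writes into space that
already exists, so nothing has to be copied along the way.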

# Using apply functions
system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
#   user  system elapsed
#  0.124   0.000   0.125
# Even a bit faster

# Using vectorisation
system.time(outputVector <- largeVector + 1)
#   user  system elapsed
#  0.000   0.000   0.001
# Practically instant: thousands of times faster than the first
# example (the elapsed time is at the limit of the timer's resolution)

Which method is most suitable, and which performs best, is not always
obvious in advance. In any case, all three perform much, much better
than the naive option of letting outputVector grow.
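
For the expansion problem in the original post, the vectorised form is
presumably rep() with a vector of counts. The other reply is not quoted
here, so treat this as a sketch under that assumption rather than as
what was actually suggested. The sample values and the name expanded
are made up for illustration.

```r
# Small made-up sample in the same shape as the original data
x <- c(10, 3, 3, 8, 2)                 # expansion factors
y <- c(4000, 1000, 4000, 6000, 9000)   # household incomes

# rep() with a 'times' vector repeats y[i] exactly x[i] times
expanded <- rep(y, times = x)

length(expanded) == sum(x)   # one entry per expanded observation
head(expanded, 12)           # ten 4000s followed by the first 1000s
```

Because the whole result is built in a single call, there is no
repeated copying at all.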

cheers,
Paul

> Thanks in advance.
> Alex


-- 
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494

http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770


