[R] More efficient option to append()?

Dennis Murphy djmuser at gmail.com
Fri Aug 19 05:11:39 CEST 2011


Hi:

This takes a bit less code, avoids explicit loops (mapply() runs the
iteration internally), and takes about 10 seconds on my system. The
append() approach is slow because append() copies the entire growing
vector on every call, so the total work grows quadratically with the
length of the result:

m <- cbind(x = sample(1:15, 2000000, replace = TRUE),
           y = sample((1:10) * 1000, 2000000, replace = TRUE))
sum(m[, 1])    # length of the expanded result
# [1] 16005804
ff <- function(x, y) rep(y, x)   # x copies of value y
system.time(w <- do.call(c, mapply(ff, m[, 1], m[, 2])))
   user  system elapsed
   9.75    0.00    9.75
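
A side note: because the pieces have unequal lengths, mapply() cannot
simplify and returns a list here, so unlist() gives the same result and
may be cheaper than building a two-million-argument call via do.call()
(an assumption worth timing on your own system):

# same result as do.call(c, ...); unlist() flattens the list directly
w <- unlist(mapply(ff, m[, 1], m[, 2]), use.names = FALSE)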

> length(w)
[1] 16005804
> plyr::count(w)
       x    freq
1   1000 1603184
2   2000 1590599
3   3000 1596661
4   4000 1607112
5   5000 1598571
6   6000 1599195
7   7000 1600475
8   8000 1601718
9   9000 1598896
10 10000 1609393
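
As an aside, rep() is itself vectorized over its times argument, so the
whole expansion can be done in a single call with no loop at all (a
sketch, assuming the same m as above):

# each element of m[, 2] is repeated according to the matching
# count in m[, 1]
w2 <- rep(m[, 2], times = m[, 1])
identical(w, w2)   # should be TRUE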

HTH,
Dennis

PS: It would have been a good idea to keep the OP in the loop of this thread.

On Thu, Aug 18, 2011 at 12:46 AM, Timothy Bates
<timothy.c.bates at gmail.com> wrote:
> This takes a few seconds for 1 million rows and keeps the explicit for-loop form:
>
> numberofSalaryBands = 1000000 # 2000000
> x        = sample(1:15, numberofSalaryBands, replace = TRUE)
> y        = sample((1:10)*1000, numberofSalaryBands, replace = TRUE)
> df       = data.frame(x, y)
> finalN   = sum(df$x)
> myVar    = rep(NA, finalN)  # preallocate the full-length result once
> outIndex = 1
> for (i in 1:numberofSalaryBands) {
>        kount = df$x[i]
>        myVar[outIndex:(outIndex+kount-1)] = rep(df$y[i], kount) # Make x[i] copies of value y[i]
>        outIndex = outIndex + kount
> }
> head(myVar)
> plyr::count(myVar)
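>
> A possible refinement (a sketch, reusing df and finalN from above):
> precompute the write positions with cumsum() and drop the running
> outIndex bookkeeping.
>
> ends   = cumsum(df$x)     # last output slot for each row
> starts = ends - df$x + 1  # first output slot for each row
> myVar2 = rep(NA, finalN)
> for (i in 1:numberofSalaryBands) {
>        myVar2[starts[i]:ends[i]] = df$y[i]  # recycled across the run
> }
> identical(myVar, myVar2)  # should be TRUE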
>
>
> On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:
>
>>
>>
>> Dear R community,
>>
>> I have a 2 million by 2 matrix that looks like this:
>>
>> x <- sample(1:15, 2000000, replace = TRUE)
>> y <- sample(1:10*1000, 2000000, replace = TRUE)
>> m <- cbind(x, y)  # combined into the matrix shown below
>>      x     y
>> [1,] 10  4000
>> [2,]  3  1000
>> [3,]  3  4000
>> [4,]  8  6000
>> [5,]  2  9000
>> [6,]  3  8000
>> [7,]  2 10000
>> (...)
>>
>>
>> The first column is a population expansion factor for the number in the
>> second column (household income). I want to expand the second column
>> with the first so that I end up with a vector beginning with 10
>> observations of 4000, then 3 observations of 1000 and so on. In my mind
>> the natural approach would be to create a NULL vector and append the
>> expansions:
>>
>> myvar <- NULL
>> myvar <- append(myvar, replicate(x[1], y[1]), 1)
>>
>> for (i in 2:length(x)) {
>>   myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
>> }
>>
>> to end up with a vector of length sum(x), which in my real database
>> corresponds to 22 million observations.
>>
>> This works fine if I only run it on the first, say, 1000
>> observations. If I try to run it on all 2 million observations it
>> takes far too long to be useful (I left it running for 11 hours
>> yesterday to no avail).
>>
>>
>> I know R performs well with operations on relatively large vectors. Why
>> is this so inefficient? And what would be the smart way to do this?
>>
>> Thanks in advance.
>> Alex
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>


