[R] More efficient option to append()?
Dennis Murphy
djmuser at gmail.com
Fri Aug 19 05:11:39 CEST 2011
Hi:
This takes a bit less code, avoids explicit loops (mapply() does the
looping internally), and runs in about 10 seconds on my system:
m <- cbind(x = sample(1:15, 2000000, replace = TRUE),
           y = sample((1:10) * 1000, 2000000, replace = TRUE))
sum(m[, 1])   # length of the expanded result
# [1] 16005804
ff <- function(x, y) rep(y, x)   # x copies of y
system.time(w <- do.call(c, mapply(ff, m[, 1], m[, 2])))
   user  system elapsed
   9.75    0.00    9.75
length(w)
# [1] 16005804
plyr::count(w)
       x    freq
1   1000 1603184
2   2000 1590599
3   3000 1596661
4   4000 1607112
5   5000 1598571
6   6000 1599195
7   7000 1600475
8   8000 1601718
9   9000 1598896
10 10000 1609393
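
For what it's worth, rep() is itself vectorized over its times=
argument, so the whole expansion should also collapse to a single call
with no loop at all; a minimal sketch (same m as above, untimed here,
though I would expect it to beat the mapply() version):

w2 <- rep(m[, 2], times = m[, 1])  # x[i] copies of y[i], in order
identical(w, w2)                   # should be TRUE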
HTH,
Dennis
PS: It would have been a good idea to keep the OP in the loop of this thread.
On Thu, Aug 18, 2011 at 12:46 AM, Timothy Bates
<timothy.c.bates at gmail.com> wrote:
> This takes a few seconds to do 1 million lines, and stays in explicit for-loop form:
>
> numberofSalaryBands = 1000000  # OP's data has 2000000
> x = sample(1:15, numberofSalaryBands, replace = TRUE)
> y = sample((1:10) * 1000, numberofSalaryBands, replace = TRUE)
> df = data.frame(x, y)
> finalN = sum(df$x)       # total length of the expanded result
> myVar = rep(NA, finalN)  # preallocate once, fill in place below
> outIndex = 1
> for (i in 1:numberofSalaryBands) {
>   kount = df$x[i]
>   myVar[outIndex:(outIndex + kount - 1)] = rep(df$y[i], kount)  # make x[i] copies of value y[i]
>   outIndex = outIndex + kount
> }
> head(myVar)
> plyr::count(myVar)
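>
> The reason the explicit loop above is still fast: myVar is allocated
> once up front, so each iteration just writes into an existing block of
> memory. Growing a vector with repeated append() calls, by contrast,
> copies the entire vector each time, which makes the total work
> quadratic in the final length.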
>
>
> On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:
>
>>
>>
>> Dear R community,
>>
>> I have a 2 million by 2 matrix that looks like this:
>>
>> x <- sample(1:15, 2000000, replace = TRUE)
>> y <- sample((1:10) * 1000, 2000000, replace = TRUE)
>>
>>       x     y
>> [1,] 10  4000
>> [2,]  3  1000
>> [3,]  3  4000
>> [4,]  8  6000
>> [5,]  2  9000
>> [6,]  3  8000
>> [7,]  2 10000
>> (...)
>>
>>
>> The first column is a population expansion factor for the number in the
>> second column (household income). I want to expand the second column
>> with the first so that I end up with a vector beginning with 10
>> observations of 4000, then 3 observations of 1000 and so on. In my mind
>> the natural approach would be to create a NULL vector and append the
>> expansions:
>>
>> myvar <- NULL
>> myvar <- append(myvar, replicate(x[1], y[1]), 1)
>>
>> for (i in 2:length(x)) {
>>   myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
>> }
>>
>> to end up with a vector of length sum(x), which in my real database
>> corresponds to 22 million observations.
>>
>> This works fine, but only if I run it on the first, say, 1000
>> observations. If I try to perform it on all 2 million observations
>> it takes far too long to be useful (I left it running for 11 hours
>> yesterday to no avail).
>>
>>
>> I know R performs well with operations on relatively large vectors. Why
>> is this so inefficient? And what would be the smart way to do this?
>>
>> Thanks in advance.
>> Alex
>>