[R] how to create data.frames from vectors with duplicates
Dennis Murphy
djmuser at gmail.com
Thu Sep 8 03:55:18 CEST 2011
Hi:
Here are a few informal timings on my machine with the following
example. The data.table package is worth investigating, particularly
in problems where its advantages can scale with size.
library(data.table)
dt <- data.table(x = sample(1:50, 1000000, replace = TRUE),
                 y = sample(letters[1:26], 1000000, replace = TRUE),
                 key = 'y')
system.time(dt[, list(count = sum(x)), by = 'y'])
   user  system elapsed
   0.02    0.00    0.02
# Data tables are also data frames, so we can use them as such:
system.time(with(dt, tapply(x, y, sum)))
   user  system elapsed
   0.39    0.00    0.39
system.time(with(dt, rowsum(x, y)))
   user  system elapsed
   0.04    0.00    0.03
system.time(aggregate(x ~ y, data = dt, FUN = sum))
   user  system elapsed
   1.87    0.00    1.87
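For reference, here is a minimal sketch of what the base R calls timed above return on the small vectors from the original question, and one way to turn the rowsum() result into the requested data frame. It uses base R only; the object names rs, ta, ag and xy are just illustrative choices.

```r
# Vectors from the original question
x <- c(1, 2, 3, 4, 5)
y <- c('a', 'b', 'c', 'a', 'c')

# rowsum() returns a one-column matrix; the group labels are its row names
rs <- rowsum(x, y)

# tapply() returns a named one-dimensional array of group sums
ta <- tapply(x, y, sum)

# aggregate() returns a data frame with a group column y and a sum column x
ag <- aggregate(x ~ y, data = data.frame(x = x, y = y), FUN = sum)

# One way to build the requested data frame: group labels as row names
# and a single column called 'count' (sums: a = 5, b = 2, c = 8)
xy <- data.frame(count = rs[, 1], row.names = rownames(rs))
print(xy)
```

All three calls give the same group sums; they differ only in the shape of the object returned, so which one is most convenient depends on what is done with the result afterwards.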
So rowsum() is good, but data.table is a little better for this task.
Increasing the size of the problem works to the advantage of both
data.table and rowsum(), as the larger example below shows: tapply()
takes relatively much longer (approx. 10x rowsum() in the first
example, 20x in the second), while the ratio of rowsum() to data.table
stays about the same (approx. 2x).
# 10M observations, 1000 groups
> dt <- data.table(x = sample(1:100, 10000000, replace = TRUE),
+                  y = sample(1:1000, 10000000, replace = TRUE),
+                  key = 'y')
> system.time(dt[, list(count = sum(x)), by = 'y'])
   user  system elapsed
   0.16    0.03    0.18
> system.time(with(dt, rowsum(x, y)))
   user  system elapsed
   0.36    0.04    0.40
> system.time(with(dt, tapply(x, y, sum)))
   user  system elapsed
   8.77    0.33    9.11
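The ratios quoted above can be checked from the 'user' times reported in this message. The sketch below just redoes that arithmetic; the vector names and the small ratio() helper are illustrative, and the numbers are copied from the system.time() output, not re-measured. Keep in mind that these are single runs at 0.01 s resolution, so the ratios for the smaller example are coarse.

```r
# 'user' times in seconds, copied from the system.time() output above
user_small <- c(data.table = 0.02, rowsum = 0.04, tapply = 0.39,
                aggregate = 1.87)                    # 1M rows, 26 groups
user_large <- c(data.table = 0.16, rowsum = 0.36,
                tapply = 8.77)                       # 10M rows, 1000 groups

# Hypothetical helper: how many times slower 'slow' is than 'fast'
ratio <- function(times, slow, fast) unname(times[slow] / times[fast])

tapply_vs_rowsum <- c(small = ratio(user_small, "tapply", "rowsum"),
                      large = ratio(user_large, "tapply", "rowsum"))
rowsum_vs_dt <- c(small = ratio(user_small, "rowsum", "data.table"),
                  large = ratio(user_large, "rowsum", "data.table"))

print(round(tapply_vs_rowsum, 2))  # about 9.75 and 24.36
print(round(rowsum_vs_dt, 2))      # about 2 and 2.25
```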
HTH,
Dennis
On Wed, Sep 7, 2011 at 6:18 PM, zhenjiang xu <zhenjiang.xu at gmail.com> wrote:
> Thanks for all your replies. I am using rowsum() and it looks efficient. I
> hope to run some benchmarks in the near future and let people know.
> Or are any benchmark results already available?
>
> On Wed, Aug 31, 2011 at 12:58 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>
>> Inline below:
>>
>> On Wed, Aug 31, 2011 at 9:50 AM, Jorge I Velez <jorgeivanvelez at gmail.com>
>> wrote:
>> > Hi Zhenjiang,
>> >
>> > Try
>> >
>> > table(unlist(mapply(function(x, y) rep(x, y), y, x)))
>>
>> Yikes! How about simply tapply(x,y,sum) ??
>> ?tapply
>>
>> -- Bert
>> >
>> > HTH,
>> > Jorge
>> >
>> >
>> > On Wed, Aug 31, 2011 at 12:45 PM, zhenjiang xu <> wrote:
>> >
>> >> Hi R users,
>> >>
>> >> suppose I have two vectors,
>> >> > x=c(1,2,3,4,5)
>> >> > y=c('a','b','c','a','c')
>> >> How can I get a data.frame like this?
>> >> > xy
>> >> count
>> >> a 5
>> >> b 2
>> >> c 8
>> >>
>> >> I know a few ways to accomplish the task. However, I have a huge number
>> >> of calculations of this kind, so I'd like an efficient solution. Thanks
>> >>
>> >> --
>> >> Best,
>> >> Zhenjiang
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >>
>>
>
>
>
> --
> Best,
> Zhenjiang
>