[R] summarizing a data frame i.e. count -> group by
David Winsemius
dwinsemius at comcast.net
Sun Oct 23 20:00:47 CEST 2011
On Oct 23, 2011, at 1:29 PM, Giovanni Azua wrote:
> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
>
> df <-read.table(textConnection(" time partitioning_mode workload
runtime
+ 1 1 sharding query 607
+ 2 1 sharding query 85
+ 3 1 sharding query 52
+ 4 1 sharding query 79
+ 5 1 sharding query 77
+ 6 1 sharding query 67
+ 7 1 sharding query 98
+ 8 1 sharding refresh 2932
+ 9 1 sharding refresh 2870
+ 10 1 sharding refresh 2877
+ 11 1 sharding refresh 2868
+ 12 1 replication query 2891
+ 13 1 replication query 2907
+ 14 1 replication query 2922
+ 15 1 replication query 2937"))
>
> df$throughput <- ave(df$time, list(df$time, df$partitioning_mode),
FUN=length)
> df
time partitioning_mode workload runtime throughput
1 1 sharding query 607 11
2 1 sharding query 85 11
3 1 sharding query 52 11
4 1 sharding query 79 11
5 1 sharding query 77 11
6 1 sharding query 67 11
7 1 sharding query 98 11
8 1 sharding refresh 2932 11
9 1 sharding refresh 2870 11
10 1 sharding refresh 2877 11
11 1 sharding refresh 2868 11
12 1 replication query 2891 4
13 1 replication query 2907 4
14 1 replication query 2922 4
15 1 replication query 2937 4
>
> and if I could use SQL ... omg! I really wish I could! I would do
> exactly this:
You can of, course use package sqldf, which would undoubtedly be good
practice for me, but this seemed like a typical situation for using
'ave'. You do need to use the FUN= construction in 'ave' because that
argument appears after the triple dots in the argument list.
>
> insert into throughput
> select time, partitioning_mode, count(*)
> from data.frame
> group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error
> message:
>
>> throughput <- aggregate(x=df[,c("time", "partitioning_mode")],
>> by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
>> throughput <- aggregate(x=df, by=list(df$time,df
>> $partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
>> throughput <- tapply(X=df$time, INDEX=list(df$time,df
>> $partitioning), FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list