[R] summarizing a data frame i.e. count -> group by

Mon Oct 24 01:31:04 CEST 2011

Another package to consider, especially if your dataframe is large, is
'data.table':

> tp <- read.table(textConnection(" time partitioning_mode workload runtime
+ 1     1          sharding    query     607
+ 2     1          sharding    query      85
+ 3     1          sharding    query      52
+ 4     1          sharding    query      79
+ 5     1          sharding    query      77
+ 6     1          sharding    query      67
+ 7     1          sharding    query      98
+ 8     1          sharding  refresh    2932
+ 9     1          sharding  refresh    2870
+ 10    1          sharding  refresh    2877
+ 11    1          sharding  refresh    2868
+ 12    1       replication    query    2891
+ 13    1       replication    query    2907
+ 14    1       replication    query    2922
+ 15    1       replication    query    2937"), as.is = TRUE, header = TRUE)
> closeAllConnections()
>
> require(data.table)
Loading required package: data.table
data.table 1.7.1  For help type: help("data.table")
> tp <- data.table(tp)
> tp[
+     , list(workload = workload
+         , runtime = runtime
+         , thruput = length(runtime)
+         )
+     , by = list(time, partitioning_mode)
+     ]
      time partitioning_mode workload runtime thruput
 [1,]    1          sharding    query     607      11
 [2,]    1          sharding    query      85      11
 [3,]    1          sharding    query      52      11
 [4,]    1          sharding    query      79      11
 [5,]    1          sharding    query      77      11
 [6,]    1          sharding    query      67      11
 [7,]    1          sharding    query      98      11
 [8,]    1          sharding  refresh    2932      11
 [9,]    1          sharding  refresh    2870      11
[10,]    1          sharding  refresh    2877      11
[11,]    1          sharding  refresh    2868      11
[12,]    1       replication    query    2891       4
[13,]    1       replication    query    2907       4
[14,]    1       replication    query    2922       4
[15,]    1       replication    query    2937       4

On Sun, Oct 23, 2011 at 1:29 PM, Giovanni Azua <bravegag at gmail.com> wrote:
> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
>
>  time partitioning_mode workload runtime
> 1     1          sharding    query     607
> 2     1          sharding    query      85
> 3     1          sharding    query      52
> 4     1          sharding    query      79
> 5     1          sharding    query      77
> 6     1          sharding    query      67
> 7     1          sharding    query      98
> 8     1          sharding  refresh    2932
> 9     1          sharding  refresh    2870
> 10    1          sharding  refresh    2877
> 11    1          sharding  refresh    2868
> 12    1       replication    query    2891
> 13    1       replication    query    2907
> 14    1       replication    query    2922
> 15    1       replication    query    2937
>
> and if I could use SQL ... omg! I really wish I could! I would do exactly this:
>
> insert into throughput
>  select time, partitioning_mode, count(*)
>  from data.frame
>  group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error message:
>
>> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
>  incorrect number of dimensions
>
>> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
>  incorrect number of dimensions
>
>>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?