[R] summarizing a data frame i.e. count -> group by
jim holtman
jholtman at gmail.com
Mon Oct 24 01:31:04 CEST 2011
Another package to consider, especially if your dataframe is large, is
'data.table':
> tp <- read.table(textConnection(" time partitioning_mode workload runtime
+ 1 1 sharding query 607
+ 2 1 sharding query 85
+ 3 1 sharding query 52
+ 4 1 sharding query 79
+ 5 1 sharding query 77
+ 6 1 sharding query 67
+ 7 1 sharding query 98
+ 8 1 sharding refresh 2932
+ 9 1 sharding refresh 2870
+ 10 1 sharding refresh 2877
+ 11 1 sharding refresh 2868
+ 12 1 replication query 2891
+ 13 1 replication query 2907
+ 14 1 replication query 2922
+ 15 1 replication query 2937"), as.is = TRUE, header = TRUE)
> closeAllConnections()
>
> require(data.table)
Loading required package: data.table
data.table 1.7.1 For help type: help("data.table")
> tp <- data.table(tp)
> tp[
+ , list(workload = workload
+ , runtime = runtime
+ , thruput = length(runtime)
+ )
+ , by = list(time, partitioning_mode)
+ ]
time partitioning_mode workload runtime thruput
[1,] 1 sharding query 607 11
[2,] 1 sharding query 85 11
[3,] 1 sharding query 52 11
[4,] 1 sharding query 79 11
[5,] 1 sharding query 77 11
[6,] 1 sharding query 67 11
[7,] 1 sharding query 98 11
[8,] 1 sharding refresh 2932 11
[9,] 1 sharding refresh 2870 11
[10,] 1 sharding refresh 2877 11
[11,] 1 sharding refresh 2868 11
[12,] 1 replication query 2891 4
[13,] 1 replication query 2907 4
[14,] 1 replication query 2922 4
[15,] 1 replication query 2937 4
On Sun, Oct 23, 2011 at 1:29 PM, Giovanni Azua <bravegag at gmail.com> wrote:
> Hello,
>
> This is one problem at the time :)
>
> I have a data frame df that looks like this:
>
> time partitioning_mode workload runtime
> 1 1 sharding query 607
> 2 1 sharding query 85
> 3 1 sharding query 52
> 4 1 sharding query 79
> 5 1 sharding query 77
> 6 1 sharding query 67
> 7 1 sharding query 98
> 8 1 sharding refresh 2932
> 9 1 sharding refresh 2870
> 10 1 sharding refresh 2877
> 11 1 sharding refresh 2868
> 12 1 replication query 2891
> 13 1 replication query 2907
> 14 1 replication query 2922
> 15 1 replication query 2937
>
> and if I could use SQL ... omg! I really wish I could! I would do exactly this:
>
> insert into throughput
> select time, partitioning_mode, count(*)
> from data.frame
> group by time, partitioning_mode
>
> My attempted R versions are wrong and produce very cryptic error message:
>
>> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
>> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)
> Error in `[.default`(df2, u_id, , drop = FALSE) :
> incorrect number of dimensions
>
>>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)
> I cant comprehend what comes out from this one ... :(
>
> and I thought C++ template errors were the most cryptic ;P
>
> Many many thanks in advance,
> Best regards,
> Giovanni
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
More information about the R-help
mailing list