[R] Aggregate and select mode

Mon Mar 15 22:32:11 CET 2010

This gives the first mode in each group:

# test data
xx <- structure(list(v1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1),
  v2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 3L, 3L, 5L, 1L),
    .Label = c("A", "B", "D", "W", "Z"), class = "factor")),
  .Names = c("v1", "v2"), row.names = c(NA, 10L), class = "data.frame")

# 1. first mode only in each group

Mode2 <- function(x) { tab <- table(x); names(tab)[which.max(tab)] }
aggregate(xx["v2"], xx["v1"], Mode2)

# 2. All modes in each group:

Mode3 <- function(x) { tab <- table(x); names(tab)[tab == max(tab)] }
do.call(rbind, by(xx, xx$v1, function(x) data.frame(v1 = x$v1[1], v2 =
Mode3(x$v2))))

# 3. Here is an SQL solution giving all modes in each group:

library(sqldf)
sqldf("select v1, v2 from xx a group by v1, v2 having count(*) >=
   (select max(k) from
      (select v1, v2, count(*) as k from xx b where b.v1 = a.v1 group
by v1, v2) s)")

The inner select gets the counts for all groups having the same v1 as
the current v1 and the select surrounding it takes the largest of
those.  You could also check if replacing the >= with = makes any
difference to the time.

You could also try the above query with PostgreSQL. Just issue the
library statement below and then repeat the sqldf statement just
given. sqldf checks whether RpgSQL is loaded and if it is then it
automatically uses PostgreSQL instead of sqlite. Note that the first
time you execute sqldf with PostgreSQL in a session it will load java
so disregard the timing of the first run.

# 4. PostgreSQL
library(RpgSQL)
# repeat sqldf statement above

On Mon, Mar 15, 2010 at 1:43 PM, Gabriel Yospin <yosping at gmail.com> wrote:
> Greetings Everyone -
>
> I have a data frame "x" that looks like this:
>
> v1   v2
> 1     A
> 1     B
> 1     B
> 2     B
> 2     W
> 2     W
> 3     D
> 3     D
> 3     Z
>
> What I would like to do is create a new data frame, "y", that has one row
> for each unique value of v1, and returns the corresponding mode of v2.  If I
> were to run it on the above data frame, it should therefore return:
>
> v1   v2
> 1     B
> 2     W
> 3     D
>
> I've been using the following code:
>
> x <- data.frame(v1 = c(1,1,1,2,2,2,3,3,3), v2 =
> c("A","B","B","B","W","W","D","D","Z"))
> y <- aggregate.data.frame(x, by = list(x$var1), FUN = "Mode")
>
> which relies on the Mode function from package prettyR.  The above code
> works for me.
>
> My problem comes when I use my real database.  Running this produces many
> warnings, because there are multiple modes of v2 for many values of v1.  My
> database is also rather large (~700,000 rows), and I'm wondering if there is
> a faster way to get R to process these data.
>
> Thank you for your help and consideration,
>
> Gabriel Yospin
> Center for Ecology and Evolutionary Biology
> University of Oregon
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>