[R] maintaining row connections during aggregate

Dennis Murphy djmuser at gmail.com
Tue Jun 14 00:44:58 CEST 2011


Hi:

Here are two ways to do it: one with ddply() from the plyr package and
another with the data.table package.

library(plyr)

# Toy data frame: four years of daily values (1960 is a leap year)
tsdf <- data.frame(year = rep(1960:1963, c(366, rep(365, 3))),
                   jday = c(1:366, rep(1:365, 3)),
                   y = rnorm(4*365 + 1))

# A function that returns the maximum response and the day on which it occurs.
# For use in ddply(), f() must take a data frame df and return a data frame.
f <- function(df) data.frame(max_day = df$jday[which.max(df$y)],
                             ymax = max(df$y))
ddply(tsdf, .(year), f)
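
If you would rather not load a package, the same split-apply-combine idea can be sketched in base R. This is only an illustration, not a replacement for the ddply() call: the names pieces and max_by_year are made up here, and set.seed() is added purely so the toy data are reproducible.

```r
# Toy data with the same layout as tsdf above (seed added for reproducibility)
set.seed(1)
tsdf <- data.frame(year = rep(1960:1963, c(366, rep(365, 3))),
                   jday = c(1:366, rep(1:365, 3)),
                   y = rnorm(4 * 365 + 1))

# Split the data frame by year, keep the single row holding each year's
# maximum of y, then stack those rows back into one data frame
pieces <- lapply(split(tsdf, tsdf$year), function(df) df[which.max(df$y), ])
max_by_year <- do.call(rbind, pieces)
rownames(max_by_year) <- NULL   # drop the year labels that rbind() adds
max_by_year
```

Because whole rows are kept, year, jday and y stay together without any extra bookkeeping. If a year's maximum occurs on more than one day, which.max() returns only the first such day.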

# In data.table, one can pass the core of f() in as a list instead:
library(data.table)
tsdt <- data.table(tsdf, key = 'year')
tsdt[, list(max_day = jday[which.max(y)], ymax = max(y)), by = 'year']
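
Since the original question asks for both the minimum and the maximum, here is a sketch that extends the same call to return both. The column names min_day and ymin and the object res are my own choices for illustration. The values are picked with y[which.max(y)] rather than max(y) so that value and day come from the same row; which.min() and which.max() skip NA values, so the result does not turn into NA when the real data have gaps.

```r
library(data.table)

# Toy data with the same layout as above (seed added for reproducibility)
set.seed(1)
tsdt <- data.table(year = rep(1960:1963, c(366, rep(365, 3))),
                   jday = c(1:366, rep(1:365, 3)),
                   y = rnorm(4 * 365 + 1),
                   key = "year")

# One row per year: the day and value of the minimum and of the maximum
res <- tsdt[, list(min_day = jday[which.min(y)], ymin = y[which.min(y)],
                   max_day = jday[which.max(y)], ymax = y[which.max(y)]),
            by = year]
res
```

With the real data, replace y by avg_m3s.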

If you intend to do a lot of data summarization, these two packages,
along with reshape2 and doBy, are worth getting to know.

HTH,
Dennis

On Mon, Jun 13, 2011 at 1:30 PM, Kara Przeczek <przeczek at unbc.ca> wrote:
> Dear All,
> I have several sets of data such as this:
>
>   year jday  avg_m3s
> 1 1960    1 4.262307
> 2 1960    2 4.242308
> 3 1960    3 4.216923
> 4 1960    4 4.185385
> 5 1960    5 4.151538
> 6 1960    6 4.133846
>  ...
>
> There is a value for each day of multiple years. In this particular data set it goes up to 1974. I am looking to obtain the minimum and maximum values for each year, and also to know on which Julian day ("jday") they occurred.
> I can get the maximum value for each year with:
>
>> mx = aggregate(ddat$avg_m3s, list(Year=ddat$year), max, na.rm=T)
>> colnames(mx) <- c("year","max_daily")
>
>   year max_daily
> 1 1960  60.24615
> 2 1961  73.90000
> 3 1962  56.40000
> ...
>
>
> But I want to output the max with the corresponding day on which it occurred, such as:
>   year jday  avg_m3s
> 1 1960  136 60.24615
> 2 1961  129 73.90000
> 3 1962  111 56.40000
>
>
> I haven't been able to determine how to keep those ties without aggregating by both year *and* day, which is what happened with:
> aggregate(ddat$avg_m3s, list(Year=ddat$year, Day = ddat$jday), max, na.rm=T),
> resulting in a value output for every single day of each year.
>
> Other attempts to get both columns to output failed.
>
> Any help would be greatly appreciated!
> Kara
>