[R] Aggregating data

Dennis Murphy djmuser at gmail.com
Tue Aug 9 07:37:26 CEST 2011


Hi:

David evidently left the ddply() part to me :)

Here's one way to summarize the data and get a plot in ggplot2.
Firstly, thank you for the dput(); you score extra points for that :)
I put that output in an object named results.

## Step 1: Summarize the data in plyr

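library('ggplot2')   # at the time of writing this also loads plyr and reshape
library('plyr')      # ddply(); load explicitly in case ggplot2 does not attach it
library('reshape')   # melt(); likewise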

# (1a)  Compute the group medians; this is your 'table'
#          summarise simply returns the summaries and
#          grouping variable
resumm <- ddply(results, .(date), summarise,
                    C_lo = median(C_lo, na.rm = TRUE),
                    C_hi = median(C_hi, na.rm = TRUE))

# (1b) Add the medians to the existing data frame
#         This is the purpose of transform (as a
#         substitute for summarise in the call)
resumm2 <- ddply(results, .(date), transform,
                    C_lo = median(C_lo, na.rm = TRUE),
                    C_hi = median(C_hi, na.rm = TRUE))
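
For comparison, here is a minimal base-R sketch of the same two steps. It
is not part of the plyr solution above, only an illustration of what
summarise and transform do per group. It assumes just the C_lo, C_hi and
date columns from the dput() quoted below; the object names res, med and
res2 are made up for this example.

```r
# Rebuild only the columns needed here, copied from the dput() in the
# original question (house, hour and press are omitted).
res <- data.frame(
  C_lo = c(0.00392581816943354, 0.00901222644518829, 0.00484396253385175,
           0.00822377400482716, 0.00780070460187192, 0.00952688235337435),
  C_hi = c(0.00697755827622381, 0.0123301031600017, 0.0113207627868435,
           0.0112887993422598, 0.018567245397701, 0.0195253894885054),
  date = c(719, 1027, 1027, 1027, 1030, 1030)
)

# Analogue of summarise: one row per date holding the group medians.
med <- aggregate(res[c('C_lo', 'C_hi')], by = list(date = res$date),
                 FUN = median, na.rm = TRUE)

# Analogue of transform: keep all six rows and overwrite each value
# with the median of its date group.
grp_median <- function(x) median(x, na.rm = TRUE)
res2 <- res
res2$C_lo <- ave(res$C_lo, res$date, FUN = grp_median)
res2$C_hi <- ave(res$C_hi, res$date, FUN = grp_median)
```

med has one row per date (three rows), whereas res2 keeps the six
original rows with the medians repeated within each date.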

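## Aside: It's not a coincidence that the median variables
## in resumm and resumm2 have the same names as the raw
## variables in results. This is by design: after melting,
## the raw and summary data share the same levels of
## 'variable', so that we can generate a 'nice' legend below.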

# Melt the raw data and the summary data frame so that
# the lo/hi variables become merged into a factor
# ('variable') with a corresponding 'value' variable
resmelt <- melt(results, measure.vars = c('C_lo', 'C_hi'))
resumMelt <- melt(resumm, id.vars = 'date')
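
To see what the melted ('long') form looks like, here is a rough base-R
sketch that builds it by hand. It only illustrates the shape of the
result and is not how melt() is implemented; the names res and long are
made up, and only the C_lo, C_hi and date columns from the dput() quoted
below are used (melt() on the full data frame would also carry house,
hour and press along as id columns).

```r
# Columns copied from the dput() in the original question.
res <- data.frame(
  C_lo = c(0.00392581816943354, 0.00901222644518829, 0.00484396253385175,
           0.00822377400482716, 0.00780070460187192, 0.00952688235337435),
  C_hi = c(0.00697755827622381, 0.0123301031600017, 0.0113207627868435,
           0.0112887993422598, 0.018567245397701, 0.0195253894885054),
  date = c(719, 1027, 1027, 1027, 1030, 1030)
)

# Stack the two measured columns: 'variable' records which column each
# value came from, 'value' holds the number itself.
long <- data.frame(
  date     = rep(res$date, times = 2),
  variable = factor(rep(c('C_lo', 'C_hi'), each = nrow(res)),
                    levels = c('C_lo', 'C_hi')),
  value    = c(res$C_lo, res$C_hi)
)
```

Each original row now contributes two rows, one per level of variable,
which is the layout the colour = variable mapping in the plots relies on.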

## Two plots are now generated. The only real difference
## between them is that the former treats date as
## numeric and the latter treats it as a factor. The lines are
## plotted first so that the points are not obscured.

ggplot(data = resmelt, aes(x = date)) +
   geom_line(data = resumMelt, aes(y = value, group = variable,
                                   colour = variable), size = 1) +
   geom_point(aes(y = value, colour = variable), size = 2.5) +
   labs(x = 'Date', y = 'C', colour = 'Level') +
   scale_colour_manual(breaks = c('C_lo', 'C_hi'),
                       values = c('blue', 'red'),
                       labels = c('Low', 'High'))

ggplot(data = resmelt, aes(x = factor(date))) +
   geom_line(data = resumMelt, aes(x = factor(date), y = value,
             group = variable, colour = variable), size = 1) +
   geom_point(aes(y = value, colour = variable), size = 2.5) +
   labs(x = 'Date', y = 'C', colour = 'Level') +
   scale_colour_manual(breaks = levels(resmelt$variable),
                       values = c('blue', 'red'),
                       labels = c('Low', 'High'))

The manual scale gives one the option to define one's own set of
colors rather than the defaults supplied by ggplot2. In this case I
chose to reset the legend labels, but if C_lo and C_hi are what you
want, remove the two lines with labels = ...

HTH,
Dennis


On Mon, Aug 8, 2011 at 4:51 PM, Jeffrey Joh <johjeffrey at hotmail.com> wrote:
>
> Here is a sample of what I'm trying to do:
>
> structure(list(C_lo = c(0.00392581816943354, 0.00901222644518829,
> 0.00484396253385175, 0.00822377400482716, 0.00780070460187192,
> 0.00952688235337435), C_hi = c(0.00697755827622381, 0.0123301031600017,
> 0.0113207627868435, 0.0112887993422598, 0.018567245397701, 0.0195253894885054
> ),  house = c(1, 1, 1, 1, 1, 1), date = c(719, 1027, 1027,
>    1027, 1030, 1030), hour = c(18, 8, 8, 8, 11, 11),  .Names = c("1000", "10000",
>    "10001", "10002", "10003", "10004"),  press = structure(c(1L,
>    1L, 1L, 1L, 1L, 1L), .Names = c("1000", "10000",
>    "10001", "10002", "10003", "10004"), .Label = c("DEPR",
>    "PRESS"), class = "factor")), .Names = c("C_lo", "C_hi",
> "house", "date", "hour", "number", "press"
> ), class = "data.frame", row.names = c("1000", "10000",
> "10001", "10002", "10003", "10004"))
>
>
>
> I'd like to aggregate the data by the date.  I'd like to have a table with the median C_lo and C_hi values grouped by date.
> I'd also like to plot these points with date on the x-axis, C on y-axis, and lines going through these medians.
>
>
>
> For plyr, would it be something like: ddply(results, .(date),median, na.rm=T)
>
>
>
> I tried making a for loop to get the medians, but that doesn't work either.
> splitresults = split (results, results$date, drop=T)
> mediann <- matrix (,seq_along(splitresults),2)
> for (i in seq_along(splitresults)) {
> piece <- splitresults[[i]]
> mediann [i,1] <- unique(piece$date)
> mediann [i,2] <- median (piece$n, na.rm=T)
> }
>
>
>
> Jeff
>
>
>
> ----------------------------------------
>> Date: Fri, 5 Aug 2011 11:59:37 -0700
>> Subject: Re: [R] Aggregating data
>> From: djmuser at gmail.com
>> To: johjeffrey at hotmail.com
>> CC: r-help at r-project.org
>>
>> Hi:
>>
>> This is the type of problem at which the plyr package excels. Write a
>> utility function that produces the plot you want using a data frame as
>> its input argument, and then do something like
>>
>> library('plyr')
>> d_ply(results, .(a, b, c), plotfun)
>>
>> where plotfun is a placeholder for the name of your plot
>> function. The d in d_ply means to take a data frame as input and _
>> means return nothing. This is used in particular when a side effect,
>> such as a plot, is the desired 'output'. See
>> http://www.jstatsoft.org/v40/i01, which contains an example (baseball)
>> where groupwise plots are produced. (Don't actually run the example
>> unless you're willing to wait for 1100+ ggplots to be rendered :)
>>
>> If memory serves, you should also be able to produce graphics for each
>> data subset using the data.table package.
>>
>> If you want a more concrete solution, provide a more concrete example.
>>
>> HTH,
>> Dennis
>>
>> On Fri, Aug 5, 2011 at 9:55 AM, Jeffrey Joh <johjeffrey at hotmail.com> wrote:
>> >
>> >
>> > I aggregated my data: aggresults <-aggregate(results, by=list(results$a, results$b, results$c), FUN=mean, na.rm=TRUE)
>> >
>> >
>> >
>> > results has about 8000 lines of data, and aggresults has about 80 lines. I would like to create a separate variable for each of the 80 aggregates, each containing the 100 lines that were aggregated. I would also like to create plots for each of those 80 datasets.
>> >
>> >
>> >
>> > Is there a way of automating this, so that I don't have to do each of the 80 aggregates individually?
>> >
>> >
>> >
>> > Jeff
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >


