[R] Reshaping a dataframe with conditional summary of columns - apply or reshape?
Dave Hewitt
dhewitt at vims.edu
Thu Jan 24 05:31:15 CET 2008
I need to do a somewhat complex rearrangement of an existing dataframe that
involves an "on-the-fly" conditional summary as well. I've tried to make
the example as simple as possible. Code to produce the object being acted
on is at the end of the message. I hope this isn't too much for a single
posting. Any guidance will be much appreciated.
The input example looks like this:
survey year count location var1 var2
1 1 2 A1 21.2 1
2 1 0 A1 15.6 1
3 1 29 A1 12.1 1
1 1 11 B3 9 2
3 1 5 B3 4.8 2
1 2 7 A2 20.1 4
2 2 2 A2 19.2 4
I need to rearrange it to look like this (the order of the columns is not
important):
location year survey1 survey2 survey3 var1summ var2
A1 1 1 0 2 16.3 1
B3 1 2 NA 1 6.9 2
A2 2 1 1 NA 19.7 4
The operations/conditions that I need are:
(A) The observations in "count" for each survey at each location in a given
year need to be reduced according to the following rules and then placed in
new columns "survey1", "survey2", "survey3" :
-- when count = 0, surveyX = 0
-- when 0 < count < 11, surveyX = 1
-- when 11 < count, surveyX = 2
(B) The observations in var1 need to be summarized (here, by taking the
mean) across all surveys for a given location in a given year and placed in
"var1summ"
(C) Observations in var2 need to be summarized in a similar way, except
that they don't change across surveys for a given location in a given year
(so mean, median, etc. would work, but a more general solution would be
nice for variables that are categorical)
(D) Surveys (rows) can be missing for a given location in a given year. I'd
like to retain a column for each survey in the new dataframe and denote
surveys that were missing as NA. E.g., survey2 is missing for location B3
in year 1.
** Object creation:
x <- structure(list(survey = c(1L, 2L, 3L, 1L, 3L, 1L, 2L), year = c(1L,
1L, 1L, 1L, 1L, 2L, 2L), count = c(2L, 0L, 29L, 11L, 5L, 7L, 2L), location
= structure(c(1L, 1L, 1L, 3L, 3L, 2L, 2L), .Label = c("A1", "A2", "B3"),
class = "factor"), var1 = c(21.2, 15.6, 12.1, 9, 4.8, 20.1, 19.2), var2 =
c(1L, 1L, 1L, 2L, 2L, 4L, 4L)), .Names = c("survey", "year", "count",
"location", "var1", "var2"), class = "data.frame", row.names = c(NA, -7L))
Thanks,
Dave Hewitt
Fisheries Science
Virginia Institute of Marine Science, Gloucester Point, VA, USA
More information about the R-help
mailing list