[R] reshaping a dataset
Denis Chabot
chabotd at globetrotter.net
Wed Sep 13 05:44:56 CEST 2006
Hi,
I'm trying to move to R the last few data handling routines I was
performing in SAS.
I'm working on stomach content data. In the simplified example I
provide below, there are variables describing the origin of each prey
item (nbpc is a ship number, each ship may have been used on
different trips, each trip has stations, and individual fish (tagno)
can be caught at each station.
For each stomach the number of lines corresponds to the number of
prey items. Thus a variable identifies prey type, and others (here
only one, mass) provide information on prey abundance or size or
digestion level.
Finally, there can be accompanying variables that are not used but
that I need to keep for later analyses (e.g. depth in the example
below).
At some point I need to transform such a dataset into another format
where each stomach occupies a single line, and there are columns for
each prey item.
The "reshape" function works really well, my program is in fact
simpler than the SAS equivalent (not shown, don't want to bore you,
but available on request), except that I need zeros when prey types
are absent from a stomach instead of NAs, a problem for which I only
have a shaky solution at the moment:
1) creation of a dummy dataset:
#######
nbpc <- rep(c(20,34), c(110,90))
trip <- c(rep(1:3, c(40, 40, 30)), rep(1:2, c(60,30)))
set <- c(rep(1:4, c(10, 8, 7, 15)), rep(c(10,12), c(25,15)), rep(1:3,
rep(10,3)),
rep(10:12, c(20, 10, 30)), rep(7:8, rep(15,2)))
depth <- c(rep(c(100, 150, 200, 250), c(10, 8, 7, 15)), rep(c
(100,120), c(25,15)), rep(c(75, 50, 200), rep(10,3)),
rep(c(200, 150, 50), c(20, 10, 30)), rep(c(100, 250), rep
(15,2)))
tagno <- rep(round(runif(42,1,200)),
c(7,3, 4,4, 2,2,3, 5,5,5, 4,6,4,3,5,3, 7,8, 4,6, 5,5,
7,3,
6,6,4,4, 4,6, 3,3,4,5,5,6,4, 5,5,5, 8,7))
prey.codes <-c(187, 438, 792, 811)
prey <- sample(prey.codes, 200, replace=T)
mass <- runif(200, 0, 10)
test <- data.frame(nbpc, trip, set, depth, tagno, prey, mass)
########
Because there are often multiple occurrences of the same prey in a
single stomach, I need to sum them for each stomach before using
"reshape". Here I use summarizeBy because my understanding of the
many variants of "apply" is not very good:
########
test2 <- summaryBy(mass~nbpc+trip+set+tagno+prey, data=test, FUN=sum,
keep.names=T, id=~depth)
#this messes up sorting order, I fix it
k <- order(test2$nbpc, test2$trip, test2$set, test2$tagno)
test3 <- test2[k,]
result <- reshape(test3, v.names="mass", idvar=c("nbpc", "trip",
"set", "tagno"),
timevar="prey", direction="wide")
#########
I'm quite happy with this, although you may know of better ways of
doing it.
But my problem is with preys that are absent from a stomach. In later
analyses, I need them to have zero abundance instead of NA.
My shaky solution is:
#########
empties <- is.na(result)
result[empties] <- 0
#########
which did the job in this example, but it won't always. For instance
there could have been NAs for "depth", which I do not want to become
zero.
Is there a way to transform NAs into zeros for multiple columns of a
dataframe in one step, while ignoring some columns?
Or maybe there is another way to achieve this that would have put
zeros where I need them (i.e. something else than "reshape")?
Thanking you in advance,
Denis Chabot
More information about the R-help
mailing list