[R] reshaping a dataset

Denis Chabot chabotd at globetrotter.net
Wed Sep 13 05:44:56 CEST 2006


Hi,

I'm trying to move to R the last few data handling routines I was  
performing in SAS.

I'm working on stomach content data. In the simplified example I  
provide below, there are variables describing the origin of each prey  
item (nbpc is a ship number, each ship may have been used on  
different trips, each trip has stations, and individual fish (tagno)  
can be caught at each station.

For each stomach the number of lines corresponds to the number of  
prey items. Thus a variable identifies prey type, and others (here  
only one, mass) provide information on prey abundance or size or  
digestion level.

Finally, there can be accompanying variables that are not used but  
that I need to keep for later analyses (e.g. depth in the example  
below).

At some point I need to transform such a dataset into another format  
where each stomach occupies a single line, and there are columns for  
each prey item.

The "reshape" function works really well, my program is in fact  
simpler than the SAS equivalent (not shown, don't want to bore you,  
but available on request), except that I need zeros when prey types  
are absent from a stomach instead of NAs, a problem for which I only  
have a shaky solution at the moment:

1) creation of a dummy dataset:
#######
nbpc <- rep(c(20,34), c(110,90))
trip <- c(rep(1:3, c(40, 40, 30)), rep(1:2, c(60,30)))
set <- c(rep(1:4, c(10, 8, 7, 15)), rep(c(10,12), c(25,15)), rep(1:3,  
rep(10,3)),
          rep(10:12, c(20, 10, 30)), rep(7:8, rep(15,2)))
depth <- c(rep(c(100, 150, 200, 250), c(10, 8, 7, 15)), rep(c 
(100,120), c(25,15)), rep(c(75, 50, 200), rep(10,3)),
          rep(c(200, 150, 50), c(20, 10, 30)), rep(c(100, 250), rep 
(15,2)))
tagno <- rep(round(runif(42,1,200)),
              c(7,3, 4,4, 2,2,3, 5,5,5,  4,6,4,3,5,3, 7,8, 4,6, 5,5,  
7,3,
                6,6,4,4, 4,6, 3,3,4,5,5,6,4, 5,5,5, 8,7))
prey.codes <-c(187, 438, 792, 811)
prey <- sample(prey.codes, 200, replace=T)
mass <- runif(200, 0, 10)

test <- data.frame(nbpc, trip, set, depth, tagno, prey, mass)
########

Because there are often multiple occurrences of the same prey in a  
single stomach, I need to sum them for each stomach before using  
"reshape". Here I use summarizeBy because my understanding of the  
many variants of "apply" is not very good:

########
test2 <- summaryBy(mass~nbpc+trip+set+tagno+prey, data=test, FUN=sum,  
keep.names=T, id=~depth)

#this messes up sorting order, I fix it
k <- order(test2$nbpc, test2$trip, test2$set, test2$tagno)
test3 <- test2[k,]
result <- reshape(test3, v.names="mass", idvar=c("nbpc", "trip",  
"set", "tagno"),
                 timevar="prey", direction="wide")
#########

I'm quite happy with this, although you may know of better ways of  
doing it.
But my problem is with preys that are absent from a stomach. In later  
analyses, I need them to have zero abundance instead of NA.
My shaky solution is:
#########
empties <- is.na(result)
result[empties] <- 0
#########

which did the job in this example, but it won't always. For instance  
there could have been NAs for "depth", which I do not want to become  
zero.

Is there a way to transform NAs into zeros for multiple columns of a  
dataframe in one step, while ignoring some columns?

Or maybe there is another way to achieve this that would have put  
zeros where I need them (i.e. something else than "reshape")?

Thanking you in advance,

Denis Chabot



More information about the R-help mailing list