[R] help using tapply
Gabor Grothendieck
ggrothendieck at gmail.com
Wed Apr 26 20:28:50 CEST 2006
On 4/26/06, Dimitri Szerman <dimitrijoe at ipea.gov.br> wrote:
> Dear R-mates,
>
> # Here's what I am trying to do. I have a dataset like this:
>
> id = c(rep(1,8), rep(2,8))
> dur1 <- c( 17,18,19,18,24,19,24,24 )
> est1 <- c( rep(1,5), rep(2,3) )
> dur2 <- c(1,1,3,4,8,12,13,14)
> est2 <- rep(1,8)
>
> mydata = data.frame(id,
> estat=c(est1, est2),
> durat=c(dur1, dur2))
>
>
> # I want to have this:
>
> id = c(rep(1,8), rep(2,8))
> dur1 <- c( 17,18,19,20,28,1,2,3 )
> est1 <- c( rep(1,5), rep(2,3) )
> dur2 <- c(1,2,3,4,12,13,14,15)
> est2 <- rep(1,8)
>
> mydata2 = data.frame(id,
> estat=c(est1, est2),
> durat=c(dur1, dur2))
>
>
> # What is happening here? I have a longitudinal dataset.
> # Individuals are observed 8 times, and each time each of them is in a
> certain state J (here, J={1,2}).
> # Each observation is one unit of time away from the following one, except
> observations 4 and 5, which are 8 units of time away from each other.
> # So here we have individual 1 migrating from state 1 to state 2 at
> observation #6,
> # while individual 2 stays in state 1 as long as we can observe her.
> # I am interested in the spell (duration) of each state.
> # However, the durations are clearly mismeasured, and now I am trying to
> give some consistency to the data.
> # I am assuming that the first duration is correct. Departing from this, I
> wrote the following function:
>
> d <- function(dur,est)
> {
> if ( sum( diff(est) )==0 ) # for those who didn't change state
> {
> for( i in c(2:4))
> dur[i] <- dur[i-1] + 1
>
> dur[5] <- dur[4] + 8
>
> for( i in c(6:8) )
> dur[i] <- dur[i-1] + 1
> }
> if ( sum( diff(est) )!=0 ) # for those who changed state
> {
> j = which(diff(est)!=0) + 1 # j is when the change occurred
> dur[j] = 1
>
> k0 = which( c(1:8) < j )[-c(1)]
> k1 = which( c(1:8) > j )
> if(length(j) > 1)
> {
> for( i in 1:(length(j)-1) )
> k2 = c(1:8)[c(1:8)> j[i] & c(1:8)< j[i+1]]
> k = unique( c(k0,k1,k2) )
> }
> k = unique( c(k0,k1) )
> k = k[!k%in%j]
> if(5%in%k)
> {
> k = k[k != 5]
> for(i in k[k<5])
> dur[i] = dur[i-1] + 1
>
> dur[5] = dur[4] + 8
>
> for(i in k[k>5])
> dur[i] = dur[i-1] + 1
> } else
> {
> for(i in k)
> dur[i] = dur[i-1] + 1
> }
> }
> dur
>
> }
>
> # Now, if I do
>
> d(dur1, est1)
> # and
> d(dur2,est2)
> # I get what I want, except for the fact that I couldn't do this for a
> large dataset.
> # So I decided to use tapply. But this gives me
>
> new.durat <- tapply(mydata$durat, IND=mydata$id, FUN=d,
> est=mydata$estat)
> mydata$new.durat <- unlist(new.durat)
>
> > mydata
> id estat durat new.durat
> 1 1 1 17 17
> 2 1 1 18 18
> 3 1 1 19 19
> 4 1 1 18 20
> 5 1 1 24 28
> 6 1 2 19 29
> 7 1 2 24 30
> 8 1 2 24 31
> 9 2 1 1 1
> 10 2 1 1 2
> 11 2 1 3 3
> 12 2 1 4 4
> 13 2 1 8 12
> 14 2 1 12 13
> 15 2 1 13 14
> 16 2 1 14 15
>
> # which is not what I want. I can't figure out why, but when I use tapply,
> # the logical expression "sum( diff(est) )==0" turns out to be true for both
> individuals
> # (whereas we know this is true only for individual #2).
> # I am sorry for the long message. I will be very grateful for any help with
> this problem.
I didn't try to read all this carefully but I think you want to tapply
over the indices so you can use them in both columns:
with(mydata,
unlist(tapply(seq_along(id), id, function(i) d(durat[i], estat[i])))
)
or use by:
unlist(by(mydata, mydata$id, function(x) d(x$durat, x$estat)))
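
As for why the test came out true for both individuals: a likely explanation is that tapply only splits its first argument by id, while extra arguments such as est=mydata$estat are handed to the function whole. The sketch below rebuilds the data from the post and checks this. The names full_diff, whole_column_unchanged, per_id_unchanged and the helper no_change() are made up for illustration and do not appear in the original code.

```r
# Data as given in the original post
id   <- c(rep(1, 8), rep(2, 8))
est1 <- c(rep(1, 5), rep(2, 3))
est2 <- rep(1, 8)
dur1 <- c(17, 18, 19, 18, 24, 19, 24, 24)
dur2 <- c(1, 1, 3, 4, 8, 12, 13, 14)
mydata <- data.frame(id, estat = c(est1, est2), durat = c(dur1, dur2))

# Extra arguments to tapply() reach FUN unsplit, so inside d() the argument
# est is the whole 16-element estat column for every id.
full_diff <- diff(mydata$estat)

# The +1 step (individual 1 moving to state 2) and the -1 step at the
# boundary between the two individuals cancel, so the sum is zero.
whole_column_unchanged <- sum(full_diff) == 0

# Hypothetical helper (not in the original post): the same test that d()
# applies, isolated so it can be run on one individual's states.
no_change <- function(est) sum(diff(est)) == 0

# Splitting the row indices by id lets both columns be subset consistently.
per_id_unchanged <- with(mydata,
  tapply(seq_along(id), id, function(i) no_change(estat[i])))

print(whole_column_unchanged)  # TRUE: the test passes on the full column
print(per_id_unchanged)        # FALSE for id 1, TRUE for id 2
```

Note also that sum(diff(est)) is just the last value minus the first, so it would also be zero for someone who leaves a state and later returns to it. If that case can occur in the real data, any(diff(est) != 0) may be a safer way to detect a change.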