[R] formatting a 6 million row data set; creating a censoring variable

Wed Aug 31 19:51:29 CEST 2011

List,

Consider the following data.

   gender mygroup id
1       F       A  1
2       F       B  2
3       F       B  2
4       F       B  2
5       F       C  2
6       F       C  2
7       F       C  2
8       F       D  2
9       F       D  2
10      F       D  2
11      F       D  2
12      F       D  2
13      F       D  2
14      M       A  3
15      M       A  3
16      M       A  3
17      M       B  3
18      M       B  3
19      M       B  3
20      M       A  4

Here is the reshaping I am seeking (explanation below).

     id mygroup mytime censor
[1,]  1       A      1      1
[2,]  2       B      3      0
[3,]  2       C      3      0
[4,]  2       D      6      1
[5,]  3       A      3      0
[6,]  3       B      3      1
[7,]  4       A      1      1

I need to create 2 variables. The first one is a time variable.
Observe that for id=2, the variable mygroup=B was observed 3 times. In
the solution we see in row 2 that id=2 has a mytime variable of 3.

Next, I need to create a censoring variable.

Notice id=2 goes through has values of B, C, D for mygroup. This means
the change from B to C and C to D is observed.  There is no change
from D. I need to indicate this with a 'censoring' variable. So B and
C would have values 0, and D would have a value of 1. As another
example, id=1 never changes, so I assign it  censor= 1. Overall, if a
change is observed, 0 should be assigned, and if a change is not
observed 1 should be assigned.

One potential challenge is that the original data set has over 5
million rows. I have ideas, but I'm still getting used the the
data.table and plyr syntax.  I also seek a base R solution. I'll post
the timings on the real data set shortly.

Thanks for your help.

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

# Here is a simplified data set

myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
"F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
"D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3",
"3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup",
"id"), class = "data.frame", row.names = c(NA, -20L))

# here is plyr solution with  idata.frame

library(plyr)
imyData <-  idata.frame(myData)
timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
mytime = length(mygroup)))

makeCensor <- function(x) {
   myvec <- rep(0,length(x))
   lastInd <- length(myvec)
   myvec[lastInd] = 1
   myvec
}

plyrSolution <- ddply(timeData, "id", transform, censor = makeCensor(mygroup))

# here is a data table solution
# use makeCensor function from above

library(data.table)
mydt <- data.table(myData)
setkey(mydt,id,mygroup)

timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
makeCensor <- function(x) {
   myvec <- rep(0,length(x))
   lastInd <- length(myvec)
   myvec[lastInd] = 1
   myvec
}

mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
datatableSolution <- cbind(timeData,mycensor[,list(censor)])