[R] formatting a 6 million row data set; creating a censoring variable
Juliet Hannah
juliet.hannah at gmail.com
Wed Aug 31 19:51:29 CEST 2011
List,
Consider the following data.
gender mygroup id
1 F A 1
2 F B 2
3 F B 2
4 F B 2
5 F C 2
6 F C 2
7 F C 2
8 F D 2
9 F D 2
10 F D 2
11 F D 2
12 F D 2
13 F D 2
14 M A 3
15 M A 3
16 M A 3
17 M B 3
18 M B 3
19 M B 3
20 M A 4
Here is the reshaping I am seeking (explanation below).
id mygroup mytime censor
[1,] 1 A 1 1
[2,] 2 B 3 0
[3,] 2 C 3 0
[4,] 2 D 6 1
[5,] 3 A 3 0
[6,] 3 B 3 1
[7,] 4 A 1 1
I need to create 2 variables. The first one is a time variable.
Observe that for id=2, the variable mygroup=B was observed 3 times. In
the solution we see in row 2 that id=2 has a mytime variable of 3.
Next, I need to create a censoring variable.
Notice id=2 goes through has values of B, C, D for mygroup. This means
the change from B to C and C to D is observed. There is no change
from D. I need to indicate this with a 'censoring' variable. So B and
C would have values 0, and D would have a value of 1. As another
example, id=1 never changes, so I assign it censor= 1. Overall, if a
change is observed, 0 should be assigned, and if a change is not
observed 1 should be assigned.
One potential challenge is that the original data set has over 5
million rows. I have ideas, but I'm still getting used the the
data.table and plyr syntax. I also seek a base R solution. I'll post
the timings on the real data set shortly.
Thanks for your help.
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
# Here is a simplified data set
myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
"F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
"D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1",
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3",
"3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup",
"id"), class = "data.frame", row.names = c(NA, -20L))
# here is plyr solution with idata.frame
library(plyr)
imyData <- idata.frame(myData)
timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
mytime = length(mygroup)))
makeCensor <- function(x) {
myvec <- rep(0,length(x))
lastInd <- length(myvec)
myvec[lastInd] = 1
myvec
}
plyrSolution <- ddply(timeData, "id", transform, censor = makeCensor(mygroup))
# here is a data table solution
# use makeCensor function from above
library(data.table)
mydt <- data.table(myData)
setkey(mydt,id,mygroup)
timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
makeCensor <- function(x) {
myvec <- rep(0,length(x))
lastInd <- length(myvec)
myvec[lastInd] = 1
myvec
}
mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
datatableSolution <- cbind(timeData,mycensor[,list(censor)])
More information about the R-help
mailing list