[R] formatting a 6 million row data set; creating a censoring variable

Thu Sep 1 10:59:36 CEST 2011

This is the fastest data.table way I can think of :

ans = mydt[,list(mytime=.N),by=list(id,mygroup)]
ans[,censor:=0L]
ans[J(unique(id)), censor:=1L, mult="last"]
     id mygroup mytime censor
[1,]  1       A      1      1
[2,]  2       B      3      0
[3,]  2       C      3      0
[4,]  2       D      6      1
[5,]  3       A      3      0
[6,]  3       B      3      1
[7,]  4       A      1      1

>  I'll post the timings on the real data set shortly.
Please do.

Matthew

"William Dunlap" <wdunlap at tibco.com> wrote in message 
news:E66794E69CFDE04D9A70842786030B9304E857 at PA-MBX04.na.tibco.com...
> I'll assume that all of an individual's data rows
> are contiguous and that an individual always passes through
> the groups in order (or, least, the individual
> never leaves a group and then reenters it), so we
> can find everything we need to know by comparing each
> row with the previous row.
>
> You can use rle() to quickly make the time
> column:
>  > rle(paste(d$mygroup, d$id))$lengths
>  [1] 1 3 3 6 3 3 1
>
> For the censor column it is probably easiest to consider
> what rle() must do internally and use a modification of that.
> E.g.,
>  isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
>  isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>  outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
>  output <- d[outputRows, ]
>  output$mytime <- diff(c(0, which(outputRows)))
>  output$censor <- as.integer(isLastInRun(e$id))
> which gives you
>  > output
>     gender mygroup id mytimes censor
>  1       F       A  1       1      1
>  4       F       B  2       3      0
>  7       F       C  2       3      0
>  13      F       D  2       6      1
>  16      M       A  3       3      0
>  19      M       B  3       3      1
>  20      M       A  4       1      1
> You showed a rearrangment of the columns
>  > output[, c("id", "mygroup", "mytime", "censor")]
>     id mygroup mytime censor
>  1   1       A      1      1
>  4   2       B      3      0
>  7   2       C      3      0
>  13  2       D      6      1
>  16  3       A      3      0
>  19  3       B      3      1
>  20  4       A      1      1
> This ought to be quicker than plyr, but data.table
> may do similar run-oriented operations.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] 
>> On Behalf Of Juliet Hannah
>> Sent: Wednesday, August 31, 2011 10:51 AM
>> To: r-help at r-project.org
>> Subject: [R] formatting a 6 million row data set; creating a censoring 
>> variable
>>
>> List,
>>
>> Consider the following data.
>>
>>    gender mygroup id
>> 1       F       A  1
>> 2       F       B  2
>> 3       F       B  2
>> 4       F       B  2
>> 5       F       C  2
>> 6       F       C  2
>> 7       F       C  2
>> 8       F       D  2
>> 9       F       D  2
>> 10      F       D  2
>> 11      F       D  2
>> 12      F       D  2
>> 13      F       D  2
>> 14      M       A  3
>> 15      M       A  3
>> 16      M       A  3
>> 17      M       B  3
>> 18      M       B  3
>> 19      M       B  3
>> 20      M       A  4
>>
>> Here is the reshaping I am seeking (explanation below).
>>
>>      id mygroup mytime censor
>> [1,]  1       A      1      1
>> [2,]  2       B      3      0
>> [3,]  2       C      3      0
>> [4,]  2       D      6      1
>> [5,]  3       A      3      0
>> [6,]  3       B      3      1
>> [7,]  4       A      1      1
>>
>> I need to create 2 variables. The first one is a time variable.
>> Observe that for id=2, the variable mygroup=B was observed 3 times. In
>> the solution we see in row 2 that id=2 has a mytime variable of 3.
>>
>> Next, I need to create a censoring variable.
>>
>> Notice id=2 goes through has values of B, C, D for mygroup. This means
>> the change from B to C and C to D is observed.  There is no change
>> from D. I need to indicate this with a 'censoring' variable. So B and
>> C would have values 0, and D would have a value of 1. As another
>> example, id=1 never changes, so I assign it  censor= 1. Overall, if a
>> change is observed, 0 should be assigned, and if a change is not
>> observed 1 should be assigned.
>>
>> One potential challenge is that the original data set has over 5
>> million rows. I have ideas, but I'm still getting used the the
>> data.table and plyr syntax.  I also seek a base R solution. I'll post
>> the timings on the real data set shortly.
>>
>> Thanks for your help.
>>
>> > sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> # Here is a simplified data set
>>
>> myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
>> "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
>> ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
>> "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1",
>> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3",
>> "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup",
>> "id"), class = "data.frame", row.names = c(NA, -20L))
>>
>>
>> # here is plyr solution with  idata.frame
>>
>> library(plyr)
>> imyData <-  idata.frame(myData)
>> timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
>> mytime = length(mygroup)))
>>
>> makeCensor <- function(x) {
>>    myvec <- rep(0,length(x))
>>    lastInd <- length(myvec)
>>    myvec[lastInd] = 1
>>    myvec
>> }
>>
>>
>> plyrSolution <- ddply(timeData, "id", transform, censor = 
>> makeCensor(mygroup))
>>
>>
>> # here is a data table solution
>> # use makeCensor function from above
>>
>> library(data.table)
>> mydt <- data.table(myData)
>> setkey(mydt,id,mygroup)
>>
>> timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
>> makeCensor <- function(x) {
>>    myvec <- rep(0,length(x))
>>    lastInd <- length(myvec)
>>    myvec[lastInd] = 1
>>    myvec
>> }
>>
>> mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
>> datatableSolution <- cbind(timeData,mycensor[,list(censor)])
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>