[R] formatting a 6 million row data set; creating a censoring variable
Matthew Dowle
mdowle at mdowle.plus.com
Thu Sep 1 10:59:36 CEST 2011
This is the fastest data.table way I can think of :
ans = mydt[,list(mytime=.N),by=list(id,mygroup)]
ans[,censor:=0L]
ans[J(unique(id)), censor:=1L, mult="last"]
id mygroup mytime censor
[1,] 1 A 1 1
[2,] 2 B 3 0
[3,] 2 C 3 0
[4,] 2 D 6 1
[5,] 3 A 3 0
[6,] 3 B 3 1
[7,] 4 A 1 1
> I'll post the timings on the real data set shortly.
Please do.
Matthew
"William Dunlap" <wdunlap at tibco.com> wrote in message
news:E66794E69CFDE04D9A70842786030B9304E857 at PA-MBX04.na.tibco.com...
> I'll assume that all of an individual's data rows
> are contiguous and that an individual always passes through
> the groups in order (or, least, the individual
> never leaves a group and then reenters it), so we
> can find everything we need to know by comparing each
> row with the previous row.
>
> You can use rle() to quickly make the time
> column:
> > rle(paste(d$mygroup, d$id))$lengths
> [1] 1 3 3 6 3 3 1
>
> For the censor column it is probably easiest to consider
> what rle() must do internally and use a modification of that.
> E.g.,
> isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
> isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
> outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
> output <- d[outputRows, ]
> output$mytime <- diff(c(0, which(outputRows)))
> output$censor <- as.integer(isLastInRun(e$id))
> which gives you
> > output
> gender mygroup id mytimes censor
> 1 F A 1 1 1
> 4 F B 2 3 0
> 7 F C 2 3 0
> 13 F D 2 6 1
> 16 M A 3 3 0
> 19 M B 3 3 1
> 20 M A 4 1 1
> You showed a rearrangment of the columns
> > output[, c("id", "mygroup", "mytime", "censor")]
> id mygroup mytime censor
> 1 1 A 1 1
> 4 2 B 3 0
> 7 2 C 3 0
> 13 2 D 6 1
> 16 3 A 3 0
> 19 3 B 3 1
> 20 4 A 1 1
> This ought to be quicker than plyr, but data.table
> may do similar run-oriented operations.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Juliet Hannah
>> Sent: Wednesday, August 31, 2011 10:51 AM
>> To: r-help at r-project.org
>> Subject: [R] formatting a 6 million row data set; creating a censoring
>> variable
>>
>> List,
>>
>> Consider the following data.
>>
>> gender mygroup id
>> 1 F A 1
>> 2 F B 2
>> 3 F B 2
>> 4 F B 2
>> 5 F C 2
>> 6 F C 2
>> 7 F C 2
>> 8 F D 2
>> 9 F D 2
>> 10 F D 2
>> 11 F D 2
>> 12 F D 2
>> 13 F D 2
>> 14 M A 3
>> 15 M A 3
>> 16 M A 3
>> 17 M B 3
>> 18 M B 3
>> 19 M B 3
>> 20 M A 4
>>
>> Here is the reshaping I am seeking (explanation below).
>>
>> id mygroup mytime censor
>> [1,] 1 A 1 1
>> [2,] 2 B 3 0
>> [3,] 2 C 3 0
>> [4,] 2 D 6 1
>> [5,] 3 A 3 0
>> [6,] 3 B 3 1
>> [7,] 4 A 1 1
>>
>> I need to create 2 variables. The first one is a time variable.
>> Observe that for id=2, the variable mygroup=B was observed 3 times. In
>> the solution we see in row 2 that id=2 has a mytime variable of 3.
>>
>> Next, I need to create a censoring variable.
>>
>> Notice id=2 goes through has values of B, C, D for mygroup. This means
>> the change from B to C and C to D is observed. There is no change
>> from D. I need to indicate this with a 'censoring' variable. So B and
>> C would have values 0, and D would have a value of 1. As another
>> example, id=1 never changes, so I assign it censor= 1. Overall, if a
>> change is observed, 0 should be assigned, and if a change is not
>> observed 1 should be assigned.
>>
>> One potential challenge is that the original data set has over 5
>> million rows. I have ideas, but I'm still getting used the the
>> data.table and plyr syntax. I also seek a base R solution. I'll post
>> the timings on the real data set shortly.
>>
>> Thanks for your help.
>>
>> > sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> # Here is a simplified data set
>>
>> myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
>> "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
>> ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
>> "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1",
>> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3",
>> "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup",
>> "id"), class = "data.frame", row.names = c(NA, -20L))
>>
>>
>> # here is plyr solution with idata.frame
>>
>> library(plyr)
>> imyData <- idata.frame(myData)
>> timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
>> mytime = length(mygroup)))
>>
>> makeCensor <- function(x) {
>> myvec <- rep(0,length(x))
>> lastInd <- length(myvec)
>> myvec[lastInd] = 1
>> myvec
>> }
>>
>>
>> plyrSolution <- ddply(timeData, "id", transform, censor =
>> makeCensor(mygroup))
>>
>>
>> # here is a data table solution
>> # use makeCensor function from above
>>
>> library(data.table)
>> mydt <- data.table(myData)
>> setkey(mydt,id,mygroup)
>>
>> timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
>> makeCensor <- function(x) {
>> myvec <- rep(0,length(x))
>> lastInd <- length(myvec)
>> myvec[lastInd] = 1
>> myvec
>> }
>>
>> mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
>> datatableSolution <- cbind(timeData,mycensor[,list(censor)])
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list