[R] data manipulation and summaries with few million rows

Wed Aug 24 19:32:28 CEST 2011

Thanks Dennis! I'll check this out.

Just to clarify, I need the total number of switches/changes
regardless of if that state
had occurred in the past. So A-A-B-A, would have 2 changes: A to B and B to A.

Thanks again.

On Wed, Aug 24, 2011 at 1:28 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi Juliet:
>
> Here's a Q & D solution:
>
> # (1) plyr
>> f <- function(d) length(unique(d$mygroup)) - 1
>> ddply(myData, .(id), f)
>  id V1
> 1  1  0
> 2  2  2
> 3  3  1
> 4  4  0
>
> # (2) data.table
>
> myDT <- data.table(myData, key = 'id')
> myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id']
>
> If one can switch back and forth between levels more than once, then
> the above is clearly not appropriate. A more robust method would be to
> employ rle() [run length encoding]:
>
> g <- function(d) length(rle(d$mygroup)$lengths) - 1
> ddply(myData, .(id), g)    # gives the same answer as above
> myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id']   # ditto
>
>
> HTH,
> Dennis
>
> On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
>> I have a data set with about 6 million rows and 50 columns. It is a
>> mixture of dates, factors, and numerics.
>>
>> What I am trying to accomplish can be seen with the following
>> simplified data, which is given as dput output below.
>>
>>> head(myData)
>>      mydate gender mygroup id
>> 1 2012-03-25      F       A  1
>> 2 2005-05-23      F       B  2
>> 3 2005-09-08      F       B  2
>> 4 2005-12-07      F       B  2
>> 5 2006-02-26      F       C  2
>> 6 2006-05-13      F       C  2
>>
>> For each id, I want to count the number of changes of the variable
>> 'mygroup' that occur. For example, id=1 has 0 changes because it is
>> observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
>> need to calculate the total observation time for each id using the
>> variable mydate.  In the end, I am trying to have a new data set in
>> which each row has an id, days observed, number of changes, and
>> gender.
>>
>> I made some simple summaries using data.table and plyr, but I'm stuck
>> on this reformatting.
>>
>> Thanks for your help.
>>
>> myData <- structure(list(mydate = c("2012-03-25", "2005-05-23", "2005-09-08",
>> "2005-12-07", "2006-02-26", "2006-05-13", "2006-09-01", "2006-12-12",
>> "2006-02-19", "2006-05-03", "2006-04-23", "2007-12-08", "2011-03-19",
>> "2007-12-20", "2008-06-15", "2008-12-16", "2009-06-07", "2009-10-09",
>> "2010-01-28", "2007-06-05"), gender = c("F", "F", "F", "F", "F",
>> "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M",
>> "M", "M"), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D",
>> "D", "D", "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"),
>>    id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>    3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate", "gender",
>> "mygroup", "id"), class = "data.frame", row.names = c(NA, -20L
>> ))
>>
>>> sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>