[R] data manipulation and summaries with few million rows
jim holtman
jholtman at gmail.com
Sun Aug 28 00:52:38 CEST 2011
Factors are you friend here:
> myData
mydate gender mygroup id mygrp.f
1 2012-03-25 F A 1 1
2 2005-05-23 F B 2 2
3 2005-09-08 F B 2 2
4 2005-12-07 F B 2 2
5 2006-02-26 F C 2 3
6 2006-05-13 F C 2 3
7 2006-09-01 F C 2 3
8 2006-12-12 F D 2 4
9 2006-02-19 F D 2 4
10 2006-05-03 F D 2 4
11 2006-04-23 F D 2 4
12 2007-12-08 F D 2 4
13 2011-03-19 F D 2 4
14 2007-12-20 M A 3 1
15 2008-06-15 M A 3 1
16 2008-12-16 M A 3 1
17 2009-06-07 M B 3 2
18 2009-10-09 M B 3 2
19 2010-01-28 M B 3 2
20 2007-06-05 M A 4 1
> # change 'mygroup' to a factor so you can use 'diff' to count the changes
> myData$mygrp.f <- as.integer(factor(myData$mygroup))
> # count the changes for each 'id'
> changes <- tapply(myData$mygrp.f, myData$id, function(x){
+ sum(diff(x) != 0)
+ })
>
>
> changes
1 2 3 4
0 2 1 0
>
On Wed, Aug 24, 2011 at 12:48 PM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
> I have a data set with about 6 million rows and 50 columns. It is a
> mixture of dates, factors, and numerics.
>
> What I am trying to accomplish can be seen with the following
> simplified data, which is given as dput output below.
>
>> head(myData)
> mydate gender mygroup id
> 1 2012-03-25 F A 1
> 2 2005-05-23 F B 2
> 3 2005-09-08 F B 2
> 4 2005-12-07 F B 2
> 5 2006-02-26 F C 2
> 6 2006-05-13 F C 2
>
> For each id, I want to count the number of changes of the variable
> 'mygroup' that occur. For example, id=1 has 0 changes because it is
> observed only once. id=2 has 2 changes (B to C, and C to D). I also
> need to calculate the total observation time for each id using the
> variable mydate. In the end, I am trying to have a new data set in
> which each row has an id, days observed, number of changes, and
> gender.
>
> I made some simple summaries using data.table and plyr, but I'm stuck
> on this reformatting.
>
> Thanks for your help.
>
> myData <- structure(list(mydate = c("2012-03-25", "2005-05-23", "2005-09-08",
> "2005-12-07", "2006-02-26", "2006-05-13", "2006-09-01", "2006-12-12",
> "2006-02-19", "2006-05-03", "2006-04-23", "2007-12-08", "2011-03-19",
> "2007-12-20", "2008-06-15", "2008-12-16", "2009-06-07", "2009-10-09",
> "2010-01-28", "2007-06-05"), gender = c("F", "F", "F", "F", "F",
> "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M",
> "M", "M"), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D",
> "D", "D", "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"),
> id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
> 3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate", "gender",
> "mygroup", "id"), class = "data.frame", row.names = c(NA, -20L
> ))
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
More information about the R-help
mailing list