[R] data manipulation and summaries with few million rows

Juliet Hannah juliet.hannah at gmail.com
Wed Aug 24 18:48:52 CEST 2011


I have a data set with about 6 million rows and 50 columns. It is a
mixture of dates, factors, and numerics.

What I am trying to accomplish can be seen with the following
simplified data, which is given as dput output below.

> head(myData)
      mydate gender mygroup id
1 2012-03-25      F       A  1
2 2005-05-23      F       B  2
3 2005-09-08      F       B  2
4 2005-12-07      F       B  2
5 2006-02-26      F       C  2
6 2006-05-13      F       C  2

For each id, I want to count the number of changes of the variable
'mygroup' that occur. For example, id=1 has 0 changes because it is
observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
need to calculate the total observation time for each id using the
variable mydate.  In the end, I am trying to have a new data set in
which each row has an id, days observed, number of changes, and
gender.

I made some simple summaries using data.table and plyr, but I'm stuck
on this reformatting.

Thanks for your help.

myData <- structure(list(mydate = c("2012-03-25", "2005-05-23", "2005-09-08",
"2005-12-07", "2006-02-26", "2006-05-13", "2006-09-01", "2006-12-12",
"2006-02-19", "2006-05-03", "2006-04-23", "2007-12-08", "2011-03-19",
"2007-12-20", "2008-06-15", "2008-12-16", "2009-06-07", "2009-10-09",
"2010-01-28", "2007-06-05"), gender = c("F", "F", "F", "F", "F",
"F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M",
"M", "M"), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D",
"D", "D", "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"),
    id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate", "gender",
"mygroup", "id"), class = "data.frame", row.names = c(NA, -20L
))

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base



More information about the R-help mailing list