[R] Select the last two rows by id group

Wed Mar 21 15:46:35 CET 2007

Marc, thanks for so many great variations! I especially like:

tail(sort(table(DF$County)))

I often have frequency tables that are of interest only towards the end.

Cheers,
Bob

=========================================================
  Bob Muenchen (pronounced Min'-chen), Manager  
  Statistical Consulting Center
  U of TN Office of Information Technology
  200 Stokely Management Center, Knoxville, TN 37996-0520
  Voice: (865) 974-5230  
  FAX:   (865) 974-4810
  Email: muenchen at utk.edu
  Web:   http://oit.utk.edu/scc, 
  News:  http://listserv.utk.edu/archives/statnews.html
=========================================================

> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Tuesday, March 20, 2007 8:58 PM
> To: Muenchen, Robert A (Bob)
> Cc: R-help at stat.math.ethz.ch
> Subject: Re: [R] Select the last two rows by id group
> 
> On Tue, 2007-03-20 at 11:53 -0400, Muenchen, Robert A (Bob) wrote:
> > Very nice! This is almost duplicates the SAS first.var and last.var
> > ability to choose the first and last observations by group(s).
> > Substituting the head function in where Marc has the tail function
> below
> > will adapt it to the first n. It is more flexible than the SAS
> approach
> > because it can do the first/last n rather than just the single first
> or
> > last.
> >
> > Let's say we want to choose the last observation in a county, and
> > counties have duplicate names in different states. You could sort by
> > state, then county, then use only county where Marc uses score$id in
> his
> > last example below, and it would get the last record for *every*
> county
> > regardless of duplicates. Does this sound correct?
> >
> > That's a handy bit of code!
> >
> > Cheers,
> > Bob
> 
> Bob,
> 
> You can test it using data here:
> 
> DF <- read.csv("http://www.nws.noaa.gov/nwr/SameCode.txt",
>                header = FALSE)
> 
> colnames(DF) <- c("Code", "County", "State")
> 
> > str(DF)
> 'data.frame':   3288 obs. of  3 variables:
>  $ Code  : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
>  $ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 97 105 116
> 169 186 249 259 272 326 348 ...
>  $ State : Factor w/ 60 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2
> ...
> 
> 
> The data is already sorted by State and then County.
> 
> 
> > system.time(DF.tail <- do.call("rbind", lapply(split(DF, DF$County),
> tail,  1)))
> [1] 6.851 0.085 7.085 0.000 0.000
> 
> 
> > str(DF.tail)
> 'data.frame':   1996 obs. of  3 variables:
>  $ Code  : int  45001 22001 16001 40001 55001 50001 72001 72003 72005
> 72007 ...
>  $ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 1 2 3 4 5 6
7
> 8 9 10 ...
>  $ State : Factor w/ 60 levels "AK","AL","AR",..: 48 22 17 42 58 56 45
> 45 45 45 ...
> 
> 
> # How many unique county names in the source dataset?
> 
> > length(unique(DF$County))
> [1] 1996
> 
> 
> # Are they all the same unique counties?
> 
> > all(DF.tail$County == sort(unique(DF$County)))
> [1] TRUE
> 
> 
> It is curious to see just how many duplicates there are. For example:
> 
> > tail(sort(table(DF$County)))
> 
>    Madison    Jackson    Lincoln   Franklin  Jefferson Washington
>         20         24         24         25         26         31
> 
> 
> > subset(DF, County == "Washington")
>       Code     County State
> 65    1129 Washington    AL
> 181   5143 Washington    AR
> 304   8121 Washington    CO
> 385  12133 Washington    FL
> 535  13303 Washington    GA
> 593  16087 Washington    ID
> 688  17189 Washington    IL
> 783  18175 Washington    IN
> 879  19183 Washington    IA
> 987  20201 Washington    KS
> 1106 21229 Washington    KY
> 1167 22117 Washington    LA
> 1189 23029 Washington    ME
> 1211 24043 Washington    MD
> 1393 27163 Washington    MN
> 1474 28151 Washington    MS
> 1590 29221 Washington    MO
> 1740 31177 Washington    NE
> 1883 36115 Washington    NY
> 1981 37187 Washington    NC
> 2124 39167 Washington    OH
> 2202 40147 Washington    OK
> 2239 41067 Washington    OR
> 2304 42125 Washington    PA
> 2313 44009 Washington    RI
> 2515 47179 Washington    TN
> 2759 48477 Washington    TX
> 2800 49053 Washington    UT
> 2814 50023 Washington    VT
> 2904 51191 Washington    VA
> 3108 55131 Washington    WI
> 
> 
> # The last state with Washington County (my neighbors, the
> "Cheeseheads") was in the result set
> 
> > subset(DF.tail, County == "Washington")
>             Code     County State
> Washington 55131 Washington    WI
> 
> 
> 
> > subset(DF, County == "Allen")
>       Code County State
> 697  18003  Allen    IN
> 887  20001  Allen    KS
> 993  21003  Allen    KY
> 1113 22003  Allen    LA
> 2042 39003  Allen    OH
> 
> 
> # The last state with Allen County (OH) was in the result set
> 
> > subset(DF.tail, County == "Allen")
>        Code County State
> Allen 39003  Allen    OH
> 
> 
> Just noticed a Big Ten theme there...Go Gophers!   ;-)
> 
> 
> So, it would seem that your hypothesis is correct, at least in this
> limited testing.  I would want to validate it more rigorously of
> course.
> 
> HTH,
> 
> Marc Schwartz
>