[R] Select the last two rows by id group

Wed Mar 21 01:58:21 CET 2007

On Tue, 2007-03-20 at 11:53 -0400, Muenchen, Robert A (Bob) wrote:
> Very nice! This is almost duplicates the SAS first.var and last.var
> ability to choose the first and last observations by group(s).
> Substituting the head function in where Marc has the tail function below
> will adapt it to the first n. It is more flexible than the SAS approach
> because it can do the first/last n rather than just the single first or
> last.
> 
> Let's say we want to choose the last observation in a county, and
> counties have duplicate names in different states. You could sort by
> state, then county, then use only county where Marc uses score$id in his
> last example below, and it would get the last record for *every* county
> regardless of duplicates. Does this sound correct? 
> 
> That's a handy bit of code!
> 
> Cheers,
> Bob

Bob,

You can test it using data here:

DF <- read.csv("http://www.nws.noaa.gov/nwr/SameCode.txt", 
               header = FALSE)

colnames(DF) <- c("Code", "County", "State")

> str(DF)
'data.frame':   3288 obs. of  3 variables:
 $ Code  : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
 $ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 97 105 116 169 186 249 259 272 326 348 ...
 $ State : Factor w/ 60 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...

The data is already sorted by State and then County.

> system.time(DF.tail <- do.call("rbind", lapply(split(DF, DF$County), tail,  1)))
[1] 6.851 0.085 7.085 0.000 0.000

> str(DF.tail)
'data.frame':   1996 obs. of  3 variables:
 $ Code  : int  45001 22001 16001 40001 55001 50001 72001 72003 72005 72007 ...
 $ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ State : Factor w/ 60 levels "AK","AL","AR",..: 48 22 17 42 58 56 45 45 45 45 ...

# How many unique county names in the source dataset?

> length(unique(DF$County))
[1] 1996

# Are they all the same unique counties?

> all(DF.tail$County == sort(unique(DF$County)))
[1] TRUE

It is curious to see just how many duplicates there are. For example:

> tail(sort(table(DF$County)))

   Madison    Jackson    Lincoln   Franklin  Jefferson Washington 
        20         24         24         25         26         31 

> subset(DF, County == "Washington")
      Code     County State
65    1129 Washington    AL
181   5143 Washington    AR
304   8121 Washington    CO
385  12133 Washington    FL
535  13303 Washington    GA
593  16087 Washington    ID
688  17189 Washington    IL
783  18175 Washington    IN
879  19183 Washington    IA
987  20201 Washington    KS
1106 21229 Washington    KY
1167 22117 Washington    LA
1189 23029 Washington    ME
1211 24043 Washington    MD
1393 27163 Washington    MN
1474 28151 Washington    MS
1590 29221 Washington    MO
1740 31177 Washington    NE
1883 36115 Washington    NY
1981 37187 Washington    NC
2124 39167 Washington    OH
2202 40147 Washington    OK
2239 41067 Washington    OR
2304 42125 Washington    PA
2313 44009 Washington    RI
2515 47179 Washington    TN
2759 48477 Washington    TX
2800 49053 Washington    UT
2814 50023 Washington    VT
2904 51191 Washington    VA
3108 55131 Washington    WI

# The last state with Washington County (my neighbors, the
"Cheeseheads") was in the result set

> subset(DF.tail, County == "Washington")
            Code     County State
Washington 55131 Washington    WI

> subset(DF, County == "Allen")
      Code County State
697  18003  Allen    IN
887  20001  Allen    KS
993  21003  Allen    KY
1113 22003  Allen    LA
2042 39003  Allen    OH

# The last state with Allen County (OH) was in the result set

> subset(DF.tail, County == "Allen")
       Code County State
Allen 39003  Allen    OH

Just noticed a Big Ten theme there...Go Gophers!   ;-)

So, it would seem that your hypothesis is correct, at least in this
limited testing.  I would want to validate it more rigorously of course.

HTH,

Marc Schwartz