[R] Select the last two rows by id group
Marc Schwartz
marc_schwartz at comcast.net
Wed Mar 21 01:58:21 CET 2007
On Tue, 2007-03-20 at 11:53 -0400, Muenchen, Robert A (Bob) wrote:
> Very nice! This is almost duplicates the SAS first.var and last.var
> ability to choose the first and last observations by group(s).
> Substituting the head function in where Marc has the tail function below
> will adapt it to the first n. It is more flexible than the SAS approach
> because it can do the first/last n rather than just the single first or
> last.
>
> Let's say we want to choose the last observation in a county, and
> counties have duplicate names in different states. You could sort by
> state, then county, then use only county where Marc uses score$id in his
> last example below, and it would get the last record for *every* county
> regardless of duplicates. Does this sound correct?
>
> That's a handy bit of code!
>
> Cheers,
> Bob
Bob,
You can test it using data here:
DF <- read.csv("http://www.nws.noaa.gov/nwr/SameCode.txt",
header = FALSE)
colnames(DF) <- c("Code", "County", "State")
> str(DF)
'data.frame': 3288 obs. of 3 variables:
$ Code : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
$ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 97 105 116 169 186 249 259 272 326 348 ...
$ State : Factor w/ 60 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
The data is already sorted by State and then County.
> system.time(DF.tail <- do.call("rbind", lapply(split(DF, DF$County), tail, 1)))
[1] 6.851 0.085 7.085 0.000 0.000
> str(DF.tail)
'data.frame': 1996 obs. of 3 variables:
$ Code : int 45001 22001 16001 40001 55001 50001 72001 72003 72005 72007 ...
$ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 1 2 3 4 5 6 7 8 9 10 ...
$ State : Factor w/ 60 levels "AK","AL","AR",..: 48 22 17 42 58 56 45 45 45 45 ...
# How many unique county names in the source dataset?
> length(unique(DF$County))
[1] 1996
# Are they all the same unique counties?
> all(DF.tail$County == sort(unique(DF$County)))
[1] TRUE
It is curious to see just how many duplicates there are. For example:
> tail(sort(table(DF$County)))
Madison Jackson Lincoln Franklin Jefferson Washington
20 24 24 25 26 31
> subset(DF, County == "Washington")
Code County State
65 1129 Washington AL
181 5143 Washington AR
304 8121 Washington CO
385 12133 Washington FL
535 13303 Washington GA
593 16087 Washington ID
688 17189 Washington IL
783 18175 Washington IN
879 19183 Washington IA
987 20201 Washington KS
1106 21229 Washington KY
1167 22117 Washington LA
1189 23029 Washington ME
1211 24043 Washington MD
1393 27163 Washington MN
1474 28151 Washington MS
1590 29221 Washington MO
1740 31177 Washington NE
1883 36115 Washington NY
1981 37187 Washington NC
2124 39167 Washington OH
2202 40147 Washington OK
2239 41067 Washington OR
2304 42125 Washington PA
2313 44009 Washington RI
2515 47179 Washington TN
2759 48477 Washington TX
2800 49053 Washington UT
2814 50023 Washington VT
2904 51191 Washington VA
3108 55131 Washington WI
# The last state with Washington County (my neighbors, the
"Cheeseheads") was in the result set
> subset(DF.tail, County == "Washington")
Code County State
Washington 55131 Washington WI
> subset(DF, County == "Allen")
Code County State
697 18003 Allen IN
887 20001 Allen KS
993 21003 Allen KY
1113 22003 Allen LA
2042 39003 Allen OH
# The last state with Allen County (OH) was in the result set
> subset(DF.tail, County == "Allen")
Code County State
Allen 39003 Allen OH
Just noticed a Big Ten theme there...Go Gophers! ;-)
So, it would seem that your hypothesis is correct, at least in this
limited testing. I would want to validate it more rigorously of course.
HTH,
Marc Schwartz
More information about the R-help
mailing list