[R] Select the last two rows by id group
Muenchen, Robert A (Bob)
muenchen at utk.edu
Wed Mar 21 15:46:35 CET 2007
Marc, thanks for so many great variations! I especially like:
tail(sort(table(DF$County)))
I often have frequency tables that are of interest only towards the end.
Cheers,
Bob
=========================================================
Bob Muenchen (pronounced Min'-chen), Manager
Statistical Consulting Center
U of TN Office of Information Technology
200 Stokely Management Center, Knoxville, TN 37996-0520
Voice: (865) 974-5230
FAX: (865) 974-4810
Email: muenchen at utk.edu
Web: http://oit.utk.edu/scc,
News: http://listserv.utk.edu/archives/statnews.html
=========================================================
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Tuesday, March 20, 2007 8:58 PM
> To: Muenchen, Robert A (Bob)
> Cc: R-help at stat.math.ethz.ch
> Subject: Re: [R] Select the last two rows by id group
>
> On Tue, 2007-03-20 at 11:53 -0400, Muenchen, Robert A (Bob) wrote:
> > Very nice! This is almost duplicates the SAS first.var and last.var
> > ability to choose the first and last observations by group(s).
> > Substituting the head function in where Marc has the tail function
> below
> > will adapt it to the first n. It is more flexible than the SAS
> approach
> > because it can do the first/last n rather than just the single first
> or
> > last.
> >
> > Let's say we want to choose the last observation in a county, and
> > counties have duplicate names in different states. You could sort by
> > state, then county, then use only county where Marc uses score$id in
> his
> > last example below, and it would get the last record for *every*
> county
> > regardless of duplicates. Does this sound correct?
> >
> > That's a handy bit of code!
> >
> > Cheers,
> > Bob
>
> Bob,
>
> You can test it using data here:
>
> DF <- read.csv("http://www.nws.noaa.gov/nwr/SameCode.txt",
> header = FALSE)
>
> colnames(DF) <- c("Code", "County", "State")
>
> > str(DF)
> 'data.frame': 3288 obs. of 3 variables:
> $ Code : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
> $ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 97 105 116
> 169 186 249 259 272 326 348 ...
> $ State : Factor w/ 60 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2
> ...
>
>
> The data is already sorted by State and then County.
>
>
> > system.time(DF.tail <- do.call("rbind", lapply(split(DF, DF$County),
> tail, 1)))
> [1] 6.851 0.085 7.085 0.000 0.000
>
>
> > str(DF.tail)
> 'data.frame': 1996 obs. of 3 variables:
> $ Code : int 45001 22001 16001 40001 55001 50001 72001 72003 72005
> 72007 ...
> $ County: Factor w/ 1996 levels "Abbeville","Acadia",..: 1 2 3 4 5 6
7
> 8 9 10 ...
> $ State : Factor w/ 60 levels "AK","AL","AR",..: 48 22 17 42 58 56 45
> 45 45 45 ...
>
>
> # How many unique county names in the source dataset?
>
> > length(unique(DF$County))
> [1] 1996
>
>
> # Are they all the same unique counties?
>
> > all(DF.tail$County == sort(unique(DF$County)))
> [1] TRUE
>
>
> It is curious to see just how many duplicates there are. For example:
>
> > tail(sort(table(DF$County)))
>
> Madison Jackson Lincoln Franklin Jefferson Washington
> 20 24 24 25 26 31
>
>
> > subset(DF, County == "Washington")
> Code County State
> 65 1129 Washington AL
> 181 5143 Washington AR
> 304 8121 Washington CO
> 385 12133 Washington FL
> 535 13303 Washington GA
> 593 16087 Washington ID
> 688 17189 Washington IL
> 783 18175 Washington IN
> 879 19183 Washington IA
> 987 20201 Washington KS
> 1106 21229 Washington KY
> 1167 22117 Washington LA
> 1189 23029 Washington ME
> 1211 24043 Washington MD
> 1393 27163 Washington MN
> 1474 28151 Washington MS
> 1590 29221 Washington MO
> 1740 31177 Washington NE
> 1883 36115 Washington NY
> 1981 37187 Washington NC
> 2124 39167 Washington OH
> 2202 40147 Washington OK
> 2239 41067 Washington OR
> 2304 42125 Washington PA
> 2313 44009 Washington RI
> 2515 47179 Washington TN
> 2759 48477 Washington TX
> 2800 49053 Washington UT
> 2814 50023 Washington VT
> 2904 51191 Washington VA
> 3108 55131 Washington WI
>
>
> # The last state with Washington County (my neighbors, the
> "Cheeseheads") was in the result set
>
> > subset(DF.tail, County == "Washington")
> Code County State
> Washington 55131 Washington WI
>
>
>
> > subset(DF, County == "Allen")
> Code County State
> 697 18003 Allen IN
> 887 20001 Allen KS
> 993 21003 Allen KY
> 1113 22003 Allen LA
> 2042 39003 Allen OH
>
>
> # The last state with Allen County (OH) was in the result set
>
> > subset(DF.tail, County == "Allen")
> Code County State
> Allen 39003 Allen OH
>
>
> Just noticed a Big Ten theme there...Go Gophers! ;-)
>
>
> So, it would seem that your hypothesis is correct, at least in this
> limited testing. I would want to validate it more rigorously of
> course.
>
> HTH,
>
> Marc Schwartz
>
More information about the R-help
mailing list