[R] Selecting n observation
William Dunlap
wdunlap at tibco.com
Fri Oct 12 21:07:18 CEST 2012
> do.call("rbind",
> by(df, INDICES=df$ID, FUN=function(DF) tail(DF, 2) ))
Another way to approach this sort of problem is to use ave() to
assign a within-group sequence number to each row and then
select the rows with the sequence numbers you want. You can
also use ave() to make a column giving the size of the group that
each item is in. Hence you can select things like "the last 2 items
in each category that had at least 3 items".
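Applied to the data in the original question, the idea might be sketched as follows. This is only an illustration: the data frame `df` is retyped by hand from the quoted message, and the column names `Seq` and `N` and the result name `last2` are choices made here, not part of the question.

```r
# The data from the original question, typed in by hand.
df <- data.frame(
    ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
    week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
    outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84))

# Within-group sequence number and group size, both computed by ave().
df$Seq <- ave(seq_along(df$ID), df$ID, FUN = seq_along)
df$N   <- ave(seq_along(df$ID), df$ID, FUN = length)

# Last 2 assessments of each participant. This relies on the rows
# already being in week order within each ID, as they are here.
last2 <- df[df$Seq > df$N - 2, ]
last2
```

Adding `& df$N >= 3` to the row selection would restrict the result to participants with at least 3 assessments.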
E.g., here is a function to generate data on visits of patients to
a clinic, where the visits are listed in time order.
makeData <- function(nVisits, Doctors=paste("Dr.",LETTERS[1:2]), Patients=101:104, seed = 1)
{
    if (!is.null(seed)) set.seed(seed)
    data.frame(Doctor=sample(Doctors, replace=TRUE, nVisits),
               Patient=sample(Patients, replace=TRUE, nVisits),
               Date=as.Date("2004-01-01")+sort(sample(2000, replace=TRUE, nVisits)))
}
# Make a 12-row dataset
d <- makeData(12)
# Add columns describing the visits between each doctor/patient pair
d1 <- within(d, { N=ave(integer(length(Date)), Doctor, Patient, FUN=length)
                  Seq=ave(integer(length(Date)), Doctor, Patient, FUN=seq_along)})
d1
#    Doctor Patient       Date Seq N
# 1   Dr. A     103 2004-01-28   1 3
# 2   Dr. A     102 2005-01-08   1 1
# 3   Dr. B     104 2005-06-19   1 4
# 4   Dr. B     102 2005-11-12   1 2
# 5   Dr. A     103 2006-02-04   2 3
# 6   Dr. B     104 2006-02-12   2 4
# 7   Dr. B     102 2006-08-23   2 2
# 8   Dr. B     104 2006-09-15   3 4
# 9   Dr. B     104 2007-04-15   4 4
# 10  Dr. A     101 2007-08-30   1 2
# 11  Dr. A     103 2008-07-13   3 3
# 12  Dr. A     101 2008-10-06   2 2
# Show the last visit in each doctor/patient group
d[d1$Seq==d1$N, ]
#    Doctor Patient       Date
# 2   Dr. A     102 2005-01-08
# 7   Dr. B     102 2006-08-23
# 9   Dr. B     104 2007-04-15
# 11  Dr. A     103 2008-07-13
# 12  Dr. A     101 2008-10-06
# Show last 2 visits, but only if there were at least 2 visits
d[d1$Seq>d1$N-2 & d1$N>=2, ]
#    Doctor Patient       Date
# 4   Dr. B     102 2005-11-12
# 5   Dr. A     103 2006-02-04
# 7   Dr. B     102 2006-08-23
# 8   Dr. B     104 2006-09-15
# 9   Dr. B     104 2007-04-15
# 10  Dr. A     101 2007-08-30
# 11  Dr. A     103 2008-07-13
# 12  Dr. A     101 2008-10-06
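# Show the amount of time between the last two visits in a group (if there were at least 2 visits).
# Sort by group first: subtracting two separately selected vectors pairs them by position,
# and in time order the last and next-to-last rows of different groups do not line up.
o <- d1[order(d1$Doctor, d1$Patient, d1$Seq), ]
o[o$Seq==o$N & o$N>=2, "Date"] - o[o$Seq==o$N-1 & o$N>=2, "Date"]
# Time differences in days
# [1] 403 890 284 212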
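A variation that labels each gap with its group, so the result does not depend on how the rows are ordered across groups, might look like the sketch below. Subtracting two separately selected vectors pairs their elements by position, which matches up groups only when the rows are sorted by group. The 12 visits are retyped by hand from the table printed above so the result does not depend on the random number generator; the helper `lastGap` and the names `grp` and `gaps` are made up for this illustration.

```r
# The 12 visits shown above, entered by hand (already in time order).
d <- data.frame(
    Doctor  = c("Dr. A", "Dr. A", "Dr. B", "Dr. B", "Dr. A", "Dr. B",
                "Dr. B", "Dr. B", "Dr. B", "Dr. A", "Dr. A", "Dr. A"),
    Patient = c(103, 102, 104, 102, 103, 104, 102, 104, 104, 101, 103, 101),
    Date    = as.Date(c("2004-01-28", "2005-01-08", "2005-06-19", "2005-11-12",
                        "2006-02-04", "2006-02-12", "2006-08-23", "2006-09-15",
                        "2007-04-15", "2007-08-30", "2008-07-13", "2008-10-06")))

# Days between the last two visits of one group;
# NA when the group has fewer than 2 visits.
lastGap <- function(days) {
    if (length(days) < 2) NA_real_ else diff(tail(days, 2))
}

# One key per doctor/patient pair, e.g. "Dr. B 104".
grp  <- paste(d$Doctor, d$Patient)
gaps <- tapply(as.numeric(d$Date), grp, lastGap)

# Keep only the pairs that had at least 2 visits.
gaps[!is.na(gaps)]
# Dr. A 101: 403, Dr. A 103: 890, Dr. B 102: 284, Dr. B 104: 212 (days)
```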
I find it easier to formulate the queries with this method. For large
datasets, selecting rows according to a criterion can be a lot
faster than splitting a data.frame into many parts, processing
them with tail, and combining them again.
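One rough way to check the speed claim on your own machine is sketched below. The simulated data frame `big`, its size, the seed, and the wrapper names `viaBy` and `viaAve` are all assumptions made for this illustration, and the timings will vary with the data and the version of R, so none are quoted here.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
n <- 5000
big <- data.frame(ID = sample(1000, n, replace = TRUE), value = rnorm(n))
big <- big[order(big$ID), ]   # group rows by ID, keeping original order within ID
rownames(big) <- NULL

# Split/apply/combine, as in the quoted solution.
viaBy <- function(df) {
    do.call("rbind", by(df, INDICES = df$ID, FUN = function(DF) tail(DF, 2)))
}

# Row selection using ave() for the sequence number and group size.
viaAve <- function(df) {
    Seq <- ave(seq_along(df$ID), df$ID, FUN = seq_along)
    N   <- ave(seq_along(df$ID), df$ID, FUN = length)
    df[Seq > N - 2, ]
}

system.time(r1 <- viaBy(big))
system.time(r2 <- viaAve(big))

# Both approaches should pick the same rows (row names differ,
# so compare the columns only).
stopifnot(isTRUE(all.equal(r1$ID, r2$ID)),
          isTRUE(all.equal(r1$value, r2$value)))
```

The two results agree in row order here only because `big` was sorted by `ID` beforehand; `by()` returns its groups in sorted order, while the `ave()` selection keeps whatever order the input rows had.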
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of David Winsemius
> Sent: Thursday, October 11, 2012 2:13 PM
> To: bibek sharma
> Cc: r-help at r-project.org
> Subject: Re: [R] Selecting n observation
>
>
> On Oct 11, 2012, at 12:48 PM, bibek sharma wrote:
>
> > Hello R help,
> > I have a question similar to one that someone posted before. My
> > problem is that instead of the last assessment, I want to choose the last two.
> >
> > I have a data set with several time assessments for each participant.
> > I want to select the last assessment for each participant. My dataset
> > looks like this:
> > ID week outcome
> > 1 2 14
> > 1 4 28
> > 1 6 42
> > 4 2 14
> > 4 6 46
> > 4 9 64
> > 4 9 71
> > 4 12 85
> > 9 2 14
> > 9 4 28
> > 9 6 51
> > 9 9 66
> > 9 12 84
> >
> > Here is one solution for choosing last assessment
> > do.call("rbind",
> > by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
>
> Why wouldn't the solution be something along the lines of:
>
> do.call("rbind",
> by(df, INDICES=df$ID, FUN=function(DF) tail(DF, 2) ))
>
>
> > ID week outcome
> > 1 1 6 42
> > 4 4 12 85
> > 9 9 12 84
> >
> >
>
>
> David Winsemius, MD
> Alameda, CA, USA
>