[R] Selecting n observation

William Dunlap wdunlap at tibco.com
Fri Oct 12 21:07:18 CEST 2012


> do.call("rbind",
>        by(df, INDICES=df$ID, FUN=function(DF) tail(DF, 2) ))

Another way to approach this sort of problem is to use ave() to
assign a within-group sequence number to each row and then
select the rows with the sequence numbers you want.  You can
also use ave() to make a column giving the size of the group that
each item is in.  Hence you can select things like "the last 2 items
in each category that had at least 3 items".
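
As a minimal sketch of that idea, here it is applied to the ID/week/outcome
data from the original question (quoted below).  It assumes the rows are
already in week order within each ID; the names Seq, N and last2 are just
illustrative.

```r
# Original poster's data, typed in from the quoted message
df <- data.frame(
    ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
    week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
    outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84))

# Within-group row number and group size, both aligned with the rows of df
Seq <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)
N   <- ave(seq_len(nrow(df)), df$ID, FUN = length)

# Keep the last two rows of each ID (relies on the rows being sorted by week)
last2 <- df[Seq > N - 2, ]
last2
```

This should keep rows 2, 3, 7, 8, 12 and 13, i.e. the last two assessments
for IDs 1, 4 and 9.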

E.g., here is a function to generate data on visits of patients to
a clinic, where the visits are listed in time order.

makeData <- function(nVisits, Doctors=paste("Dr.",LETTERS[1:2]), Patients=101:104, seed = 1)
{
    if (!is.null(seed)) set.seed(seed)
    data.frame(Doctor=sample(Doctors, replace=TRUE, nVisits),
               Patient=sample(Patients, replace=TRUE, nVisits),
               Date=as.Date("2004-01-01")+sort(sample(2000, replace=TRUE, nVisits)))
}
# Make a 12-row dataset
d <- makeData(12)
# Add columns describing the visits between each doctor/patient pair
d1 <- within(d, { N=ave(integer(length(Date)), Doctor, Patient, FUN=length)
                    Seq=ave(integer(length(Date)), Doctor, Patient, FUN=seq_along)})
d1
#    Doctor Patient       Date Seq N
# 1   Dr. A     103 2004-01-28   1 3
# 2   Dr. A     102 2005-01-08   1 1
# 3   Dr. B     104 2005-06-19   1 4
# 4   Dr. B     102 2005-11-12   1 2
# 5   Dr. A     103 2006-02-04   2 3
# 6   Dr. B     104 2006-02-12   2 4
# 7   Dr. B     102 2006-08-23   2 2
# 8   Dr. B     104 2006-09-15   3 4
# 9   Dr. B     104 2007-04-15   4 4
# 10  Dr. A     101 2007-08-30   1 2
# 11  Dr. A     103 2008-07-13   3 3
# 12  Dr. A     101 2008-10-06   2 2

# Show the last visit in each doctor/patient group
d[d1$Seq==d1$N, ]
#    Doctor Patient       Date
# 2   Dr. A     102 2005-01-08
# 7   Dr. B     102 2006-08-23
# 9   Dr. B     104 2007-04-15
# 11  Dr. A     103 2008-07-13
# 12  Dr. A     101 2008-10-06

# Show last 2 visits, but only if there were at least 2 visits
d[d1$Seq>d1$N-2 & d1$N>=2, ]
#    Doctor Patient       Date
# 4   Dr. B     102 2005-11-12
# 5   Dr. A     103 2006-02-04
# 7   Dr. B     102 2006-08-23
# 8   Dr. B     104 2006-09-15
# 9   Dr. B     104 2007-04-15
# 10  Dr. A     101 2007-08-30
# 11  Dr. A     103 2008-07-13
# 12  Dr. A     101 2008-10-06

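# Show the amount of time between the last two visits in a group (if there were at least 2 visits).
# Order both subsets by group first, so the subtraction pairs rows from the same group.
last <- d[d1$Seq==d1$N & d1$N>=2, ]
prev <- d[d1$Seq==d1$N-1 & d1$N>=2, ]
last[order(last$Doctor, last$Patient), "Date"] - prev[order(prev$Doctor, prev$Patient), "Date"]
# Time differences in days
# [1] 403 890 284 212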
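
Subtracting two separately selected sets of rows only works if both sets list
the groups in the same order.  One way to avoid relying on that is to let
ave() compute the gap to the previous visit within each group and then pick
the gap on each group's last row.  This is only a sketch on a small made-up
dataset; the data frame v and the Gap column are illustrative names, not part
of the example above.

```r
# Hypothetical toy data, already in date order within each doctor/patient pair
v <- data.frame(
    Doctor  = c("Dr. A", "Dr. B", "Dr. A", "Dr. B", "Dr. B"),
    Patient = c(101, 102, 101, 102, 102),
    Date    = as.Date(c("2005-01-01", "2005-02-01", "2005-01-11",
                        "2005-03-01", "2005-03-16")))

idx <- seq_len(nrow(v))
Seq <- ave(idx, v$Doctor, v$Patient, FUN = seq_along)
N   <- ave(idx, v$Doctor, v$Patient, FUN = length)

# Days since the previous visit in the same group (NA for a group's first visit)
Gap <- ave(as.numeric(v$Date), v$Doctor, v$Patient,
           FUN = function(x) c(NA, diff(x)))

# Gap attached to the last visit of every group that had at least 2 visits
res <- cbind(v, Gap)[Seq == N & N >= 2, ]
res
```

Because Gap is computed inside each group, every difference stays on the row
it belongs to, whatever order the groups appear in.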

I find it easier to formulate the queries with this method.  For large
datasets, selecting rows according to a criterion can be a lot
faster than splitting a data.frame into many parts, processing
them with tail, and combining them again.
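
A rough way to check the speed difference on your own machine is sketched
below.  The data frame big, its size, and the number of groups are arbitrary
assumptions; the timings will vary with the data and the R version, so none
are shown here.

```r
set.seed(1)
n <- 20000
# Hypothetical data: about 2000 groups, sorted by ID
big <- data.frame(ID = sample(2000, n, replace = TRUE), value = rnorm(n))
big <- big[order(big$ID), ]
rownames(big) <- NULL

# Split the data.frame, take tail() of each piece, and recombine
t_split <- system.time(
    viaBy <- do.call("rbind",
                     by(big, INDICES = big$ID, FUN = function(DF) tail(DF, 2))))

# ave() for sequence number and group size, then one logical subscript
t_ave <- system.time({
    Seq <- ave(seq_len(n), big$ID, FUN = seq_along)
    N   <- ave(seq_len(n), big$ID, FUN = length)
    viaAve <- big[Seq > N - 2, ]
})

# Both approaches should pick the same rows (row names differ, so ignore attributes)
stopifnot(isTRUE(all.equal(viaBy, viaAve, check.attributes = FALSE)))
c(split = t_split[["elapsed"]], ave = t_ave[["elapsed"]])
```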

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of David Winsemius
> Sent: Thursday, October 11, 2012 2:13 PM
> To: bibek sharma
> Cc: r-help at r-project.org
> Subject: Re: [R] Selecting n observation
> 
> 
> On Oct 11, 2012, at 12:48 PM, bibek sharma wrote:
> 
> > Hello R help,
> > I have a question similar to one that was posted before. My
> > problem is that instead of the last assessment, I want to choose the last two.
> >
> > I have a data set with several time assessments for each participant.
> > I want to select the last assessment for each participant. My dataset
> > looks like this:
> > ID  week  outcome
> > 1   2   14
> > 1   4   28
> > 1   6   42
> > 4   2   14
> > 4   6   46
> > 4   9   64
> > 4   9   71
> > 4  12   85
> > 9   2   14
> > 9   4   28
> > 9   6   51
> > 9   9   66
> > 9  12   84
> >
> > Here is one solution for choosing last assessment
> > do.call("rbind",
> >        by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
> 
> Why wouldn't the solution be something along the lines of:
> 
> do.call("rbind",
>        by(df, INDICES=df$ID, FUN=function(DF) tail(DF, 2) ))
> 
> 
> >  ID week outcome
> > 1  1    6      42
> > 4  4   12      85
> > 9  9   12      84
> >
> >
> 
> 
> David Winsemius, MD
> Alameda, CA, USA
> 



