[R] how to subset based on other row values and multiplicity

William Dunlap wdunlap at tibco.com
Wed Jul 16 21:48:10 CEST 2014


Using base R you can solve this by sorting and then comparing
the first and last dates in each id-value group.  Computing the first
and last dates can be vectorized.

f1 <- function(data) {
    # sort by id, break ties with value, break remaining ties with date
    sortedData <- data[with(data, order(id, value, date)), ]
    i <- seq_len(NROW(sortedData)-1)
    # a 'group' has same id and value, entries in group are sorted by date
    isBreakPoint <- with(sortedData, id[i]!=id[i+1] | value[i]!=value[i+1])
    isFirstInGroup <- c(TRUE, isBreakPoint)
    isLastInGroup <- c(isBreakPoint, TRUE)
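    firstRows <- sortedData[isFirstInGroup, ]
    lastDates <- sortedData[isLastInGroup, "date"]
    # keep the earliest row of each group whose dates span at least 31 days
    firstRows[lastDates - firstRows$date >= 31, ]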
}
dat <- read.table(colClasses=c("character", "Date", "character"),
header=TRUE, text=
"id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w")

> f1(dat)
  id       date value
1  a 2000-01-01     x
6  c 2000-09-10     y
8  c 2000-10-11     z
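
As a cross-check, here is a minimal sketch of the same rule written
without sorting, using ave() to broadcast each group's first and last
date back onto its rows.  The name f2 and the pasted grouping key are my
own illustration, not part of the code above, and I have not timed it
against f1 on large data.

```r
dat <- read.table(colClasses = c("character", "Date", "character"),
    header = TRUE, text =
"id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w")

# f2 is a hypothetical helper name, used here only for illustration
f2 <- function(data) {
    d <- as.numeric(data$date)          # days since 1970-01-01
    # a single pasted key avoids empty id-by-value combinations in ave()
    key <- paste(data$id, data$value, sep = "\r")
    first <- ave(d, key, FUN = min)     # earliest date in each row's group
    last <- ave(d, key, FUN = max)      # latest date in each row's group
    # earliest row of each group spanning >= 31 days; drop exact duplicates
    keep <- (last - first >= 31) & (d == first) &
        !duplicated(data[c("id", "value", "date")])
    out <- data[keep, ]
    out[order(out$id, out$value), ]
}
f2(dat)
```

On the sample data this should select the same three rows as f1(dat).
The pasted key assumes that neither id nor value contains a carriage
return.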

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Wed, Jul 16, 2014 at 7:49 AM, arun <smartpink111 at yahoo.com> wrote:
> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat%>%
> group_by(id,value)%>%
>
> arrange(date=as.Date(date))%>%
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> #  id       date value
> #1  a 2000-01-01     x
> #2  c 2000-09-10     y
> #3  c 2000-10-11     z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <Scott.Williams at petermac.org> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as 'confirmed'
> in an individual ('id') if they occur
> more than once at least 30 days apart.
>
>
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
