[R] how to subset based on other row values and multiplicity
William Dunlap
wdunlap at tibco.com
Wed Jul 16 21:48:10 CEST 2014
Using base R, you can solve this by sorting the data and then comparing
the first and last dates in each id-value group. Computing the first
and last dates can be vectorized.
f1 <- function(data) {
    # sort by id, break ties with value, break remaining ties with date
    sortedData <- data[with(data, order(id, value, date)), ]
    i <- seq_len(NROW(sortedData)-1)
    # a 'group' has same id and value, entries in group are sorted by date
    isBreakPoint <- with(sortedData, id[i]!=id[i+1] | value[i]!=value[i+1])
    isFirstInGroup <- c(TRUE, isBreakPoint)
    isLastInGroup <- c(isBreakPoint, TRUE)
    # keep the first row of each group whose last date is >= 31 days after its first
    sortedData[isFirstInGroup,][sortedData[isLastInGroup,"date"] -
        sortedData[isFirstInGroup,"date"] >= 31,]
}
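To see how the break-point trick marks group boundaries, here is a small
sketch on a made-up, already-sorted key vector (the vector `key` is only for
illustration and is not part of the original data):

```r
# Toy key vector, already sorted; a group is a run of equal values.
key <- c("a", "a", "b", "c", "c", "c")
i <- seq_len(length(key) - 1)

# TRUE wherever element i differs from element i+1, i.e. a group ends at i
isBreakPoint <- key[i] != key[i + 1]

# The first element always starts a group; the last always ends one.
isFirstInGroup <- c(TRUE, isBreakPoint)
isLastInGroup  <- c(isBreakPoint, TRUE)

which(isFirstInGroup)  # 1 3 4
which(isLastInGroup)   # 2 3 6
```

The two logical vectors always select the same number of rows, one per group
and in the same group order, which is why the first and last dates can be
subtracted elementwise.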
dat <- read.table(colClasses=c("character", "Date", "character"),
header=TRUE, text=
"id date value
a 2000-01-01 x
a 2000-03-01 x
b 2000-11-11 w
c 2000-11-11 y
c 2000-10-01 y
c 2000-09-10 y
c 2000-12-12 z
c 2000-10-11 z
d 2000-11-11 w
d 2000-11-10 w")
> f1(dat)
id date value
1 a 2000-01-01 x
6 c 2000-09-10 y
8 c 2000-10-11 z
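As a cross-check, the same first-versus-last comparison can be sketched
without sorting, using ave() to compute each group's earliest and latest
date. This is an alternative for verifying the result on small data, not the
method above; the names `g`, `d`, `first`, `last` and `res` are introduced
here for illustration. Because ave() splits the data by group, it is likely
slower than the sort-based f1() on very large inputs (not measured here).

```r
dat <- read.table(colClasses = c("character", "Date", "character"),
                  header = TRUE, text = "id date value
a 2000-01-01 x
a 2000-03-01 x
b 2000-11-11 w
c 2000-11-11 y
c 2000-10-01 y
c 2000-09-10 y
c 2000-12-12 z
c 2000-10-11 z
d 2000-11-11 w
d 2000-11-10 w")

# One grouping factor per id-value pair; drop=TRUE avoids empty combinations
g <- interaction(dat$id, dat$value, drop = TRUE)
d <- as.numeric(dat$date)            # dates as day counts

first <- ave(d, g, FUN = min)        # earliest date in each row's group
last  <- ave(d, g, FUN = max)        # latest date in each row's group

# Keep the earliest row of every group spanning 31 or more days.
# Note: if a group's earliest date occurs twice, both rows are kept.
res <- dat[last - first >= 31 & d == first, ]
res <- res[order(res$id, res$value), ]
res   # rows 1, 6 and 8, the same three rows as f1(dat)
```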
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Jul 16, 2014 at 7:49 AM, arun <smartpink111 at yahoo.com> wrote:
> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat%>%
> group_by(id,value)%>%
> arrange(date=as.Date(date))%>%
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> # id date value
> #1 a 2000-01-01 x
> #2 c 2000-09-10 y
> #3 c 2000-10-11 z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <Scott.Williams at petermac.org> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as 'confirmed'
> in an individual ('id') if they occur
> more than once at least 30 days apart.
>
>
> id date value
> a 2000-01-01 x
> a 2000-03-01 x
> b 2000-11-11 w
> c 2000-11-11 y
> c 2000-10-01 y
> c 2000-09-10 y
> c 2000-12-12 z
> c 2000-10-11 z
> d 2000-11-11 w
> d 2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id date value
> a 2000-01-01 x
> c 2000-09-10 y
> c 2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.