[R] how to subset based on other row values and multiplicity

Wed Jul 16 15:51:20 CEST 2014

I can reproduce what you requested, but there was the question about
what happens with the multiple 'c-y' values.

====================

> require(data.table)
> x <- read.table(text = 'id   date value
+ a    2000-01-01 x
+ a    2000-03-01 x
+ b    2000-11-11 w
+ c    2000-11-11 y
+ c    2000-10-01 y
+ c    2000-09-10 y
+ c    2000-12-12 z
+ c    2000-10-11 z
+ d    2000-11-11 w
+ d    2000-11-10 w', as.is = TRUE, header = TRUE)
> setDT(x)
> x[, date := as.Date(date)]
> setkey(x, id, value, date)
>
> y <- x[
+     , {
+         if (.N == 1) val <- NULL  # only one -- delete
+         else {
+             dif <- difftime(tail(date, -1), head(date, -1), units = 'days')
+             # return first value if any > 31
+             if (any(dif >= 31)) val <- list(date = date[1L])
+             else val <- NULL
+         }
+         val
+       }
+     , keyby = 'id,value'
+     ]
> y
   id value       date
1:  a     x 2000-01-01
2:  c     y 2000-09-10
3:  c     z 2000-10-11

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Wed, Jul 16, 2014 at 9:25 AM, John McKown
<john.archie.mckown at gmail.com> wrote:
> On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
> <Scott.Williams at petermac.org> wrote:
>> Hi R experts,
>>
>> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
>> in an individual (Œid¹) if they occur
>> more than once at least 30 days apart.
>>
>>
>> id   date value
>> a    2000-01-01 x
>> a    2000-03-01 x
>> b    2000-11-11 w
>> c    2000-11-11 y
>> c    2000-10-01 y
>> c    2000-09-10 y
>> c    2000-12-12 z
>> c    2000-10-11 z
>> d    2000-11-11 w
>> d    2000-11-10 w
>>
>>
>> I wish to subset the data to retain rows where the value for the
>> individual is confirmed more than 30 days apart. So, after deleting all
>> rows with just one occurrence of id and value, the rest would be the
>> earliest occurrence of each value in each case id, provided 31 or more
>> days exist between the dates. If >1 value is present per id, each value
>> level needs to be assessed independently. This example would then reduce
>> to:
>>
>>
>> id   date           value
>> a    2000-01-01 x
>> c    2000-09-10 y
>> c    2000-10-11 z
>
> Question: the c-y id-value pair occurs 3 times. In two cases
> (2000-11-11 vs. 2000-10-01 & 2000-11-11 vs 2000-09-01) the difference
> is >30 days. Why isn't
> c 2000-10-01 y
> also part of the result? Is it because you only want a single id-value
> pair in which the date is the minimal? Or you want the one in which
> the date difference is maximal? Or you overlooked that particular
> match? I can't figure it out from your description.
>
>>
>>
>>
>> I can do this via some crude loops and subsetting, but I am looking for as
>> much efficiency as possible
>> as the dataset has around 50 million rows to assess. Any suggestions
>> welcomed.
>
> Hum, is the source of this data in a relational database such as
> Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
> because some of this processing might be easier do to in the data base
> using a "self join", instead of reading the entire relational table
> into a data.frame and doing it in R.
>
>>
>> Thanks in advance
>>
>> Scott Williams MD
>> Melbourne, Australia
>>
>
> --
> There is nothing more pleasant than traveling and meeting new people!
> Genghis Khan
>
> Maranatha! <><
> John McKown
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.