[R] how to subset based on other row values and multiplicity

Williams Scott Scott.Williams at petermac.org
Wed Jul 16 16:11:08 CEST 2014


Thanks guys - amazingly prompt solutions from the R community as always.

Yes, the c-y value reverts to just the first date event - the spirit of
this is that I am trying to identify and confirm a list of diagnoses that
a patient has coded in government administrative data. Once a diagnosis is
made and confirmed, I am not interested in whether it is listed again and
again later on. I just need that date at which it first became apparent.
So in the multiple c-y case, the min date is the correct one. Some cases
will have the same diagnosis listed dozens of times, hence the very
bloated dataset.
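
For illustration only, the "keep just the first date per diagnosis" step could be sketched in base R roughly as follows. It reuses the toy data from Jim's post quoted below; the object name first_dx is hypothetical. It does not apply the confirmation rule (a repeat listing at least 31 days later) that Jim's code handles, so unconfirmed single listings such as b-w are still kept here.

```r
# Toy data copied from the quoted example; real administrative data will differ.
x <- read.table(text = "id date value
a 2000-01-01 x
a 2000-03-01 x
b 2000-11-11 w
c 2000-11-11 y
c 2000-10-01 y
c 2000-09-10 y
c 2000-12-12 z
c 2000-10-11 z
d 2000-11-11 w
d 2000-11-10 w", header = TRUE, as.is = TRUE)

x$date <- as.Date(x$date)

# Sort so that the earliest date comes first within each id/value pair.
x <- x[order(x$id, x$value, x$date), ]

# Keep only the first (earliest) row for each id/value pair.
# first_dx is a hypothetical name, not something from the thread.
first_dx <- x[!duplicated(x[c("id", "value")]), ]
rownames(first_dx) <- NULL
first_dx
```

Sorting first and then dropping duplicates means each repeated diagnosis collapses to its minimum date, which is the behaviour described above for the multiple c-y case.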

Time to churn through the data is not a big issue, so I will have a go
with Jim's neat code he just sent on perhaps a few thousand rows and see
how I get on. 

S



On 17/07/2014 12:09 am, "John McKown" <john.archie.mckown at gmail.com> wrote:

>On Wed, Jul 16, 2014 at 8:51 AM, jim holtman <jholtman at gmail.com> wrote:
>> I can reproduce what you requested, but there was the question about
>> what happens with the multiple 'c-y' values.
>>
>> ====================
>>
>>> require(data.table)
>>> x <- read.table(text = 'id   date value
>> + a    2000-01-01 x
>> + a    2000-03-01 x
>> + b    2000-11-11 w
>> + c    2000-11-11 y
>> + c    2000-10-01 y
>> + c    2000-09-10 y
>> + c    2000-12-12 z
>> + c    2000-10-11 z
>> + d    2000-11-11 w
>> + d    2000-11-10 w', as.is = TRUE, header = TRUE)
>>> setDT(x)
>>> x[, date := as.Date(date)]
>>> setkey(x, id, value, date)
>>>
>>> y <- x[
>> +     , {
>> +         if (.N == 1) val <- NULL  # only one -- delete
>> +         else {
>> +             dif <- difftime(tail(date, -1), head(date, -1), units = 'days')
>> +             # return first date if any gap between consecutive dates is >= 31 days
>> +             if (any(dif >= 31)) val <- list(date = date[1L])
>> +             else val <- NULL
>> +         }
>> +         val
>> +       }
>> +     , keyby = 'id,value'
>> +     ]
>>> y
>>    id value       date
>> 1:  a     x 2000-01-01
>> 2:  c     y 2000-09-10
>> 3:  c     z 2000-10-11
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>
>Wow, I picked up a couple of _nice_ techniques from that one post!
>Looks like "data.table" will let me do SQL like things in R. I have a
>warped brain. I think in "result sets" and "matrix operations"
>
>Many thanks.
>
>-- 
>There is nothing more pleasant than traveling and meeting new people!
>Genghis Khan
>
>Maranatha! <><
>John McKown




