[R] subsets
Peter Ehlers
ehlers at ucalgary.ca
Thu Jan 20 14:56:47 CET 2011
On 2011-01-20 02:05, Taras Zakharko wrote:
> Hello Den,
>
> Your problem is not as simple as it may seem, so Ivan's suggestion is only a partial answer. I see that each patient can have
> more than one diagnosis, and I take it that you want to isolate patients based on particular conditions.
> Thus, simply looking for "ah" or "ihd" as Ivan suggests will yield patients who have either of those, but not
> necessarily patients who have both.
>
> Instead, one must apply the condition to the whole set of diagnoses associated with each patient.
> I think this is best done with the aggregate function. This function splits the data according to some
> factor (in our case the patient id) and performs a routine on each subset (in our case
> a condition test):
>
>
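> # Each call answers one of the three questions; run the one you need
> # (each assignment overwrites 'ids').
> ids <- aggregate(diagnosis ~ id, df, function(x) "ah" %in% x && "ihd" %in% x)     # ah and ihd
> ids <- aggregate(diagnosis ~ id, df, function(x) "ah" %in% x && !("ihd" %in% x))  # ah, no ihd
> ids <- aggregate(diagnosis ~ id, df, function(x) !("ah" %in% x) && "ihd" %in% x)  # ihd, no ah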
>
> Now, ids will contain a data frame like:
>
> id diagnosis
> 1 TRUE
> 2 FALSE
> 3 FALSE
> ...
>
> which shows which patients have the set of diagnoses you asked for. You can then use these
> patients to subset the original data with something like:
>
> subset(df, id %in% subset(ids, diagnosis == TRUE)$id)
>
> This first extracts from the 'ids' data frame only the patients for which the condition holds, and then extracts the
> associated diagnosis sets from the original 'df' data frame.
>
> Hope it helps,
>
> Taras

Here's a tidy version using the plyr package:

  library(plyr)
  # one row per patient, with a logical column for each condition
  df1 <- ddply(df, .(id), summarize,
    has.both     =  ("ah" %in% diagnosis) &  ("ihd" %in% diagnosis),
    has.only.ah  =  ("ah" %in% diagnosis) & !("ihd" %in% diagnosis),
    has.only.ihd = !("ah" %in% diagnosis) &  ("ihd" %in% diagnosis)
  )

Further processing on the columns of df1 is straightforward.

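
For illustration only, here is a minimal sketch of what that further processing might look like. It assumes the five patients shown in the original question as sample data, and it builds df1 with base R's tapply() as a stand-in for the ddply() call so that it runs without plyr. The names both, only.ah and only.ihd are made up for this example.

```r
# Sample data copied from the original question (patients 1 to 5 only).
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  diagnosis = c("ah", "ah", "ihd", "im", "ah", "stroke",
                "ah", "ihd", "angina", "ihd"),
  stringsAsFactors = FALSE
)

# Base R stand-in for the ddply() summary: one logical per patient.
has.ah  <- tapply(df$diagnosis, df$id, function(x) "ah"  %in% x)
has.ihd <- tapply(df$diagnosis, df$id, function(x) "ihd" %in% x)
df1 <- data.frame(
  id           = as.numeric(names(has.ah)),
  has.both     = as.vector(has.ah & has.ihd),
  has.only.ah  = as.vector(has.ah & !has.ihd),
  has.only.ihd = as.vector(!has.ah & has.ihd)
)

# Keep every row of df that belongs to a flagged patient.
both     <- df[df$id %in% df1$id[df1$has.both], ]
only.ah  <- df[df$id %in% df1$id[df1$has.only.ah], ]
only.ihd <- df[df$id %in% df1$id[df1$has.only.ihd], ]
```

With these sample rows, both holds patients 2 and 4, only.ah holds patients 1 and 3, and only.ihd holds patient 5. Each result keeps all of a patient's diagnoses, not just the "ah" or "ihd" rows.
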
Peter Ehlers
> On Jan 20, 2011, at 9:53 , Den wrote:
>
>> Dear R people
>> Could you please help.
>>
>> Basically, there are two variables in my data set. Each patient ('id')
>> may have one or more diseases ('diagnosis'). It looks like
>>
>> id diagnosis
>> 1 ah
>> 2 ah
>> 2 ihd
>> 2 im
>> 3 ah
>> 3 stroke
>> 4 ah
>> 4 ihd
>> 4 angina
>> 5 ihd
>> ..............
>> Q: How to make three data sets:
>> 1. Patients with ah and ihd
>> 2. Patients with ah but no ihd
>> 3. Patients with ihd but no ah?
>>
>> If you have any ideas, could you just guide me on what I should look for? Is it
>> subset or aggregate, or loops, or something else??? I am a bit lost. (F1
>> F1 F1 !!!:)
>> Thank you
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.