[R] How to remove rows based on frequency of factor and then difference date scores

David Winsemius dwinsemius at comcast.net
Tue Aug 24 20:17:37 CEST 2010


On Aug 24, 2010, at 1:59 PM, Abhijit Dasgupta, PhD wrote:

> The only problem with this is that Chris's unique individuals are a  
> combination of Type and ID, as I understand it. So Type=A, ID=1 is a  
> different individual from Type=B,ID=1. So we need to create a unique  
> identifier per person, simplistically by uniqueID=paste(Type, ID,  
> sep=''). Then, using this new identifier, everything follows.

I see your point. I agree that a tapply method should pass both
factors in the INDEX argument.

 > new.df <- txt.df[ -which( txt.df$nn <=1), ]
 > new.df <- new.df[ with(new.df, order(Type, ID) ), ]  # and possibly needs to be ordered?
 > new.df$diffdays <- unlist( tapply(new.df$dt2, list(new.df$ID, new.df$Type), function(x) x[1] -x) )
 > new.df
   Type ID       Date Value        dt2 nn diffdays
1    A  1 16/09/2020     8 2020-09-16  3        0
2    A  1 23/09/2010     9 2010-09-23  3     3646
4    B  1  13/5/2010     6 2010-05-13  3        0

But I do not agree that you need, in this case at least, to create a
paste()-y index. I agree, however, that such a construction can be
useful in other situations.
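
Note that the nn column shown above was still computed on ID alone, which is why the Type B, ID 1 row survives with nn = 3. What follows is only a minimal sketch of how both factors might be carried through the count as well as the differences. It assumes the four example rows from the original question, rebuilt here by hand, and it swaps unlist(tapply()) for ave() on the numeric dates so that no prior sorting is needed. That swap is my own choice for the sketch, not something tested on the full data.

```r
# Rebuild the four example rows from the original question (typed in by hand)
txt.df <- data.frame(
  Type  = c("A", "A", "B", "B"),
  ID    = c(1, 1, 3, 1),
  Date  = c("16/09/2020", "23/09/2010", "18/8/2010", "13/5/2010"),
  Value = c(8, 9, 7, 6),
  stringsAsFactors = FALSE
)
txt.df$dt2 <- as.Date(txt.df$Date, format = "%d/%m/%Y")

# Count rows per (Type, ID) pair rather than per ID alone
txt.df$nn <- ave(txt.df$ID, txt.df$Type, txt.df$ID, FUN = length)

# Keep only individuals with more than one measurement
new.df <- txt.df[txt.df$nn > 1, ]

# Days between the first row of each (Type, ID) group and every row in it,
# same sign convention as above (x[1] - x); ave() returns values in the
# original row order, so the data frame does not have to be sorted first
new.df$diffdays <- ave(as.numeric(new.df$dt2), new.df$Type, new.df$ID,
                       FUN = function(x) x[1] - x)
new.df
```

On these four rows only the two Type A, ID 1 rows remain, with diffdays of 0 and 3646. Using txt.df$nn > 1 instead of -which(txt.df$nn <= 1) also avoids returning an empty data frame when no row has nn <= 1.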

-- 
David.
>
> On 08/24/2010 01:53 PM, David Winsemius wrote:
>>
>> On Aug 24, 2010, at 1:19 PM, Chris Beeley wrote:
>>
>>> Hello-
>>>
>>> A basic question which has nonetheless floored me entirely. I have a
>>> dataset which looks like this:
>>>
>>> Type  ID     Date            Value
>>> A       1    16/09/2020       8
>>> A       1     23/09/2010      9
>>> B       3     18/8/2010        7
>>> B       1     13/5/2010        6
>>>
>>> There are two Types, which correspond to different individuals in
>>> different conditions, and loads of ID labels (1:50) corresponding to
>>> the different individuals in each condition, and measurements at
>>> different times (from 1 to 10 measurements) for each individual.
>>>
>>> I want to perform the following operations:
>>>
>>> 1) Delete all individuals for whom only one measurement is available.
>>> In the dataset above, you can see that I want to delete the row Type B
>>> ID 3, and Type B ID 1, but without deleting the Type A ID 1 data
>>> because there is more than one measurement for Type A ID 1 (but not
>>> for Type B ID1)
>>>
>>> 2) Produce difference scores for each of the Dates, so each individual
>>> (Type A ID1 and all the others for whom more than one measurement
>>> exists) starts at Date "1" and goes up in integers according to how
>>> many days have elapsed.
>>>
>>> I just know there's some incredibly cunning R-ish way of doing this
>>> but after many hours of fiddling I have had to admit defeat.
>>
>> Not sure about terribly cunning. Let's assume your dataframe was  
>> read in with stringsAsFactors=FALSE and is called txt.df:
>>
>>
>> > txt.df$dt2 <- as.Date(txt.df$Date, format="%d/%m/%Y")
>> > txt.df
>>  Type ID       Date Value        dt2
>> 1    A  1 16/09/2020     8 2020-09-16
>> 2    A  1 23/09/2010     9 2010-09-23
>> 3    B  3  18/8/2010     7 2010-08-18
>> 4    B  1  13/5/2010     6 2010-05-13
>>
>> > txt.df$nn <- ave(txt.df$ID,txt.df$ID, FUN=length)
>> > txt.df
>>  Type ID       Date Value        dt2 nn
>> 1    A  1 16/09/2020     8 2020-09-16  3
>> 2    A  1 23/09/2010     9 2010-09-23  3
>> 3    B  3  18/8/2010     7 2010-08-18  1
>> 4    B  1  13/5/2010     6 2010-05-13  3
>> > txt.df[ -which( txt.df$nn <=1), ]
>>  Type ID       Date Value        dt2 nn
>> 1    A  1 16/09/2020     8 2020-09-16  3
>> 2    A  1 23/09/2010     9 2010-09-23  3
>> 4    B  1  13/5/2010     6 2010-05-13  3
>>
>> # Task #1 accomplished
>>
>> > tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x)
>> $`1`
>> Time differences in days
>> [1]    0 3646 3779
>>
>> $`3`
>> Time difference of 0 days
>>
>> > unlist( tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) )
>>  11   12   13    3
>>   0 3646 3779    0
>> > txt.df$diffdays <- unlist( tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) )
>> > txt.df
>>  Type ID       Date Value        dt2 nn diffdays
>> 1    A  1 16/09/2020     8 2020-09-16  3        0
>> 2    A  1 23/09/2010     9 2010-09-23  3     3646
>> 3    B  3  18/8/2010     7 2010-08-18  1     3779
>> 4    B  1  13/5/2010     6 2010-05-13  3        0


David Winsemius, MD
West Hartford, CT


