[R] Removing & generating data by category

Thu Oct 29 12:30:43 CET 2009

Here is another way based on pasting ids as hinted below:

a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
                    c("A3","A2","A3","A4","A5")),
                    loc=c("B1","B2","B3","B4","B5"),
                    clm=c(rep(("General"),6),rep("Life",4)))

a$uid <- paste(a$id, ".", a$loc, sep="")

out <- tapply( a$clm, a$uid, paste ) # can also add collapse=","

 > out
$A1.B1
[1] "General"

$A2.B2
[1] "General" "Life"

$A3.B1
[1] "General"

$A3.B3
[1] "General" "Life"

$A4.B4
[1] "General" "Life"

$A5.B5
[1] "General" "Life"
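
The collapse="," variant mentioned in the comment above could look roughly like the sketch below. With collapse, each group is reduced to a single string, so tapply returns a named character vector rather than a list, and the groups with a single policy are those whose string contains no comma. The names out2 and single are only illustrative, stringsAsFactors = FALSE is my own assumption to keep clm as character, and the comma test assumes that no clm value contains a comma itself.

```r
# Same example data as above, rebuilt so this block runs on its own.
a <- data.frame(id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5"),
                loc = rep(c("B1","B2","B3","B4","B5"), 2),
                clm = c(rep("General", 6), rep("Life", 4)),
                stringsAsFactors = FALSE)
a$uid <- paste(a$id, ".", a$loc, sep = "")

# One comma-separated string of policies per uid.
out2 <- tapply(a$clm, a$uid, paste, collapse = ",")

# Keep the uids whose string has no comma, i.e. only one policy.
single <- out2[!grepl(",", out2, fixed = TRUE)]
names(single)
```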

Then here are those with single policies.

 > out[ which( sapply(out, length) == 1 ) ]
$A1.B1
[1] "General"

$A3.B1
[1] "General"
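
If the original rows are wanted rather than a list, and with ~800,000 rows in mind, a fully vectorised sketch along the "remove duplicates" line hinted at below might be the following. It is only an illustration: key, once and res are names I made up, and it treats "single policy" as "the id/loc pair occurs exactly once".

```r
# Same example data as above, rebuilt so this block runs on its own.
a <- data.frame(id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5"),
                loc = rep(c("B1","B2","B3","B4","B5"), 2),
                clm = c(rep("General", 6), rep("Life", 4)),
                stringsAsFactors = FALSE)

# Combined id/loc key, as with a$uid above.
key <- paste(a$id, a$loc, sep = ".")

# duplicated() flags the second and later occurrences of a key;
# fromLast = TRUE flags the earlier ones, so together they mark
# every row whose key appears more than once.
once <- !(duplicated(key) | duplicated(key, fromLast = TRUE))

res <- a[once, ]
res
```

If the same id/loc pair can appear several times with the same clm, this drops those rows too; in that case one would count the distinct clm values per key instead.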

David Winsemius wrote:
> On Oct 28, 2009, at 9:30 PM, Steven Kang wrote:
> 
>> Dear R users,
>>
>>
>> Basically, from the following arbitrary data set:
>>
>> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
>>                      c("A3","A2","A3","A4","A5")),
>>                 loc=c("B1","B2","B3","B4","B5"),
>>                 clm=c(rep(("General"),6),rep("Life",4)))
>>
>>> a
>>    id   loc  clm
>> 1  A1  B1 General
>> 2  A2  B2 General
>> 3  A3  B3 General
>> 4  A4  B4 General
>> 5  A5  B5 General
>> 6  A3  B1 General
>> 7  A2  B2    Life
>> 8  A3  B3    Life
>> 9  A4  B4    Life
>> 10 A5  B5    Life
>>
>> I would like to remove the records (highlighted above) that have
>> identical values in the fields "id" and "loc" but different values of
>> "clm" (i.e. according to category).
> 
> Take a look at this merge operation on separate rows of "a".
> 
>  > merge( a[a$clm=="Life", ], a[a$clm=="General", ],
>           by=c("id", "loc"), all=TRUE)
>    id loc clm.x   clm.y
> 1 A1  B1  <NA> General
> 2 A2  B2  Life General
> 3 A3  B1  <NA> General
> 4 A3  B3  Life General
> 5 A4  B4  Life General
> 6 A5  B5  Life General
> 
> Assignment of that object and selection with is.na should complete the  
> process.
> 
>  > a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ],
>                  by=c("id", "loc"), all=TRUE)
> 
>  > a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ]
>    id loc clm.x   clm.y
> 1 A1  B1  <NA> General
> 3 A3  B1  <NA> General
> 
> Alternate methods might include paste-ing id to loc and removing  
> duplicates.
> 
> 
>> i.e
>>> categ <- table(a$id,a$clm)
>>> categ
>>     General Life
>>  A1       1    0
>>  A2       1    1
>>  A3       2    1
>>  A4       1    1
>>  A5       1    1
>>
>> The desired output is
>>
>>    id   loc  clm
>> 1  A1  B1 General
>> 6  A3  B1 General
>>
>> Because the data set I am working on is quite big (~ 800,000 x 20),
>> with the majority of the field values being long strings, looping
>> turned out to be very inefficient for comparing individual rows.
>>
>> Are there any more efficient alternative methods for solving this
>> problem?
>> Steven