[R] Removing & generating data by category

David Winsemius dwinsemius at comcast.net
Fri Oct 30 02:26:35 CET 2009


Color me puzzled. Can you express the rule more clearly in Boolean logic?

If someone has five policies: 3 Life and 2 General ...  is he in or out?

Applying the alternate strategy to that data set I get:
> dat$uid <- paste(dat$id, ".", dat$loc, sep="")  # uid built as in the earlier reply
> out <- tapply( dat$clm, dat$uid, paste, collapse=",")
> out
                         A1.B1                          A2.B2                          A3.B1
                     "General"                 "General,Life"                      "General"
                         A3.B3                          A4.B4                          A5.B5
"General,Life,General,General"         "General,Life,General"                 "General,Life"

Please explain why you want A3.B3.
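
One possible reading of the desired output, offered only as a sketch: within each id/loc pair, every Life record cancels exactly one General record, and whatever is left over is kept. That rule is an assumption inferred from the rows Steven lists, not something he has confirmed, and the names uid, n_gen, n_life, surplus and rank_from_end are made up for this illustration.

```r
# Rebuild the 13-row example from the quoted message.
dat <- data.frame(
  id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5","A3","A3","A4"),
  loc = c("B1","B2","B3","B4","B5","B1","B2","B3","B4","B5","B3","B3","B4"),
  clm = c(rep("General", 6), rep("Life", 4), rep("General", 3)),
  stringsAsFactors = FALSE
)

# Hypothetical helper key: one value per id/loc pair.
uid <- paste(dat$id, dat$loc, sep = ".")

# Number of General and Life records in each id/loc group, repeated per row.
n_gen  <- ave(as.numeric(dat$clm == "General"), uid, FUN = sum)
n_life <- ave(as.numeric(dat$clm == "Life"),    uid, FUN = sum)

# Assumed rule: each Life cancels one General, so only the surplus survives.
surplus  <- abs(n_gen - n_life)
majority <- ifelse(n_gen > n_life, "General", "Life")

# Position of each record within its (uid, clm) group, counted from the end,
# so that the last records of the surplus category are the ones retained.
rank_from_end <- ave(seq_along(uid), uid, dat$clm,
                     FUN = function(i) rev(seq_along(i)))

result <- dat[dat$clm == majority & rank_from_end <= surplus, ]
result
```

On the 13-row example this keeps rows 1, 6, 11, 12 and 13. Under the same assumed rule, the "3 Life and 2 General" case above would leave a single Life record. The sketch uses only vectorised base R calls (paste, ave, ifelse), so it avoids an explicit loop over rows.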

-- 
David.

On Oct 29, 2009, at 8:56 PM, Steven Kang wrote:

> I highly appreciate all the help.
>
> I have one more thing to resolve.
>
> Suppose 3 additional records are appended to the previous arbitrary
> data set, i.e.
> > a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
> +                      c("A3","A2","A3","A4","A5")),
> +                 loc=c("B1","B2","B3","B4","B5"),
> +                 clm=c(rep(("General"),6),rep("Life",4)))
> > b <- data.frame(id=c("A3","A3","A4"),
> +                 loc=c("B3","B3","B4"),
> +                 clm=rep("General",3))
> > dat <- rbind(a,b)
> > dat
>    id loc     clm
> 1  A1  B1 General
> 2  A2  B2 General
> 3  A3  B3 General
> 4  A4  B4 General
> 5  A5  B5 General
> 6  A3  B1 General
> 7  A2  B2    Life
> 8  A3  B3    Life
> 9  A4  B4    Life
> 10 A5  B5    Life
> 11 A3  B3 General
> 12 A3  B3 General
> 13 A4  B4 General
>
> The records with row number 3, 11 & 12 and records with row number 4  
> & 13 are identical.
>
>      id  loc    clm                id  loc   clm
> 3  A3  B3 General          4 A4 B4 General
> 11 A3  B3 General        13 A4 B4 General
> 12 A3  B3 General
>
> The provided solutions do not perform one-to-one matching (i.e. all
> the matching duplicated records are removed).
>
> The desired output is:
>
>      id   loc  clm
> 1   A1  B1 General
> 6   A3  B1 General
> 11 A3  B3 General
> 12 A3  B3 General
> 13 A4  B4 General
>
> Is there a solution to this problem using a 'merging' function or some
> other method?
>
> Thanks
>
> Steven
>
> On Thu, Oct 29, 2009 at 10:30 PM, Adaikalavan Ramasamy <a.ramasamy at imperial.ac.uk> wrote:
> Here is another way based on pasting ids as hinted below:
>
>
> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
>                   c("A3","A2","A3","A4","A5")),
>                   loc=c("B1","B2","B3","B4","B5"),
>                   clm=c(rep(("General"),6),rep("Life",4)))
>
> a$uid <- paste(a$id, ".", a$loc, sep="")
>
> out <- tapply( a$clm, a$uid, paste ) # can also add collapse=","
> $A1.B1
> [1] "General"
>
> $A2.B2
> [1] "General" "Life"
>
> $A3.B1
> [1] "General"
>
> $A3.B3
> [1] "General" "Life"
>
> $A4.B4
> [1] "General" "Life"
>
> $A5.B5
> [1] "General" "Life"
>
>
> Then here are those with single policies.
>
> > out[ which( sapply(out, length) == 1 ) ]
> $A1.B1
> [1] "General"
>
> $A3.B1
> [1] "General"
>
>
>
>
> David Winsemius wrote:
> On Oct 28, 2009, at 9:30 PM, Steven Kang wrote:
>
> Dear R users,
>
>
> Basically, from the following arbitrary data set:
>
> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
>                      c("A3","A2","A3","A4","A5")),
>                 loc=c("B1","B2","B3","B4","B5"),
>                 clm=c(rep(("General"),6),rep("Life",4)))
>
> a
>   id   loc  clm
> 1  A1  B1 General
> 2  A2  B2 General
> 3  A3  B3 General
> 4  A4  B4 General
> 5  A5  B5 General
> 6  A3  B1 General
> 7  A2  B2    Life
> 8  A3  B3    Life
> 9  A4  B4    Life
> 10 A5  B5    Life
>
> I would like to remove records (highlighted above) that have identical
> values in the fields "id" and "loc" but different values of "clm" (i.e.
> according to category).
>
> Take a look at this merge operation on separate rows of "a".
>
>  > merge( a[a$clm=="Life", ], a[a$clm=="General", ], by=c("id", "loc"), all=TRUE)
>   id loc clm.x   clm.y
> 1 A1  B1  <NA> General
> 2 A2  B2  Life General
> 3 A3  B1  <NA> General
> 4 A3  B3  Life General
> 5 A4  B4  Life General
> 6 A5  B5  Life General
>
> Assignment of that object and selection with is.na should complete
> the process.
>
>  > a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ], by=c("id", "loc"), all=TRUE)
>
>  > a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ]
>   id loc clm.x   clm.y
> 1 A1  B1  <NA> General
> 3 A3  B1  <NA> General
>
> Alternate methods might include paste-ing id to loc and removing
> duplicates.
>
>
> i.e
> categ <- table(a$id,a$clm)
> categ
>    General Life
>  A1       1    0
>  A2       1    1
>  A3       2    1
>  A4       1    1
>  A5       1    1
>
> The desired output is
>
>   id   loc  clm
> 1  A1  B1 General
> 6  A3  B1 General
>
> Because the data set I am working on is quite big (~ 800,000 x 20),
> with the majority of the field values being long strings, looping
> turned out to be very inefficient for comparing individual rows.
>
> Are there any alternative, efficient methods for solving this
> problem?
> Steven
>
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



