[R] Removing & generating data by category
David Winsemius
dwinsemius at comcast.net
Fri Oct 30 02:26:35 CET 2009
Color me puzzled. Can you express the rule more clearly in Boolean logic?
If someone has five policies: 3 Life and 2 General ... is he in or out?
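To make the question concrete, the counts per id/loc pair can be tabulated and candidate rules written as logical expressions. The following is only a sketch: it assumes the data set from the quoted message below and a uid column built by pasting id and loc (as in Adaikalavan's earlier post). The two rules shown are guesses at what might be meant, not the poster's stated intent.

```r
# Rebuild the example data from the quoted message (strings kept as character)
a <- data.frame(id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5"),
                loc = c("B1","B2","B3","B4","B5"),   # recycled to 10 rows
                clm = c(rep("General", 6), rep("Life", 4)),
                stringsAsFactors = FALSE)
b <- data.frame(id  = c("A3","A3","A4"),
                loc = c("B3","B3","B4"),
                clm = rep("General", 3),
                stringsAsFactors = FALSE)
dat <- rbind(a, b)

# Assumed key: id and loc pasted together
dat$uid <- paste(dat$id, dat$loc, sep = ".")

# Number of General and Life records per id/loc pair
counts <- table(dat$uid, dat$clm)
counts

# Candidate rule 1 (a guess): keep pairs that have only one claim type
only_one_type <- rownames(counts)[counts[, "General"] == 0 | counts[, "Life"] == 0]

# Candidate rule 2 (a guess): keep pairs whose two counts differ
unbalanced <- rownames(counts)[counts[, "General"] != counts[, "Life"]]

only_one_type   # "A1.B1" "A3.B1"
unbalanced      # "A1.B1" "A3.B1" "A3.B3" "A4.B4"
```

The two rules disagree exactly on A3.B3 and A4.B4, which is why the rule needs to be stated explicitly.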
Applying the alternate strategy to that data set I get:
> out <- tapply( dat$clm, dat$uid, paste ,collapse=",")
> out
                         A1.B1                          A2.B2                          A3.B1
                     "General"                 "General,Life"                      "General"
                         A3.B3                          A4.B4                          A5.B5
"General,Life,General,General"         "General,Life,General"                 "General,Life"
Please explain why you want A3.B3.
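One reading that would reproduce the desired output quoted below is one-to-one cancellation: within each id/loc pair, the k-th "General" record cancels the k-th "Life" record, and only the unpaired surplus is kept. This is an assumption about the intended rule, not something the poster has confirmed. A vectorised sketch under that assumption, using ave() rather than a loop:

```r
# Rebuild the example data from the quoted message (strings kept as character)
a <- data.frame(id  = c("A1","A2","A3","A4","A5","A3","A2","A3","A4","A5"),
                loc = c("B1","B2","B3","B4","B5"),   # recycled to 10 rows
                clm = c(rep("General", 6), rep("Life", 4)),
                stringsAsFactors = FALSE)
b <- data.frame(id  = c("A3","A3","A4"),
                loc = c("B3","B3","B4"),
                clm = rep("General", 3),
                stringsAsFactors = FALSE)
dat <- rbind(a, b)

# Assumed key: id and loc pasted together
uid <- paste(dat$id, dat$loc, sep = ".")

# Occurrence number of each row within its (uid, clm) group: 1, 2, 3, ...
k <- ave(seq_along(uid), uid, dat$clm, FUN = seq_along)

# Per-uid counts of each claim type, repeated on every row of that uid
nG <- ave(as.integer(dat$clm == "General"), uid, FUN = sum)
nL <- ave(as.integer(dat$clm == "Life"),    uid, FUN = sum)

# The first min(nG, nL) rows of each type are paired off and dropped;
# whatever is left over is kept
res <- dat[k > pmin(nG, nL), ]
res
```

On this data the rows kept are 1, 6, 11, 12 and 13. Note that the sketch drops the earliest records of each pair and keeps the later ones; with identical duplicates that choice makes no difference to the content, only to the row names.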
--
David.
On Oct 29, 2009, at 8:56 PM, Steven Kang wrote:
> I greatly appreciate all the help.
>
> I have one more thing to resolve.
>
> Suppose 3 additional records are bound to the previous arbitrary
> data set,
> i.e.
> > a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
> c("A3","A2","A3","A4","A5")),
> loc=c("B1","B2","B3","B4","B5"),
> clm=c(rep(("General"),6),rep("Life",4)))
> > b <- data.frame(id=c("A3","A3","A4"),loc=c("B3","B3","B4"),
> clm=rep("General",3))
> > dat <- rbind(a,b)
> > dat
> id loc clm
> 1 A1 B1 General
> 2 A2 B2 General
> 3 A3 B3 General
> 4 A4 B4 General
> 5 A5 B5 General
> 6 A3 B1 General
> 7 A2 B2 Life
> 8 A3 B3 Life
> 9 A4 B4 Life
> 10 A5 B5 Life
> 11 A3 B3 General
> 12 A3 B3 General
> 13 A4 B4 General
>
> The records with row number 3, 11 & 12 and records with row number 4
> & 13 are identical.
>
> id loc clm id loc clm
> 3 A3 B3 General 4 A4 B4 General
> 11 A3 B3 General 13 A4 B4 General
> 12 A3 B3 General
>
> The provided solutions do not perform 1-to-1 matching (i.e. all
> the matching duplicated records are removed).
>
> The desired output is:
>
> id loc clm
> 1 A1 B1 General
> 6 A3 B1 General
> 11 A3 B3 General
> 12 A3 B3 General
> 13 A4 B4 General
>
> Is there a solution to this problem using the merge function or
> some alternative method?
>
> Thanks
>
>
>
> Steven
>
> On Thu, Oct 29, 2009 at 10:30 PM, Adaikalavan Ramasamy
> <a.ramasamy at imperial.ac.uk> wrote:
> Here is another way based on pasting ids as hinted below:
>
>
> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
> c("A3","A2","A3","A4","A5")),
> loc=c("B1","B2","B3","B4","B5"),
> clm=c(rep(("General"),6),rep("Life",4)))
>
> a$uid <- paste(a$id, ".", a$loc, sep="")
>
> out <- tapply( a$clm, a$uid, paste ) # can also add collapse=","
> $A1.B1
> [1] "General"
>
> $A2.B2
> [1] "General" "Life"
>
> $A3.B1
> [1] "General"
>
> $A3.B3
> [1] "General" "Life"
>
> $A4.B4
> [1] "General" "Life"
>
> $A5.B5
> [1] "General" "Life"
>
>
> Then here are those with single policies.
>
> > out[ which( sapply(out, length) == 1 ) ]
> $A1.B1
> [1] "General"
>
> $A3.B1
> [1] "General"
>
>
>
>
> David Winsemius wrote:
> On Oct 28, 2009, at 9:30 PM, Steven Kang wrote:
>
> Dear R users,
>
>
> Basically, from the following arbitrary data set:
>
> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),
> c("A3","A2","A3","A4","A5")),
> loc=c("B1","B2","B3","B4","B5"),
> clm=c(rep(("General"),6),rep("Life",4)))
>
> a
> id loc clm
> 1 A1 B1 General
> 2 A2 B2 General
> 3 A3 B3 General
> 4 A4 B4 General
> 5 A5 B5 General
> 6 A3 B1 General
> 7 A2 B2 Life
> 8 A3 B3 Life
> 9 A4 B4 Life
> 10 A5 B5 Life
>
> I want to remove the records (highlighted above) that have
> identical values in both fields ("id" & "loc") but different
> values of "clm" (i.e. according to category).
>
> Take a look at this merge operation on separate rows of "a".
>
> > merge( a[a$clm=="Life", ], a[a$clm=="General", ] , by=c("id",
> "loc"), all=T)
> id loc clm.x clm.y
> 1 A1 B1 <NA> General
> 2 A2 B2 Life General
> 3 A3 B1 <NA> General
> 4 A3 B3 Life General
> 5 A4 B4 Life General
> 6 A5 B5 Life General
>
> Assignment of that object and selection with is.na should complete
> the process.
>
> > a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ] ,
> by=c("id", "loc"), all=T)
>
> > a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ]
> id loc clm.x clm.y
> 1 A1 B1 <NA> General
> 3 A3 B1 <NA> General
>
> Alternate methods might include paste-ing id to loc and removing
> duplicates.
>
>
> i.e
> categ <- table(a$id,a$clm)
> categ
> General Life
> A1 1 0
> A2 1 1
> A3 2 1
> A4 1 1
> A5 1 1
>
> The desired output is
>
> id loc clm
> 1 A1 B1 General
> 6 A3 B1 General
>
> Because the data set I am working on is quite big (~ 800,000 x 20),
> with the majority of the field values being long strings, looping
> turned out to be very inefficient for comparing individual rows.
>
> Are there any alternative efficient methods for solving this
> problem?
> Steven
>
>
David Winsemius, MD
Heritage Laboratories
West Hartford, CT