[R] Removing & generating data by category
David Winsemius
dwinsemius at comcast.net
Thu Oct 29 02:54:21 CET 2009
On Oct 28, 2009, at 9:30 PM, Steven Kang wrote:
> Dear R users,
>
>
> Basically, from the following arbitrary data set:
>
> a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),c("A3","A2","A3","A4","A5")),
>                 loc=c("B1","B2","B3","B4","B5"),
>                 clm=c(rep("General",6),rep("Life",4)))
>
>> a
>    id loc     clm
> 1  A1  B1 General
> 2  A2  B2 General
> 3  A3  B3 General
> 4  A4  B4 General
> 5  A5  B5 General
> 6  A3  B1 General
> 7  A2  B2    Life
> 8  A3  B3    Life
> 9  A4  B4    Life
> 10 A5  B5    Life
>
> I would like to remove records (highlighted above) that have identical
> values in both fields ("id" & "loc") but different values of "clm"
> (i.e., according to category).
Take a look at this merge operation on separate rows of "a".
> merge( a[a$clm=="Life", ], a[a$clm=="General", ], by=c("id", "loc"), all=TRUE)
  id loc clm.x   clm.y
1 A1  B1  <NA> General
2 A2  B2  Life General
3 A3  B1  <NA> General
4 A3  B3  Life General
5 A4  B4  Life General
6 A5  B5  Life General
Assignment of that object and selection with is.na should complete the
process.
> a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ], by=c("id", "loc"), all=TRUE)
> a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ]
  id loc clm.x   clm.y
1 A1  B1  <NA> General
3 A3  B1  <NA> General
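The merged result carries two category columns (clm.x and clm.y) rather than the single "clm" column of the input. A minimal sketch of folding them back into one column follows. It assumes the columns are character rather than factor (hence stringsAsFactors = FALSE), and the names "unmatched" and "out" are only illustrative.

```r
# Rebuild the example data; loc is written out in full (it was recycled in the post).
a <- data.frame(
  id  = c("A1", "A2", "A3", "A4", "A5", "A3", "A2", "A3", "A4", "A5"),
  loc = rep(c("B1", "B2", "B3", "B4", "B5"), 2),
  clm = c(rep("General", 6), rep("Life", 4)),
  stringsAsFactors = FALSE
)

# Full outer join of the two categories on id and loc.
a2m <- merge(a[a$clm == "Life", ], a[a$clm == "General", ],
             by = c("id", "loc"), all = TRUE)

# Keep only the rows that found no partner in the other category.
unmatched <- a2m[is.na(a2m$clm.x) | is.na(a2m$clm.y), ]

# Take whichever of the two category columns is not missing.
unmatched$clm <- ifelse(is.na(unmatched$clm.x), unmatched$clm.y, unmatched$clm.x)

out <- unmatched[, c("id", "loc", "clm")]
out
```

The row names of "out" come from the merged object, not from the original data frame, so they will not match the row numbers in "a".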
Alternate methods might include paste-ing id to loc and removing
duplicates.
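A rough sketch of that paste-and-duplicated idea, for illustration only. It assumes an id/loc pair never repeats within the same category, so any repeated pair must span two categories; if that does not hold for the real data, the merge approach is safer. The names "key", "dup" and "res" are made up for the example.

```r
# Rebuild the example data; loc is written out in full (it was recycled in the post).
a <- data.frame(
  id  = c("A1", "A2", "A3", "A4", "A5", "A3", "A2", "A3", "A4", "A5"),
  loc = rep(c("B1", "B2", "B3", "B4", "B5"), 2),
  clm = c(rep("General", 6), rep("Life", 4)),
  stringsAsFactors = FALSE
)

# One key per row built from id and loc. A separator that cannot occur in
# the data keeps distinct pairs from pasting to the same string.
key <- paste(a$id, a$loc, sep = "\r")

# duplicated() skips the first occurrence of each key, so scan from both
# ends to flag every member of a repeated pair.
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rows whose id/loc pair occurs exactly once.
res <- a[!dup, ]
res
```

This is vectorized, so it avoids row-by-row looping, and it keeps the original row names and column layout.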
> i.e.,
>> categ <- table(a$id,a$clm)
>> categ
>
>      General Life
>   A1       1    0
>   A2       1    1
>   A3       2    1
>   A4       1    1
>   A5       1    1
>
> The desired output is
>
>   id loc     clm
> 1 A1  B1 General
> 6 A3  B1 General
>
> Because the data set I am working on is quite big (~ 800,000 x 20),
> with the majority of the field values being long strings, looping
> turned out to be very inefficient for comparing individual rows.
>
> Are there any efficient alternative methods for solving this
> problem?
> Steven
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT