[R] Getting a list of unique gene names from a list with semi-colons
Gabor Grothendieck
ggrothendieck at gmail.com
Sat Jan 7 03:25:24 CET 2012
On Fri, Jan 6, 2012 at 9:05 PM, Kurinji Pandiyan
<kurinji.pandiyan at gmail.com> wrote:
> Hello,
>
> I have one column in my dataframe that has gene names of interest.
> Unfortunately, due to the fact that some probes lie between two genes or
> two transcripts of a gene, it looks something like this -
>
> FAM81A LOC283050;LOC283050;LOC283050;ZMIZ1 PINK1;PINK1 MRPL12;MRPL12
> C1orf114 MMS19;UBTD1
> I would like to know how to get a list with all the names with no
> semi-colons and removing the replicates. I would like the end result to
> look like -
>
> FAM81A
> LOC283050
> ZMIZ1
> PINK1
> MRPL12
> C1orf114
> MMS19
> UBTD1
>
> Thanks a lot for your help!
> Kurinji
>
This uses strapply in gsubfn:
x <- "FAM81A LOC283050;LOC283050;LOC283050;ZMIZ1 PINK1;PINK1"
library(gsubfn)
unique(strapply(x, "\\w+", c)[[1]])
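If the names are already in a data frame column, a base R sketch
without gsubfn might look like the following. The vector genes is a
hypothetical stand-in for that column, built from the values in the
question, and it assumes each entry holds names separated only by
semicolons:

```r
# Hypothetical column values modeled on the example in the question
genes <- c("FAM81A", "LOC283050;LOC283050;LOC283050;ZMIZ1", "PINK1;PINK1",
           "MRPL12;MRPL12", "C1orf114", "MMS19;UBTD1")

# Split every entry on literal semicolons, flatten the resulting list
# into one character vector, and keep the first occurrence of each name
unique_genes <- unique(unlist(strsplit(genes, ";", fixed = TRUE)))
print(unique_genes)
```

unique() keeps names in order of first appearance, so the result
follows the order of the original column.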
If x is very long, the development version of gsubfn has strapplyc, a
high speed version of strapply specialized to the function c. This
post shows it extracting 275,000 words from a novel:
https://groups.google.com/group/corpling-with-r/msg/b85f7ff917cccb5d?dmode=source&output=gplain&noredirect&pli=1
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com