[BioC] extracting character string

Wed Jun 17 01:57:40 CEST 2009

Hi Hari, Mark,

Mark Robinson wrote:
> Hi Hari.
> 
> strsplit() will work, it's just sensitive.  For starters, you might try:
> 
>  > x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
> + "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239","mgc|BC034752:79")
>  >
>  > strsplit(x,"\\|")
> [[1]]
> [1] "ref"           "NM_004564"     "ref"           "PET112L:2131"
> [5] "mgc"           "BC130348:2158"
> 
> [[2]]
> [1] "ref"           "NM_007266"     "ref"           "XAB1:2255"
> [5] "mgc"           "BC007451:2239"
> 
> [[3]]
> [1] "mgc"         "BC034752:79"

Note that it's better here to use strsplit() with fixed=TRUE. Then there
is no need to escape the |, and in addition strsplit() will be much faster.
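
A minimal sketch of the fixed=TRUE variant, reusing the x from Mark's
example. The names fields_fixed and fields_regex are only illustrative:

```r
# Same input as in Mark's example
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79")

# fixed = TRUE treats "|" as a literal character, not as regex alternation
fields_fixed <- strsplit(x, "|", fixed = TRUE)

# The escaped-regex version should give the same result
fields_regex <- strsplit(x, "\\|")
identical(fields_fixed, fields_regex)

# Without escaping and without fixed = TRUE, "|" matches the empty string,
# so the input is split between every character instead of at the "|"
head(strsplit(x[3], "|")[[1]])
```

The last call is probably why a plain strsplit(x, "|") looks like it
"doesn't work".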

Cheers,
H.

> 
> 
> And, for extracting the first 2 columns, maybe you'll want to migrate 
> towards something like:
> 
>  > t(sapply(x, FUN=function(u) strsplit(u, "\\|")[[1]][1:2],
> +            USE.NAMES=FALSE))
>      [,1]  [,2]
> [1,] "ref" "NM_004564"
> [2,] "ref" "NM_007266"
> [3,] "mgc" "BC034752:79"
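
The same first-two-fields idea can be sketched with fixed=TRUE and
vapply(), which checks the shape of each result. This is one possible
variant, not a replacement for Mark's code, and first_two is just an
illustrative name:

```r
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79")

# Split once on the literal "|", then keep the first two fields per string.
# fields[1:2] pads with NA if a string has fewer than two fields;
# character(2) makes vapply() fail loudly if a result is not two strings.
first_two <- t(vapply(strsplit(x, "|", fixed = TRUE),
                      function(fields) fields[1:2],
                      character(2)))
first_two
```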
> 
> Hope that gets you started.
> 
> Cheers,
> Mark
> 
> 
> On 17/06/2009, at 7:54 AM, Hari Easwaran wrote:
> 
>> Hi all,
>> I am working with Agilent microarray data and trying to extract only the
>> accession numbers from the output probe annotation. Basically I have a
>> column detailing the probe as follows:
>>
>> ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158
>> ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239
>> mgc|BC034752:79
>> ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790 
>>
>> ...
>>
>> I am trying to extract only the Refseq IDs (in this case NM_004564,
>> NM_007266, NM_057094, NM_005209, NM_194302.....) and create a new column
>> with the IDs. I am not able to figure out how to do this. I tried using
>> the function 'strsplit', but it doesn't work.
>> I am a newbie to R/Bioconductor and would appreciate it if someone could
>> help.
>>
>> Thanks.
>> Hari
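
For the RefSeq IDs themselves, one possible approach is to skip the
splitting and pull out the matches directly with gregexpr() and
regmatches() (available in current versions of R). This is only a sketch:
it assumes that the IDs of interest always look like "NM_" followed by
digits, and the names probe, hits and refseq are made up for the example.

```r
# Probe annotation strings copied from the question
probe <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
           "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
           "mgc|BC034752:79",
           "ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790")

# Assumption: a RefSeq mRNA ID is "NM_" followed by one or more digits
hits <- regmatches(probe, gregexpr("NM_[0-9]+", probe))

# Collapse to one string per probe so the result fits in a single column;
# probes without any NM_ ID become NA
refseq <- vapply(hits,
                 function(ids) {
                   if (length(ids) > 0) paste(ids, collapse = ";") else NA_character_
                 },
                 character(1))
refseq
```

The refseq vector has one element per probe, so it could be added as a
new column next to the original annotation. Other RefSeq prefixes
(NR_, XM_, ...) would need a wider pattern.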
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> ------------------------------
> Mark Robinson, PhD (Melb)
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: m.robinson at garvan.org.au
> e: mrobinson at wehi.edu.au
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319