[R] How to extract a specific substring from a string (regular expressions) ? See details inside

David Winsemius dwinsemius at comcast.net
Wed Sep 16 16:40:40 CEST 2009


I was guessing that the ".1" was part of the protein code in the  
third example, and looking at:
<http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0003840.s002>
I see quite a few protein codes of that form. I am a complete  
ignoramus with regex strings, but I am guessing that the OP will need a  
"." added to the termination pattern. Experimentation shows that  
simply adding a period after the "9" inside the character class works  
for this example:

pat <- ".*(\\b[A-Z]..[0-9.]+).*"
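For anyone who wants to try this, here is a small sketch that runs both patterns on the three sample strings from the original post. Some parts are my own assumptions and not from the thread: the perl = TRUE argument (chosen only so that the greedy matching is the PCRE behaviour; the thread used the default engine), the pat_version variant, and the test string y, which is made up. The variant is there because a period inside the character class could, I think, also pick up an abbreviation such as "ALD." that ends in a period.

```r
# The three sample strings from the original post.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Jim's pattern: digits only, so a ".1" suffix is not captured.
pat_digits <- ".*(\\b[A-Z]..[0-9]+).*"
codes_digits <- sub(pat_digits, "\\1", x, perl = TRUE)

# The pattern above: a period is allowed among the trailing digits.
pat_dots <- ".*(\\b[A-Z]..[0-9.]+).*"
codes_dots <- sub(pat_dots, "\\1", x, perl = TRUE)

# Hypothetical tighter variant (not from the thread): digits, then an
# optional ".digits" suffix, so a bare trailing period cannot match.
pat_version <- ".*(\\b[A-Z]..[0-9]+(\\.[0-9]+)?).*"
codes_version <- sub(pat_version, "\\1", x, perl = TRUE)

# Made-up string (not from the OP's data) with an abbreviation that ends
# in a period after the protein code.
y <- "1100_13; CAA15575; SECRETED ALD. PROTEIN"
y_dots <- sub(pat_dots, "\\1", y, perl = TRUE)
y_version <- sub(pat_version, "\\1", y, perl = TRUE)

print(codes_digits)
print(codes_dots)
print(codes_version)
print(c(y_dots, y_version))
```

Because of the leading ".*", sub() keeps the last candidate in each string, so whether the looser or the tighter pattern is the safer choice depends on what else appears in the real data.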

-- 
David

On Sep 16, 2009, at 10:15 AM, jim holtman wrote:

> This should do it for you:
>
>> pat <- ".*(\\b[A-Z]..[0-9]+).*"
>> grep(pat, x)
> [1] 1 3 5
>> sub(pat, '\\1', x)
> [1] "YP_177963" ""          "CAA15575"  ""          "CAA17111"
>>
>
>
> On Wed, Sep 16, 2009 at 9:53 AM, Giulio Di Giovanni
> <perimessaggini at hotmail.com> wrote:
>>
>>
>>
>> Hi all,
>>
>> I have thousands of strings like these ones:
>>
>>
>>
>> "1159_1; YP_177963; PPE FAMILY PROTEIN"
>>
>> "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
>>
>> "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"
>>
>>
>>
>> and various others..
>>
>>
>>
>> I'm interested in extracting the code for the protein (in these  
>> examples: YP_177963, CAA15575, CAA17111).
>>
>> I found only one common criterion to identify the protein codes in  
>> ALL my strings:
>>
>> I need a sequence of characters selected in this way:
>>
>>
>>
>> start:
>>
>> the first alphabetic capital letter followed after three characters  
>> by a digit
>>
>>
>>
>> end:
>>
>> the last following digit before a non-digit character, or nothing.
>>
>>
>>
>> Tricky, isn't it?
>>
>> Well, I'm not an expert, and I have played a lot with regular  
>> expressions and the sub() command without much success. I also tried  
>> substring.location in the Hmisc package (but there I don't know how  
>> to use regular expressions).
>>
>> Maybe there are other, more useful functions, or maybe it is just a  
>> matter of using regular expressions in a better way...
>>
>>
>>
>> Can anybody help me?
>>
>>
>>
>> Thanks a lot in advance...
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



