[R] How to extract a specific substring from a string (regular expressions)? See details inside

Wed Sep 16 16:14:27 CEST 2009

Try this:

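library(gsubfn)
# x is your character vector; the pattern is one capital letter, any two
# characters, then a run of digits, so codes like "YP_177963" match too
strapply(x, "[A-Z].{2}[0-9]+")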
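
If you would rather avoid an extra package, here is a minimal base-R sketch of the same idea. It assumes the rule in the question means "one capital letter, then any two characters, then one or more digits", which covers YP_177963 (underscore in third position) as well as CAA15575 and CAA17111. The names x, pattern and codes are just illustrative, and the pattern may need tightening for strings not shown in the question.

```r
# Example strings copied from the question
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# One capital letter, any two characters, then one or more digits
# (an interpretation of the rule in the question, not a verified spec)
pattern <- "[A-Z].{2}[0-9]+"

# regexpr() locates the first match in each string (-1 if there is none);
# regmatches() extracts the matched text for the strings that did match
m <- regexpr(pattern, x)
codes <- rep(NA_character_, length(x))
codes[m > 0] <- regmatches(x, m)
codes
```

Strings without a match stay NA in codes, so it is easy to check afterwards which of the thousands of strings the pattern missed.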

On Wed, Sep 16, 2009 at 10:53 AM, Giulio Di Giovanni
<perimessaggini at hotmail.com> wrote:
>
>
>
> Hi all,
>
> I have thousands of strings like these ones:
>
>
>
> "1159_1; YP_177963; PPE FAMILY PROTEIN"
>
> "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
>
> "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"
>
>
>
> and various others..
>
>
>
> I'm interested in extracting the protein code (in these examples: YP_177963, CAA15575, CAA17111).
>
> I found only one common criterion to identify the protein codes in ALL my strings:
>
> I need a sequence of characters selected in this way:
>
>
>
> start:
>
> the first capital letter that is followed, three characters later, by a digit
>
>
>
> end:
>
> the last digit of that run, i.e. the digit just before the next non-digit character or the end of the string.
>
>
>
> Tricky, isn't it?
>
> Well, I'm not an expert. I experimented a lot with regular expressions and sub() without much success, and also with substring.location in the Hmisc package (but there I don't know how to use regular expressions).
>
> Maybe there are other, more useful functions, or maybe it is just a matter of using regular expressions in a better way...
>
>
>
> Can anybody help me?
>
>
>
> Thanks a lot in advance...
>
>
>

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O