[R] How to extract a specific substring from a string (regular expressions) ? See details inside
David Winsemius
dwinsemius at comcast.net
Wed Sep 16 16:52:45 CEST 2009
That did not work on all three:
> strapply(x, "[A-Z]{3}[0-9]+")
[[1]]
NULL
[[2]]
[1] "CAA15575"
[[3]]
[1] "CAA17111"
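But adding a "_" to the leading character class and a period to the
trailing one matches all three (the period also keeps the ".1" version
suffix on the third code, which the original request leaves off):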
> library(gsubfn)
> strapply(x, "[A-Z_]{3}[0-9.]+")
[[1]]
[1] "YP_177963"
[[2]]
[1] "CAA15575"
[[3]]
[1] "CAA17111.1"
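For anyone without gsubfn installed, a base R sketch along the same lines
might look like the following. The vector x is my reconstruction of the
three example strings from the original post, and the second pattern
(without the period) is an assumption about what was wanted, namely the
codes without the ".1" suffix.

```r
# Reconstruction of the three example strings from the original post
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Same pattern as above: the period in the second class keeps the ".1" suffix
with_suffix <- regmatches(x, gregexpr("[A-Z_]{3}[0-9.]+", x))

# Dropping the period stops each match at the last digit
without_suffix <- regmatches(x, gregexpr("[A-Z_]{3}[0-9]+", x))

unlist(with_suffix)
unlist(without_suffix)
```

regmatches() with gregexpr() returns a list with one element per string,
much like strapply(), so unlist() is only there to flatten it for
printing.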
Maybe between the two of you and Jim Holtman, I can eventually learn
how to use regular expressions.
--
David.
On Sep 16, 2009, at 10:14 AM, Henrique Dallazuanna wrote:
> Try this:
>
> library(gsubfn)
> strapply(x, "[A-Z]{3}[0-9]+")
>
> On Wed, Sep 16, 2009 at 10:53 AM, Giulio Di Giovanni
> <perimessaggini at hotmail.com> wrote:
>>
>> Hi all,
>>
>> I have thousands of strings like these ones:
>>
>> "1159_1; YP_177963; PPE FAMILY PROTEIN"
>>
>> "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
>>
>> "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"
>>
>> and various others..
>>
>> I'm interested in extracting the code for the protein (in this
>> example: YP_177963, CAA15575, CAA17111).
>>
>> I found only one common criterion to identify the protein codes in
>> ALL my strings:
>>
>> I need a sequence of characters selected in this way:
>>
>> start:
>> the first alphabetic capital letter followed after three characters
>> by a digit
>>
>> end:
>> the last following digit before a non-digit character, or nothing.
>>
>> Tricky, isn't it?
>>
>> Well, I'm not an expert, and I played a lot with regular
>> expressions and the sub() command with no big results. Also with
>> substring.location in the Hmisc package (but there I don't know how
>> to use regular expressions).
>>
>> Maybe there are other, more useful functions, or maybe it is just a
>> matter of using regular expressions in a better way...
>>
>> Can anybody help me?
>>
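The start/end criterion quoted above can also be written down almost
literally. This is only a sketch of my reading of it: one capital letter,
any two characters, then a run of digits, taking the first such match in
each string. The fourth string is a made-up example with no code in it,
added to show what happens when nothing matches.

```r
# Three strings from the original post plus one hypothetical string without a code
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE",
       "no code here")

# start: a capital letter followed, after three characters in total, by a digit
# end:   the last digit of that run of digits
pattern <- "[A-Z].{2}[0-9]+"

# regexpr() gives the position of the first match in each string, -1 if none
m <- regexpr(pattern, x)

# regmatches() returns only the matched pieces, so fill them in by position
codes <- rep(NA_character_, length(x))
codes[m != -1] <- regmatches(x, m)
codes
```

Because regexpr() stops at the first match, this returns one value per
string (NA where nothing matched), which may be easier to attach to a
data frame than the list that strapply() gives back.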
David Winsemius, MD
Heritage Laboratories
West Hartford, CT