[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 09:04:47 CEST 2009

Gabor Grothendieck wrote:
> On Mon, Jun 8, 2009 at 7:18 PM, Wacek Kusnierczyk
> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>   
>> Gabor Grothendieck wrote:
>>     
>>> Try this.  See ?regex for more.
>>>
>>>
>>>       
>>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
>>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
>>>>
>>>>         
>>> [1] 24
>>> attr(,"match.length")
>>> [1] 1
>>>
>>>       
>> yes, but
>>
>>    gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>>    # 2 5 9
>>     
>
> Yes, it should be:
>
>   
>> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
>>     
> [[1]]
> [1] 5 9
> attr(,"match.length")
> [1] 1 1
>
> which displays the position of every dot that is preceded
> immediately by a digit.  Or just replace gregexpr with regexpr
> if it's intended that it match only one.
>   

i guess what was needed was something like

    gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
    # intended result: 5

which won't work, however, because pcre does not support variable-width
lookbehinds.
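
to make the comparison concrete, here is a minimal sketch that runs the three patterns from this thread side by side. the helper name `dot_positions` is made up for illustration, and the exact wording of the pcre compile error depends on the pcre version r was built against, so the helper only reports that the pattern was rejected:

```r
s <- 'a. 1. a1.'

# hypothetical helper: positions of all matches, or NA if pcre rejects the pattern
dot_positions <- function(pattern, text) {
  tryCatch(
    as.integer(gregexpr(pattern, text, perl = TRUE)[[1]]),
    warning = function(w) NA_integer_,  # r warns with the pcre compile message
    error   = function(e) NA_integer_   # and then raises "invalid regular expression"
  )
}

# unbounded lookbehind: rejected at compile time
dot_positions('(?<=\\b[0-9]+)[.]', s)

# fixed-width lookbehind: compiles, but also accepts the dot in 'a1.'
dot_positions('(?<=[0-9])[.]', s)

# \K discards the digits matched so far, so only the dot is reported
dot_positions('\\b[0-9]+\\K[.]', s)
```

the `\K` form is not a lookbehind at all: the digits are consumed as part of the match and then dropped from the reported span, which is why the width restriction does not apply.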

>   
>> which, i guess, is not what you want.  if what you want is to match all
>> and only dots that follow at least one digit preceded by a word
>> boundary, then the following should do, as far as i can see:
>>
>>    gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1.', perl=TRUE)
>>    # 5
>>
>> vQ
>>
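
if the goal is to retrieve or edit the digits-plus-dot token rather than just locate the dot, the same patterns can be handed to `regmatches` or `gsub`. this is only a sketch under that assumption; the input is a shortened version of the sentence from the original question, and the variable names are mine:

```r
x <- 'This happened in the 21. century.'

# pull out every run of digits that starts at a word boundary and ends in a dot
tokens <- regmatches(x, gregexpr('\\b[0-9]+[.]', x, perl = TRUE))[[1]]

# drop only the dot that follows such a run; \K keeps the digits out of the match
cleaned <- gsub('\\b[0-9]+\\K[.]', '', x, perl = TRUE)

print(tokens)
print(cleaned)
```

the sentence-final dot is left alone in both cases because it follows a letter, not a digit.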