[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 19:27:29 CEST 2009

Wacek already mentioned that; however, it's still
arguably more complex to specify delimiters
than to specify content.  Aside from having
to specify perl = TRUE and ungreedy matching,
the content-based regexp is entirely straightforward,
but for lookbehind (including \K) one has the added
complexity of distinguishing between the matched and
the returned values.
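
To illustrate the difference, here is a minimal sketch using Greg's
test string from the quoted message below.  It assumes a version of R
recent enough to have regmatches() in base; the variable names are only
for illustration and the content-based pattern is one possible choice,
not necessarily the one used earlier in the thread.

```r
x <- "a. 1. a1. 11."

# Content-based: the pattern describes the whole piece we want
# (digits at a word boundary, followed by a dot), and what is
# matched is exactly what is returned.
content <- regmatches(x, gregexpr("\\b[0-9]+[.]", x, perl = TRUE))[[1]]
content
# [1] "1."  "11."

# Delimiter-based with \K: the digits still have to match, but
# everything before \K is dropped from the reported match, so
# only the dot comes back.
m <- gregexpr("\\b[0-9]+\\K[.]", x, perl = TRUE)
dot_pos <- as.integer(m[[1]])      # start positions of the reported matches
dots <- regmatches(x, m)[[1]]      # the reported matches themselves
dot_pos
# [1]  5 13
```

In the second case the digits take part in the match but not in the
result, which is the extra thing one has to keep in mind.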

On Tue, Jun 9, 2009 at 12:36 PM, Greg Snow<Greg.Snow at imail.org> wrote:
> You can sometimes fake variable-width lookbehinds with Perl regexes using '\K':
>
>> gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1. 11.', perl=TRUE)
> [[1]]
> [1]  5 13
> attr(,"match.length")
> [1] 1 1
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
>> Sent: Tuesday, June 09, 2009 1:05 AM
>> To: Gabor Grothendieck
>> Cc: r-help at r-project.org; Mark Heckmann
>> Subject: Re: [R] using regular expressions to retrieve a
>> digit-digit-dot structure from a string
>>
>> Gabor Grothendieck wrote:
>> > On Mon, Jun 8, 2009 at 7:18 PM, Wacek
>> > Kusnierczyk<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>> >
>> >> Gabor Grothendieck wrote:
>> >>
>> >>> Try this.  See ?regex for more.
>> >>>
>> >>>
>> >>>
>> >>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
>> >>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
>> >>>>
>> >>>>
>> >>> [1] 24
>> >>> attr(,"match.length")
>> >>> [1] 1
>> >>>
>> >>>
>> >> yes, but
>> >>
>> >>    gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>> >>    # 2 5 9
>> >>
>> >
>> > Yes, it should be:
>> >
>> >
>> >> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
>> >>
>> > [[1]]
>> > [1] 5 9
>> > attr(,"match.length")
>> > [1] 1 1
>> >
>> > which displays the position of every dot that is preceded
>> > immediately by a digit.  Or just replace gregexpr with regexpr
>> > if it's intended to match only the first one.
>> >
>>
>> i guess what was needed was something like
>>
>>     gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>>     # 5
>>
>> which won't work, however, because pcre does not support variable-width
>> lookbehinds.
>>
>> >
>> >> which, i guess, is not what you want.  if what you want is to match
>> >> all and only dots that follow at least one digit preceded by a word
>> >> boundary, then the following should do, as far as i can see:
>> >>
>> >>    gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1.', perl=TRUE)
>> >>    # 5
>> >>
>> >> vQ
>> >>