[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 20:21:45 CEST 2009

If there were a significant advantage to that perl module,
I would recommend interfacing R to it rather than
suffering with perl.

See xls2csv (and read.xls) in the gdata package
for an example of interfacing to a perl program.
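
As a rough illustration of the general idea (not of how gdata itself
does it), R can hand text to an external perl script and read back the
result.  The sketch below assumes a perl executable is on the PATH; the
function name run_perl_tokenizer and the tiny whitespace-splitting
script are made up for this example and merely stand in for a real
module.

```r
# Hypothetical wrapper, not taken from gdata: write a small perl script to
# a temporary file, feed it text on standard input, and collect its output.
run_perl_tokenizer <- function(text, perl = unname(Sys.which("perl"))) {
  if (!nzchar(perl)) stop("no perl executable found on the PATH")
  script <- tempfile(fileext = ".pl")
  on.exit(unlink(script))
  # Placeholder script: print each whitespace-separated token on its own line.
  writeLines(c(
    'while (my $line = <STDIN>) {',
    '  chomp $line;',
    '  print "$_\\n" for split /\\s+/, $line;',
    '}'
  ), script)
  # stdout = TRUE returns the script's output as a character vector of lines.
  system2(perl, shQuote(script), input = text, stdout = TRUE)
}

# Only run the demonstration when perl is actually available.
if (nzchar(Sys.which("perl"))) {
  print(run_perl_tokenizer("Das 21. Jahrhundert"))
}
```

The rest of the analysis then stays in R, with perl used only for the
one preprocessing step.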

I don't want to turn this into an R vs. perl thread, but there are
certainly many people using R for linguistics and, just as there
are books on perl and linguistics, there are books
specifically on R and linguistics.  One is mentioned on the
gsubfn site.  Also, there are many linguistics packages in
R that could be explored:

   http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

and there would be a big advantage in being able to leverage all
of R's other capabilities.

On Tue, Jun 9, 2009 at 1:42 PM, Greg Snow<Greg.Snow at imail.org> wrote:
> Yes, I already apologized to Wacek for missing that and pointing out what he had already said.
>
> Given everything in this thread (though it is hard to keep track of all of it, since my e-mail client does not keep all the parts of the thread together), this is probably one of those few tasks for which R is not the best tool.  There is a Perl module called Lingua::DE::Sentence with the description: "Perl extension for tokenizing german texts into their sentences", which seems to be exactly what the original poster was looking for.  So the best option may be to use Perl and the above module to preprocess his texts, then use R for later steps.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
>> Sent: Tuesday, June 09, 2009 11:27 AM
>> To: Greg Snow
>> Cc: Wacek Kusnierczyk; r-help at r-project.org; Mark Heckmann
>> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string
>>
>> Wacek already mentioned that; however, it's still
>> arguably more complex to specify delimiters
>> than to specify content.  Aside from having
>> to specify perl = TRUE and ungreedy matching,
>> the content-based regexp is entirely straightforward,
>> but for lookbehind (including \K) one has the added
>> complexity of distinguishing between matched and returned
>> values.
>>
>> On Tue, Jun 9, 2009 at 12:36 PM, Greg Snow<Greg.Snow at imail.org> wrote:
>> > You can sometimes fake variable-width lookbehinds with Perl regexes using '\K':
>> >
>> >> gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1. 11.', perl=TRUE)
>> > [[1]]
>> > [1]  5 13
>> > attr(,"match.length")
>> > [1] 1 1
>> >
>> >
>> > --
>> > Gregory (Greg) L. Snow Ph.D.
>> > Statistical Data Center
>> > Intermountain Healthcare
>> > greg.snow at imail.org
>> > 801.408.8111
>> >
>> >
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
>> >> Sent: Tuesday, June 09, 2009 1:05 AM
>> >> To: Gabor Grothendieck
>> >> Cc: r-help at r-project.org; Mark Heckmann
>> >> Subject: Re: [R] using regular expressions to retrieve a digit-digit-dot structure from a string
>> >>
>> >> Gabor Grothendieck wrote:
>> >> > On Mon, Jun 8, 2009 at 7:18 PM, Wacek
>> >> > Kusnierczyk<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>> >> >
>> >> >> Gabor Grothendieck wrote:
>> >> >>
>> >> >>> Try this.  See ?regex for more.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
>> >> >>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
>> >> >>>>
>> >> >>>>
>> >> >>> [1] 24
>> >> >>> attr(,"match.length")
>> >> >>> [1] 1
>> >> >>>
>> >> >>>
>> >> >> yes, but
>> >> >>
>> >> >>    gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>> >> >>    # 2 5 9
>> >> >>
>> >> >
>> >> > Yes, it should be:
>> >> >
>> >> >
>> >> >> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
>> >> >>
>> >> > [[1]]
>> >> > [1] 5 9
>> >> > attr(,"match.length")
>> >> > [1] 1 1
>> >> >
>> >> > which displays the position of every dot that is preceded
>> >> > immediately by a digit.  Or just replace gregexpr with regexpr
>> >> > if it's intended that it match only one.
>> >> >
>> >>
>> >> i guess what was needed was something like
>> >>
>> >>     gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
>> >>     # 5
>> >>
>> >> which won't work, however, because pcre does not support variable-width
>> >> lookbehinds.
>> >>
>> >> >
>> >> >> which, i guess, is not what you want.  if what you want is to match all
>> >> >> and only dots that follow at least one digit preceded by a word
>> >> >> boundary, then the following should do, as far as i can see:
>> >> >>
>> >> >>    gregexpr('\\b[0-9]+\\K[.]', 'a. 1. a1.', perl=TRUE)
>> >> >>    # 5
>> >> >>
>> >> >> vQ
>> >> >>
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>
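
For reference, the three patterns discussed in the quoted thread can be
compared side by side on Greg's test string.  This is only a sketch:
dot_positions is a helper name invented here, and the patterns are
copied from the messages above.

```r
# Hypothetical helper (not from the thread): start positions of all
# matches of a Perl-style pattern, or -1 if there is no match.
dot_positions <- function(pattern, text) {
  as.integer(gregexpr(pattern, text, perl = TRUE)[[1]])
}

text <- 'a. 1. a1. 11.'

# Negative lookahead: it inspects what follows the current position, so it
# places no constraint on what precedes the dot, and every dot matches.
lookahead <- dot_positions('(?![0-9]+)[.]', text)

# Fixed-width lookbehind: dots immediately preceded by a digit,
# which still includes the dot in "a1.".
lookbehind <- dot_positions('(?<=[0-9])[.]', text)

# \K: match a whole number that starts at a word boundary, then drop it
# from the reported match so that only the dot is returned.
reset <- dot_positions('\\b[0-9]+\\K[.]', text)

print(list(lookahead = lookahead, lookbehind = lookbehind, reset = reset))
```

Only the last pattern excludes the dot after "a1", which is the
behaviour the original poster appeared to want.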