[R] using regular expressions to retrieve a digit-digit-dot structure from a string

Tue Jun 9 17:58:13 CEST 2009

Another way to handle this is to match the contents rather than the
delimiters, using strapply in gsubfn (http://gsubfn.googlecode.com).

Below, a sentence is defined as starting with a non-space, followed by
anything, followed by an alpha character, followed by a dot, question
mark or exclamation mark.

The (?U) means that we use ungreedy matching so that the first
sentence terminator matches rather than the last.

This is an arguably simpler regexp as it does not involve lookbehind.

> txt <-  "One January 1. I saw Rick. He was born in the 19. century."

> library(gsubfn)
> strapply(txt, "(?U)([^ ].*[[:alpha:]][.?!])", c, perl = TRUE)
[[1]]
[1] "One January 1. I saw Rick."      "He was born in the 19. century."
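
For readers without gsubfn, the same match-the-contents idea can probably be
sketched in base R with gregexpr() and regmatches(). This is an assumption
rather than part of the original solution, and the variable names are
illustrative only. The greedy variant is included to show what (?U) changes:

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Same pattern as above: non-space, anything, then an alpha followed by . ? or !
pat <- "[^ ].*[[:alpha:]][.?!]"

# Ungreedy: each match stops at the first alpha + terminator it can reach
ungreedy <- regmatches(txt, gregexpr(paste0("(?U)", pat), txt, perl = TRUE))[[1]]

# Greedy: a single match runs through to the last terminator in the string
greedy <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]

print(ungreedy)
print(greedy)
```

With the ungreedy flag the text comes back as two sentences; without it the
whole string is returned as one match.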

This won't be as fast as the other solutions but typically speed is
not a real consideration unless you have huge amounts of text.  Also
note that the development version of strapply runs 5x faster than
the production version on certain sample problems.
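
For comparison, a lookbehind-based split (the kind of regexp the approach
above avoids) might look roughly like the following. It is only a sketch, not
one of the solutions actually posted in the thread: it splits on spaces that
follow an alpha plus terminator, so the terminator stays with its sentence.

```r
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Split on spaces, but only where they are preceded by a letter and . ? or !
# The lookbehind is zero-width, so the terminator is not consumed by the split.
parts <- strsplit(txt, "(?<=[[:alpha:]][.?!]) +", perl = TRUE)[[1]]

print(parts)
```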

On Tue, Jun 9, 2009 at 10:40 AM, Mark Heckmann <mark.heckmann at gmx.de> wrote:
>
> Thanks,
>
> Now it works great. I modified it a bit so that sentences are also split at
> question marks and other terminators (.?!:).
>
> strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *") [[1]]
>
> e.g.
>
>> strsplit(gsub("([[:alpha:]][\\.\\?\\!\\:])", "\\1*", txt), "\\* *") [[1]]
> [1] "One January 1. I saw Rick?"      "He was born in the 19. century."
>
> -------------------------------
>
> Mark Heckmann
> + 49 (0) 421 - 1614618
> www.markheckmann.de
> R-Blog: http://ryouready.wordpress.com
>
>
>
>
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at me.com]
> Sent: Tuesday, June 9, 2009 14:17
> To: Mark Heckmann
> Cc: r-help at r-project.org; 'Gabor Grothendieck';
> Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
> Subject: Re: AW: [R] using regular expressions to retrieve a digit-digit-dot
> structure from a string
>
> On Jun 9, 2009, at 6:44 AM, Mark Heckmann wrote:
>
>> Hey all,
>>
>> Thanks for your help. Your answers solved the problem I posted, and
>> that is just when I noticed that I had misspecified the problem ;)
>> My problem is to split German texts into sentences. Unfortunately I
>> haven't found an R package doing this kind of text separation in
>> German, so I am trying it "manually".
>>
>> Just using the dot as separator fails in cases like:
>> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>>
>> Here I want the algorithm to split the string only at the positions
>> where the dot is not preceded by a digit. The R snippets posted pick
>> out "1." and "19."
>>
>> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>>> gregexpr('(?<=[0-9])[.]',txt, perl=T)
>> [[1]]
>> [1] 14 49
>> attr(,"match.length")
>> [1] 1 1
>>
>> But I just need it the other way round. So I tried:
>>
>>> strsplit(txt, "[[:alpha:]]\\." , perl=T)
>> [[1]]
>> [1] "One January 1. I saw Ric"       " He was born in the 19. centur"
>>
>> But this erases the last letter from each sentence. Does someone know
>> a solution?
>>
>> TIA
>> Mark
>
> <snip>
>
> This is one of those rare? times where it might be nice for strsplit()
> to have an option to retain the split regex at the end of each parsed
> segment, rather than removing it.
>
> There may be a better way, but trying to both avoid a loop over vector
> indices and trying to stay with R functions that use .Internal() for
> speed, you may be able to use something like this:
>
>  > strsplit(gsub("([[:alpha:]]\\.)", "\\1*", txt), "\\* *")
> [[1]]
> [1] "One January 1. I saw Rick."      "He was born in the 19. century."
>
> What I am essentially doing is to add an "*" to the ending of each
> sentence (you can use other characters) such that strsplit() can split
> on that character without affecting the rest of the sentence.  So as
> an intermediate result, you get:
>
>  > gsub("([[:alpha:]]\\.)", "\\1*", txt)
> [1] "One January 1. I saw Rick.* He was born in the 19. century.*"
>
> which then makes the strsplit() parsing a bit easier. Since both
> strsplit() and gsub() use .Internal()s, hopefully this would still be
> reasonably fast. Note that I have strsplit() split on the "*" possibly
> followed by one or more " ", which is required for mid-line splits.
>
> HTH,
>
> Marc Schwartz
>