[R] using regular expressions to retrieve a digit-digit-dot structure from a string
William Dunlap
wdunlap at tibco.com
Tue Jun 9 17:29:55 CEST 2009
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Mark Heckmann
> Sent: Tuesday, June 09, 2009 4:45 AM
> To: r-help at r-project.org
> Cc: Waclaw.Marcin.Kusnierczyk at idi.ntnu.no; marc_schwartz at me.com
> Subject: Re: [R] using regular expressions to retrieve a
> digit-digit-dot structure from a string
>
> Hey all,
>
> Thanks for your help. Your answers solved the problem I posted, and that is
> just when I noticed that I had misspecified the problem ;)
> My problem is to split German texts into sentences. Unfortunately I
> haven't found an R package that does this kind of text separation in German,
> so I am trying it "manually".
>
> Just using the dot as the separator fails in cases like:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
>
> Here I want the algorithm to split the string only at the positions where
> the dot is not preceded by a digit. The R snippets posted pick out "1." and
> "19."
>
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
> > gregexpr('(?<=[0-9])[.]', txt, perl=TRUE)
> [[1]]
> [1] 14 49
> attr(,"match.length")
> [1] 1 1
>
> But I just need it the other way round. So I tried:
>
> > strsplit(txt, "[[:alpha:]]\\.", perl=TRUE)
> [[1]]
> [1] "One January 1. I saw Ric" " He was born in the 19. centur"
>
> But this erases the last letter from each sentence. Does someone know a
> solution?
In S+, strsplit() has an argument called subpattern that lets you
specify which parenthesized part of the regular expression
to use as the split point. It is akin to the \\<digit> used in the
replacement argument of sub and gsub. E.g., to split the string
at the sequence of spaces after a period, but not after a period preceded
by a digit, do:
> txt <- "One January 1. I saw Rick. He was born in the 19. century."
> strsplit(txt, "[^[:digit:]]\\.([[:space:]]+)", subpattern=1)
[[1]]:
[1] "One January 1. I saw Rick." "He was born in the 19. century."
subpattern=0, the default, means the text matched by the entire regular
expression. regexpr has the same argument. Would such an argument
solve your problem?
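
For readers working in R rather than S+, the following is a minimal sketch of
two ways to get the same split without a subpattern argument. Neither comes
from the posts in this thread. The first assumes a fixed-width PCRE lookbehind
is acceptable (perl = TRUE); the second assumes the "capture.start" and
"capture.length" attributes that gregexpr returns with perl = TRUE in
reasonably recent R versions, and it assumes at least one match. The variable
names are made up for illustration.

```r
# Sample text from the question above.
txt <- "One January 1. I saw Rick. He was born in the 19. century."

# Approach 1: split on whitespace only when it follows "<non-digit>.".
# The lookbehind is exactly two characters wide, so PCRE accepts it,
# and because it consumes nothing the period stays with its sentence.
sentences_lookbehind <-
  strsplit(txt, "(?<=[^[:digit:]]\\.)[[:space:]]+", perl = TRUE)[[1]]

# Approach 2: rough emulation of subpattern=1. Match the whole pattern,
# then cut the string only where the first parenthesized group matched.
# Assumes the pattern matches at least once in txt.
m <- gregexpr("[^[:digit:]]\\.([[:space:]]+)", txt, perl = TRUE)[[1]]
gap_start <- as.vector(attr(m, "capture.start")[, 1])
gap_len   <- as.vector(attr(m, "capture.length")[, 1])
sentences_capture <- substring(txt,
                               c(1, gap_start + gap_len),
                               c(gap_start - 1, nchar(txt)))

print(sentences_lookbehind)
print(sentences_capture)
```

The lookbehind version is shorter; the capture version is closer in spirit to
subpattern and would also work when the context before the split point is not
of fixed width.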
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
> TIA
> Mark
>
> -------------------------------
>
> Mark Heckmann
> + 49 (0) 421 - 1614618
> www.markheckmann.de
> R-Blog: http://ryouready.wordpress.com
>
>
>
>
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
> Sent: Tuesday, June 9, 2009 12:48
> To: Wacek Kusnierczyk
> Cc: Mark Heckmann; r-help at r-project.org
> Subject: Re: [R] using regular expressions to retrieve a
> digit-digit-dot structure from a string
>
> On Tue, Jun 9, 2009 at 3:04 AM, Wacek
> Kusnierczyk<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> > Gabor Grothendieck wrote:
> >> On Mon, Jun 8, 2009 at 7:18 PM, Wacek
> >> Kusnierczyk<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> >>
> >>> Gabor Grothendieck wrote:
> >>>
> >>>> Try this. See ?regex for more.
> >>>>
> >>>>
> >>>>
> >>>>> x <- 'This happened in the 21. century." (the dot behind 21 is'
> >>>>> regexpr("(?![0-9]+)[.]", x, perl = TRUE)
> >>>>>
> >>>>>
> >>>> [1] 24
> >>>> attr(,"match.length")
> >>>> [1] 1
> >>>>
> >>>>
> >>> yes, but
> >>>
> >>> gregexpr('(?![0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> >>> # 2 5 9
> >>>
> >>
> >> Yes, it should be:
> >>
> >>
> >>> gregexpr('(?<=[0-9])[.]', 'a. 1. a1.', perl=TRUE)
> >>>
> >> [[1]]
> >> [1] 5 9
> >> attr(,"match.length")
> >> [1] 1 1
> >>
> >> which displays the position of every dot that is immediately
> >> preceded by a digit. Or just replace gregexpr with regexpr
> >> if it is intended to match only one.
> >>
> >
> > i guess what was needed was something like
> >
> > gregexpr('(?<=\\b[0-9]+)[.]', 'a. 1. a1.', perl=TRUE)
> > # 5
> >
> > which won't work, however, because pcre does not support variable-width
> > lookbehinds.
>
> No, what I wrote was what I intended. I don't think we are discussing the
> answer at this point but just the interpretation of what was intended. You
> are including the word boundary in the question and I am not. I think it's
> also possible that regexpr is what is wanted, not gregexpr, but at this point
> I think the poster has enough answers that he can complete it himself by
> considering what he wants and using one of ours or a suitable modification.