[R] Regex to stop at first capital letter after sequence

David Winsemius dwinsemius at comcast.net
Mon Dec 19 23:01:15 CET 2016


> On Dec 19, 2016, at 1:25 PM, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
> 
> I have the following strings:
> 
> [1] "PPA 06 - Promo Vasito"      [2] "PPA 05 - Cuentos"
> [3] "PPA 04 - Promo vasito"      [4] "PPA 03 - Promoción escolar"
> [5] "PPA - Saluda a tu pediatra" [6] "PPL - Dia del Pediatra"
> 
> *Desired result*:
> 
> [1] "Promo Vasito"         "Cuentos"              "Promo vasito"
> [4] "Promoción escolar"    "Saluda a tu pediatra" "Dia del Pediatra"

All this assumes you are passing a character vector to sub. The combination of your subject line and the example is a bit underspecified. Here are two solutions: one delivers everything beginning with the last capital letter after the (last) dash, and the other delivers everything after, but not including, the last <dash><spc> sequence:

> sub("^.+[-].+(?=[A-Z])", "" , dat, perl=TRUE)  # need perl=TRUE for PCRE look-ahead
[1] "Vasito"               "Cuentos"             
[3] "Promo vasito"         "Promoción escolar"   
[5] "Saluda a tu pediatra" "Pediatra"       

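The matching above is greedy. Below it is made ungreedy with '(?U)', which delivers everything beginning with the first capital letter after the first dash: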

> sub("(?U)^.+[-].+(?=[A-Z])", "" , dat, perl=TRUE)
[1] "Promo Vasito"         "Cuentos"             
[3] "Promo vasito"         "Promoción escolar"   
[5] "Saluda a tu pediatra" "Dia del Pediatra"    
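
The vector dat is never defined in this message. Here is a minimal sketch for reproducing the two results above, assuming dat simply holds the six strings from the original post. The lazy quantifier '+?' in the last call is a standard PCRE alternative to the '(?U)' flag and should give the same result here:

```r
# Assumption: dat is a character vector of the six strings in the question.
dat <- c("PPA 06 - Promo Vasito", "PPA 05 - Cuentos",
         "PPA 04 - Promo vasito", "PPA 03 - Promoción escolar",
         "PPA - Saluda a tu pediatra", "PPL - Dia del Pediatra")

# Greedy: both .+ grab as much as possible, so the match ends just
# before the LAST capital letter that follows the dash.
greedy <- sub("^.+[-].+(?=[A-Z])", "", dat, perl = TRUE)

# Ungreedy via the (?U) flag: the match ends just before the FIRST
# capital letter found after the first dash.
ungreedy <- sub("(?U)^.+[-].+(?=[A-Z])", "", dat, perl = TRUE)

# Same idea with per-quantifier laziness (+?) instead of the global flag.
lazy <- sub("^.+?[-].+?(?=[A-Z])", "", dat, perl = TRUE)

identical(ungreedy, lazy)
```

Note that [A-Z] only covers unaccented capitals, so a name starting with an accented capital would need a wider class.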


> sub("^.+[-][ ]", "" , dat)   # character classes to define sequence.
[1] "Promo Vasito"         "Cuentos"             
[3] "Promo vasito"         "Promoción escolar"   
[5] "Saluda a tu pediatra" "Dia del Pediatra" 
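
This last pattern ignores capitals altogether, and because .+ is greedy it strips everything through the last <dash><spc> in the string. Here is a small sketch of the difference that would make if a name itself contained a dash. The string x is hypothetical and not part of the original data:

```r
# Hypothetical input with two "- " sequences (not in the original data).
x <- "PPA 07 - Promo - Vasito"

# Greedy .+ gives back only as far as the LAST "- ".
after_last <- sub("^.+[-][ ]", "", x)

# [^-]* cannot cross a dash, so this stops at the FIRST "- ".
after_first <- sub("^[^-]*[-][ ]", "", x)

print(after_last)   # [1] "Vasito"
print(after_first)  # [1] "Promo - Vasito"
```

For the six strings in the question there is only one dash each, so both variants return the same values.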
> 
> 
> *First attempt*:
> 
> After this line:
> 
> mead_nov$`Nombre del anuncio` <- gsub("(PPA.*)([A-Z].*)", "\\2",
> mead_nov$`Nombre del anuncio`)
> 
> I get these:
> 
> [1] "Vasito"                 [2] "Cuentos"
> [3] "Promo vasito"           [4] "Promoción escolar"
> [5] "Saluda a tu pediatra"   [6] "PPL - Dia del Pediatra"
> 
> 
> *Second attempt*:
> 
> mead_nov$`Nombre del anuncio` <- gsub("(PPA|PPL.*)([A-Z].*)", "\\2",
> mead_nov$`Nombre del anuncio`)
> 
> I get this:
> 
> [1] "PPA 06 - Promo Vasito"     [2] "PPA 05 - Cuentos"
> [3] "PPA 04 - Promo vasito"      [4] "PPA 03 - Promoción escolar"
> [5] "PPA - Saluda a tu pediatra" [6] "Pediatra"
> 
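
A note on why the second attempt leaves the PPA lines untouched: alternation binds more loosely than concatenation, so "(PPA|PPL.*)" means either "PPA" alone or "PPL.*", and a bare "PPA" is never immediately followed by a capital letter in these strings. Here is a sketch of one possible repair, assuming three of the strings from the question (the names x, second and fixed are only for illustration). It groups the prefixes first, then uses a lazy ".*?" so that the capture starts at the first capital after the prefix:

```r
# Three of the strings from the question, enough to show both behaviours.
x <- c("PPA 06 - Promo Vasito",
       "PPA - Saluda a tu pediatra",
       "PPL - Dia del Pediatra")

# Second attempt as posted: "PPA" must be followed directly by [A-Z],
# which never happens, so only the PPL string is changed.
second <- gsub("(PPA|PPL.*)([A-Z].*)", "\\2", x)

# One possible repair: (?:PPA|PPL) groups the alternatives, .*? is lazy,
# so the capture begins at the first capital letter after the prefix.
fixed <- sub("^(?:PPA|PPL).*?([A-Z].*)$", "\\1", x, perl = TRUE)

print(second)
print(fixed)
```

This relies on the descriptions starting with a capital letter. The <dash><spc> pattern shown earlier does not need that assumption.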
> 
> Thank you for your help.

David Winsemius
Alameda, CA, USA


