[Rd] \U or \L perl regex in gsub removes text outside capturing group in UTF-8 contexts
Hugh Parsonage
hugh.parsonage at gmail.com
Mon Jun 19 13:50:52 CEST 2017
I write to clarify the status of \U and \L when used in the replacement
argument to gsub in R-devel (which reports itself as R 3.5.0). The behaviour
of gsub appears to have changed since R 3.4.0, but the documentation for the
replacement argument has not.
## Reprex (A call to readLines is essential. A url is provided for
convenience but the behaviour should reproduce for local files)
bib <- readLines(
  "https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib",
  encoding = "UTF-8", n = 10)
bib8910 <- bib[8:10]
gsub("(\\w+)", "\\U\\1", bib8910, perl = TRUE)
#> [1] "@TECHREPORT" " AUTHOR" " TITLE"
Expected result (in R 3.4.0):
#> [1] "@TECHREPORT{WOODHUNTEROTOOLEETAL2012,"
#> [2] " AUTHOR = {TONY WOOD AND AMÉLIE HUNTER AND MICHAEL O'TOOLE AND PRASANA VENKATARAMAN AND LUCY CARTER},"
#> [3] " TITLE = {PUTTING THE CUSTOMER BACK IN FRONT: HOW TO MAKE ELECTRICITY CHEAPER},"
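For anyone without network access, a minimal sketch of a local check follows. The input string is a hypothetical stand-in for one line of the .bib file, not the file itself, and since the report above says a call to readLines is essential, this may well not trigger the truncation. It only tests the property at stake: text outside the capturing group should survive the substitution.

```r
# Hypothetical stand-in for one line of the .bib file. The "\u00e9" escape
# makes R mark the string as UTF-8, roughly as
# readLines(..., encoding = "UTF-8") does for non-ASCII lines.
x <- "  author = {am\u00e9lie hunter},"
Encoding(x)

out <- gsub("(\\w+)", "\\U\\1", x, perl = TRUE)

# Only the matched words should change case; nothing should be dropped.
same_length   <- nchar(out) == nchar(x)
tail_survives <- grepl("HUNTER},", out, fixed = TRUE)
c(same_length = same_length, tail_survives = tail_survives)
```

If either value is FALSE, text after a match has been lost, which is the symptom described above.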
## Likely point of breaking change
I was alerted on June 13 by Kurt Hornik that my package (TeXCheckR), which
had previously been accepted on CRAN, was ERRORing, as a unit test relies
on \L.
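Until the status of \U and \L is settled, one possible workaround is to avoid them and change the case of each match with toupper or tolower instead. The sketch below assumes a helper named recase_matches, which is hypothetical and not part of TeXCheckR or base R; it relies only on gregexpr and regmatches.

```r
# Hypothetical helper (not part of TeXCheckR or base R): apply a case
# function to every match of `pattern`, leaving all other text untouched.
recase_matches <- function(x, pattern, f = toupper) {
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), f)
  x
}

recase_matches("  title = {putting the customer back},", "\\w+")
#> [1] "  TITLE = {PUTTING THE CUSTOMER BACK},"

recase_matches("@TechReport{Wood2012,", "\\w+", tolower)
#> [1] "@techreport{wood2012,"
```

This sidesteps the replacement-string escapes entirely, at the cost of a second pass over the matches.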
## sessionInfo()
R Under development (unstable) (2017-06-19 r72808)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0
Many thanks,
Hugh Parsonage
Associate, Grattan Institute, Melbourne, AU