[Rd] readLines interaction with gsub different in R-dev
Hugh Parsonage
hugh.parsonage at gmail.com
Sat Feb 17 16:35:59 CET 2018
| Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
| you use wrong, ie isn't R-devel giving the correct answer?
No, I don't think R-devel is correct (or at least consistent with the
documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
perl = TRUE) is "Take every word character and replace it with itself,
converted to uppercase."
Perhaps my example was too minimal. Consider the following:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
[1] "A"
R> gsub("(\\w)", "\\1", entry, perl = TRUE)
[1] "author: Amélie" # OK, but very different from 'A', even though the
                     # only change is dropping the uppercase conversion
R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
[1] "AUTHOR: AMELIE" # OK, and again very different from 'A'
R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
[1] "AUTHOR" # Where did everything after the first group go?
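For comparison, here is a minimal sketch of the same call on an ASCII-only control string. The unaccented input `ascii_entry` is hypothetical and is not read from a file; going by the documentation quoted in this thread, "\U" should uppercase the first group, "\E" should end the conversion, and the text after the match should be left in place.

```r
# Control case: same pattern and replacement, but on an ASCII-only string.
# `ascii_entry` is a hypothetical stand-in for `entry` without the accent.
ascii_entry <- "author: Amelie"

# \\U uppercases group 1, \\E stops the conversion before ": " and group 2.
res <- gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", ascii_entry, perl = TRUE)
print(res)
```

If the accented `entry` loses everything after the first group while this control keeps it, that points at the handling of the non-ASCII character rather than at the pattern.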
I should note the following example too:
R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
[1] "AUTHOR: AMéLIE" # latin1 encoding
A call to `readLines` (and possibly `scan()`, `read.table()` and friends)
is essential to reproduce the problem.
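A rough diagnostic sketch for anyone trying to narrow this down: it builds the same string once via `readLines` and once as a literal, then compares the declared encoding, UTF-8 validity, and raw bytes before running the substitution on both. The names `from_file` and `literal` are mine, and the `\u00e9` escape is used only so the snippet does not depend on how the script file itself is encoded. No particular result on R-devel is claimed here.

```r
# Build the string two ways: read back from a UTF-8 file, and as a literal.
tempf <- tempfile()
writeLines(enc2utf8("author: Am\u00e9lie"), con = tempf, useBytes = TRUE)
from_file <- readLines(tempf, encoding = "UTF-8")
literal <- "author: Am\u00e9lie"  # \u00e9 is e-acute

# Compare the declared encoding, UTF-8 validity and raw bytes.
print(Encoding(c(from_file, literal)))
print(validUTF8(c(from_file, literal)))
print(identical(charToRaw(from_file), charToRaw(literal)))

# Run the same substitution on both inputs and show the results side by side.
res_file <- gsub("(\\w)", "\\U\\1", from_file, perl = TRUE)
res_literal <- gsub("(\\w)", "\\U\\1", literal, perl = TRUE)
print(c(res_file, res_literal))

unlink(tempf)
```

If the bytes and encoding marks are identical but the two `gsub` results differ, the input route itself matters; if both results are truncated, the non-ASCII character alone is enough to trigger the behaviour.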
On 18 February 2018 at 02:15, Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 17 February 2018 at 21:10, Hugh Parsonage wrote:
> | I was told to re-raise this issue with R-dev:
> |
> | In the documentation of R-dev and R-3.4.3, under ?gsub
> |
> | > replacement
> | > ... For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.
> |
> | However, the following code runs differently:
> |
> | tempf <- tempfile()
> | writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
> | entry <- readLines(tempf, encoding = "UTF-8")
> | gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
> |
> |
> | "AUTHOR: AMÉLIE" # R-3.4.3
> |
> | "A" # R-dev
>
> Confirmed for R-devel (current) on Ubuntu 17.10. But ... isn't the regexp
> you use wrong, ie isn't R-devel giving the correct answer?
>
> R> tempf <- tempfile()
> R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
> R> entry <- readLines(tempf, encoding = "UTF-8")
> R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
> [1] "A"
> R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
> [1] "AUTHOR"
> R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
> [1] "AUTHOR: AMÉLIE"
> R>
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org