[Rd] readLines interaction with gsub different in R-dev

Tomas Kalibera tomas.kalibera at gmail.com
Mon Feb 19 15:58:30 CET 2018


Thank you for the report and analysis. Now fixed in R-devel.
Tomas
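
To check the fix on a local build, the quoted examples below can be condensed into a short script. This is only a sketch: the variable names (amelie, from_file, upper) are made up for illustration, and the test string is built from raw bytes (the same bytes as in Bill's example) so that the script itself stays pure ASCII. It assumes nothing beyond base R.

```r
# "Amélie" as UTF-8 bytes, taken from the raw vector quoted below
amelie <- rawToChar(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)))
Encoding(amelie) <- "UTF-8"
entry <- paste0("author: ", amelie)

# Round-trip through a file, as in the original report
tempf <- tempfile()
writeLines(entry, con = tempf, useBytes = TRUE)
from_file <- readLines(tempf, encoding = "UTF-8")
unlink(tempf)

# Uppercase every word character via the Perl-style \U escape
upper <- gsub("(\\w)", "\\U\\1", from_file, perl = TRUE)

# With the bug present the result was truncated to "A"; a working build
# should return a string with as many characters as the input.
stopifnot(nchar(upper, type = "chars") == nchar(from_file, type = "chars"))
stopifnot(startsWith(upper, "AUTHOR: AM"))
```

The checks deliberately avoid comparing against the accented capital, since the case conversion of non-ASCII characters may depend on the locale.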

On 02/17/2018 08:24 PM, William Dunlap via R-devel wrote:
> I think the problem in R-devel happens when there are non-ASCII characters
> in any of the strings passed to gsub.
>
> txt <- vapply(list(as.raw(c(0x41, 0x6d, 0xc3, 0xa9, 0x6c, 0x69, 0x65)),
>                    as.raw(c(0x41, 0x6d, 0x65, 0x6c, 0x69, 0x61))),
>               rawToChar, "")
> txt
> #[1] "Amélie" "Amelia"
> Encoding(txt)
> #[1] "unknown" "unknown"
> gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt)
> #[1] "<a" "<a"
> gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[1])
> #[1] "<a"
> gsub(perl=TRUE, "(\\w)(\\w)", "<\\L\\1\\U\\2>", txt[2])
> #[1] "<aM><eL><iA>"
>
> I can change the Encoding to "latin1" or "UTF-8" and get similar results
> from gsub.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Feb 17, 2018 at 7:35 AM, Hugh Parsonage <hugh.parsonage at gmail.com>
> wrote:
>
>> | Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
>> | you use wrong, ie isn't R-devel giving the correct answer?
>>
>> No, I don't think R-devel is correct (or at least consistent with the
>> documentation). My interpretation of gsub("(\\w)", "\\U\\1", entry,
>> perl = TRUE) is "Take every word character and replace it with itself,
>> converted to uppercase."
>>
>> Perhaps my example was too minimal. Consider the following:
>>
>> R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
>> [1] "A"
>>
>> R> gsub("(\\w)", "\\1", entry, perl = TRUE)
>> [1] "author: Amélie"   # OK, but very different to 'A', despite only
>>                        # not specifying uppercase
>>
>> R> gsub("(\\w)", "\\U\\1", "author: Amelie", perl = TRUE)
>> [1] "AUTHOR: AMELIE"  # OK, but very different to 'A'
>>
>> R> gsub("^(\\w+?): (\\w)", "\\U\\1\\E: \\2", entry, perl = TRUE)
>> [1] "AUTHOR"  # Where did everything after the first group go?
>>
>> I should note the following example too:
>> R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE, useBytes = TRUE)
>> [1] "AUTHOR: AMéLIE"  # latin1 encoding
>>
>>
>> A call to `readLines` (or possibly `scan()`, `read.table` and friends)
>> is essential to reproduce the problem.
>>
>>
>>
>>
>> On 18 February 2018 at 02:15, Dirk Eddelbuettel <edd at debian.org> wrote:
>>> On 17 February 2018 at 21:10, Hugh Parsonage wrote:
>>> | I was told to re-raise this issue with R-dev:
>>> |
>>> | In the documentation of R-dev and R-3.4.3, under ?gsub
>>> |
>>> | > replacement
>>> | >    ... For perl = TRUE only, it can also contain "\U" or "\L" to
>>> | >    convert the rest of the replacement to upper or lower case and "\E" to end
>>> | >    case conversion.
>>> |
>>> | However, the following code runs differently:
>>> |
>>> | tempf <- tempfile()
>>> | writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
>>> | entry <- readLines(tempf, encoding = "UTF-8")
>>> | gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
>>> |
>>> |
>>> | "AUTHOR: AMÉLIE"  # R-3.4.3
>>> |
>>> | "A"                              # R-dev
>>>
>>> Confirmed for R-devel (current) on Ubuntu 17.10.  But ... isn't the regexp
>>> you use wrong, ie isn't R-devel giving the correct answer?
>>>
>>> R> tempf <- tempfile()
>>> R> writeLines(enc2utf8("author: Amélie"), con = tempf, useBytes = TRUE)
>>> R> entry <- readLines(tempf, encoding = "UTF-8")
>>> R> gsub("(\\w)", "\\U\\1", entry, perl = TRUE)
>>> [1] "A"
>>> R> gsub("(\\w+)", "\\U\\1", entry, perl = TRUE)
>>> [1] "AUTHOR"
>>> R> gsub("(.*)", "\\U\\1", entry, perl = TRUE)
>>> [1] "AUTHOR: AMÉLIE"
>>> R>
>>>
>>> Dirk
>>>
>>> --
>>> http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>


