[Rd] writeLines argument useBytes = TRUE still making conversions

Thu Feb 15 18:16:59 CET 2018

On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinushey at gmail.com> wrote:
> I suspect your UTF-8 string is being stripped of its encoding before
> write, and so assumed to be in the system native encoding, and then
> re-encoded as UTF-8 when written to the file. You can see something
> similar with:
>
>     > tmp <- 'é'
>     > tmp <- iconv(tmp, to = 'UTF-8')
>     > Encoding(tmp) <- "unknown"
>     > charToRaw(iconv(tmp, to = "UTF-8"))
>     [1] c3 83 c2 a9
>
> It's worth saying that:
>
>     file(..., encoding = "UTF-8")
>
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
>
>     file(..., encoding = "native.enc")
>
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".

If all that is true I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.
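
For concreteness, here is a minimal sketch of what that reading would
imply, assuming it is correct: leave the connection at its default
encoding ("native.enc") and pass useBytes = TRUE, so that neither
writeLines() nor the connection translates anything. The names path,
con, bytes and double_encoded are only illustrative, and the latin1
step merely imitates a Latin-1 native encoding so that the result does
not depend on the locale.

```r
# e-acute as a UTF-8 string; enc2utf8() makes sure the bytes are UTF-8.
tmp <- enc2utf8("\u00e9")
print(charToRaw(tmp))                 # c3 a9

path <- tempfile()

# No 'encoding' argument, so the connection keeps the default "native.enc"
# and does not re-encode. Binary mode gives the same "\n" line ending on
# every platform.
con <- file(path, open = "wb")

# useBytes = TRUE: writeLines() hands the bytes over without translation.
writeLines(tmp, con = con, useBytes = TRUE)
close(con)

bytes <- readBin(path, what = "raw", n = 16L)
print(bytes)                          # c3 a9 0a

# For contrast: re-encoding bytes that are already UTF-8 as if they were
# Latin-1 reproduces the four-byte sequence from the original report.
double_encoded <- iconv(tmp, from = "latin1", to = "UTF-8")
print(charToRaw(double_encoded))      # c3 83 c2 a9
```

If this is the intended behaviour, the file ends up with c3 a9 plus a
newline, and the c3 83 c2 a9 sequence only appears once some step treats
the UTF-8 bytes as native-encoded text and converts them again.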

Best,
Ista

>
> Also note that writeLines(..., useBytes = FALSE) will explicitly
> translate to the current encoding before sending bytes to the
> requested connection. In other words, there are two locations where
> translation might occur in your example:
>
>    1) In the call to writeLines(),
>    2) When characters are passed to the connection.
>
> In your case, it sounds like translation should be suppressed at both steps.
>
> I think this is documented correctly in ?writeLines (and also the
> Encoding section of ?file), but the behavior may feel unfamiliar at
> first glance.
>
> Kevin
>
> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <davorj at live.com> wrote:
>>
>> I think this behavior is inconsistent with the documentation:
>>
>>   tmp <- 'é'
>>   tmp <- iconv(tmp, to = 'UTF-8')
>>   print(Encoding(tmp))
>>   print(charToRaw(tmp))
>>   tmpfilepath <- tempfile()
>>   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>
>> [1] "UTF-8"
>> [1] c3 a9
>>
>> Raw contents of the written file as hex: c3 83 c2 a9
>>
>> If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.
>>
>> Any thoughts? This behavior is related to this issue: https://github.com/yihui/knitr/issues/1509
>>
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel