[Rd] writeLines argument useBytes = TRUE still making conversions
Ista Zahn
istazahn at gmail.com
Thu Feb 15 18:16:59 CET 2018
On Thu, Feb 15, 2018 at 11:19 AM, Kevin Ushey <kevinushey at gmail.com> wrote:
> I suspect your UTF-8 string is being stripped of its encoding before
> write, and so assumed to be in the system native encoding, and then
> re-encoded as UTF-8 when written to the file. You can see something
> similar with:
>
> > tmp <- 'é'
> > tmp <- iconv(tmp, to = 'UTF-8')
> > Encoding(tmp) <- "unknown"
> > charToRaw(iconv(tmp, to = "UTF-8"))
> [1] c3 83 c2 a9
>
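The double encoding above depends on the session's native encoding, so it will not reproduce in a UTF-8 locale. The following is a minimal sketch that makes the same effect deterministic by naming the source encoding explicitly. Using "latin1" as a stand-in for the native encoding is an assumption for illustration, not something stated in the thread.

```r
# The two bytes that encode "é" in UTF-8.
utf8_bytes <- as.raw(c(0xc3, 0xa9))

# rawToChar() returns a string with no encoding mark ("unknown").
misread <- rawToChar(utf8_bytes)

# Treat those bytes as latin1 (assumed stand-in for the native encoding)
# and convert to UTF-8: each byte becomes its own two-byte sequence.
double_encoded <- iconv(misread, from = "latin1", to = "UTF-8")

print(charToRaw(double_encoded))
# [1] c3 83 c2 a9
```

The result matches the bytes reported in the original post, which is consistent with the string being treated as native-encoded somewhere along the way.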
> It's worth saying that:
>
> file(..., encoding = "UTF-8")
>
> means "attempt to re-encode strings as UTF-8 when writing to this
> file". However, if you already know your text is UTF-8, then you
> likely want to avoid opening a connection that might attempt to
> re-encode the input. Conversely (assuming I'm understanding the
> documentation correctly)
>
> file(..., encoding = "native.enc")
>
> means "assume that strings are in the native encoding, and hence
> translation is unnecessary". Note that it does not mean "attempt to
> translate strings to the native encoding".
If all that is true, I think ?file needs some attention. I've read it
several times now and I just don't see how it can be interpreted as
you've described it.
Best,
Ista
>
> Also note that writeLines(..., useBytes = FALSE) will explicitly
> translate to the current encoding before sending bytes to the
> requested connection. In other words, there are two locations where
> translation might occur in your example:
>
> 1) In the call to writeLines(),
> 2) When characters are passed to the connection.
>
> In your case, it sounds like translation should be suppressed at both steps.
>
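Suppressing translation at both steps might look like the sketch below. It assumes the text is already UTF-8, leaves the connection's encoding argument at its default so the connection does not re-encode, and opens the file in binary mode so that line endings are not altered. The variable names are illustrative only.

```r
x <- enc2utf8("\u00e9")          # "é", stored as the UTF-8 bytes c3 a9
path <- tempfile()

# Step 2 (connection): no encoding argument, so no re-encoding on write.
con <- file(path, open = "wb")

# Step 1 (writeLines): useBytes = TRUE passes the bytes through untranslated.
writeLines(x, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes to check what was actually written.
written <- readBin(path, what = "raw", n = file.size(path))
print(written)
# [1] c3 a9 0a
```

Opening and closing the connection explicitly also avoids leaving an unclosed connection behind, which happens when file() is called inline inside writeLines().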
> I think this is documented correctly in ?writeLines (and also the
> Encoding section of ?file), but the behavior may feel unfamiliar at
> first glance.
>
> Kevin
>
> On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <davorj at live.com> wrote:
>>
>> I think this behavior is inconsistent with the documentation:
>>
>> tmp <- 'é'
>> tmp <- iconv(tmp, to = 'UTF-8')
>> print(Encoding(tmp))
>> print(charToRaw(tmp))
>> tmpfilepath <- tempfile()
>> writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>>
>> [1] "UTF-8"
>> [1] c3 a9
>>
>> Raw contents of the written file, as hex: c3 83 c2 a9
>>
>> If I switch to useBytes = FALSE, then the variable is written correctly as c3 a9.
>>
>> Any thoughts? This behavior is related to this issue: https://github.com/yihui/knitr/issues/1509
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel