[Rd] writeLines argument useBytes = TRUE still making conversions

Kevin Ushey kevinushey at gmail.com
Thu Feb 15 17:19:32 CET 2018


I suspect your UTF-8 string is being stripped of its encoding mark before
it is written, so its bytes are assumed to be in the system native
encoding and are then re-encoded as UTF-8 when written to the file. On a
system whose native encoding is Latin-1 (or Windows-1252), you can see
something similar with:

    > tmp <- 'é'
    > tmp <- iconv(tmp, to = 'UTF-8')
    > Encoding(tmp) <- "unknown"
    > charToRaw(iconv(tmp, to = "UTF-8"))
    [1] c3 83 c2 a9
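
The output above depends on the native encoding of the session. As a rough, locale-independent sketch of the same double encoding, the source encoding can be named explicitly as "latin1" instead of relying on the native one (the variable names here are only illustrative):

```r
# The two bytes of 'é' in UTF-8, held in a string with no encoding mark.
utf8_bytes <- as.raw(c(0xc3, 0xa9))
unmarked <- rawToChar(utf8_bytes)

# Misinterpret those bytes as Latin-1 and convert to UTF-8. Each byte is
# treated as its own character, so the result is the double-encoded form.
double_encoded <- iconv(unmarked, from = "latin1", to = "UTF-8")

print(charToRaw(double_encoded))
# [1] c3 83 c2 a9
```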

It's worth saying that:

    file(..., encoding = "UTF-8")

means "attempt to re-encode strings as UTF-8 when writing to this
file". However, if you already know your text is UTF-8, then you
likely want to avoid opening a connection that might attempt to
re-encode the input. Conversely (assuming I'm understanding the
documentation correctly)

    file(..., encoding = "native.enc")

means "assume that strings are in the native encoding, and hence
translation is unnecessary". Note that it does not mean "attempt to
translate strings to the native encoding".

Also note that writeLines(..., useBytes = FALSE) will explicitly
translate to the native encoding before sending bytes to the
requested connection. In other words, there are two places where
translation might occur in your example:

   1) In the call to writeLines(),
   2) When characters are passed to the connection.

In your case, it sounds like translation should be suppressed at both steps.
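
A minimal sketch of suppressing both steps, assuming the text is already UTF-8: keep the connection at encoding = "native.enc" so it does not re-encode, and pass useBytes = TRUE so writeLines() does not translate either. Opening in binary mode ("wb") is my own choice here, made so the line ending is a plain LF on every platform; the file is then read back as raw bytes to check what was written.

```r
# A UTF-8 string; the \u escape avoids depending on the script's encoding.
tmp <- enc2utf8("\u00e9")

path <- tempfile()

# "native.enc" asks the connection not to re-encode anything.
con <- file(path, open = "wb", encoding = "native.enc")

# useBytes = TRUE passes the string's bytes to the connection untranslated.
writeLines(tmp, con = con, useBytes = TRUE)
close(con)

# Inspect the bytes on disk: the two UTF-8 bytes plus the newline.
written <- readBin(path, what = "raw", n = file.size(path))
print(written)
# [1] c3 a9 0a
```

Opening and closing the connection explicitly also avoids leaving an unused connection behind, which can happen when file(...) is created inline in the writeLines() call.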

I think this is documented correctly in ?writeLines (and also the
Encoding section of ?file), but the behavior may feel unfamiliar at
first glance.

Kevin

On Wed, Feb 14, 2018 at 11:36 PM, Davor Josipovic <davorj at live.com> wrote:
>
> I think this behavior is inconsistent with the documentation:
>
>   tmp <- 'é'
>   tmp <- iconv(tmp, to = 'UTF-8')
>   print(Encoding(tmp))
>   print(charToRaw(tmp))
>   tmpfilepath <- tempfile()
>   writeLines(tmp, con = file(tmpfilepath, encoding = 'UTF-8'), useBytes = TRUE)
>
> [1] "UTF-8"
> [1] c3 a9
>
> Raw text as hex: c3 83 c2 a9
>
> If I switch to useBytes = FALSE, then the variable is written correctly as  c3 a9.
>
> Any thoughts? This behavior is related to this issue: https://github.com/yihui/knitr/issues/1509


