[R] [Rd] source(echo = TRUE) with a iso-8859-1 encoded file gives an error

Ista Zahn istazahn at gmail.com
Sat May 5 00:58:26 CEST 2018


On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak <skostyshak at ufl.edu> wrote:
> I have very little knowledge about file encodings and would like to
> learn more.
>
> I've read the following pages to learn more:
>
>   http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
>   https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
>   https://developer.r-project.org/Encodings_and_R.html
>
> The last one, in particular, has been very helpful. I would be
> interested in any further references that you suggest.
>
> I attach a file that reproduces the issue I would like to learn more
> about. I do not know if the file encoding will be correctly preserved
> through email, so I also provide the file (temporarily) on Dropbox here:
>
>   https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0
>
> The file gives an error when using "source()" with the
> argument echo = TRUE:
>
>   > source("encoding_export_issue.R", echo = TRUE)
>   Error in nchar(dep, "c") : invalid multibyte string, element 1
>   In addition: Warning message:
>   In grepl("^[[:blank:]]*$", dep[1L]) :
>     input string 1 is invalid in this locale
>
> The problem comes from the "á" character in the .R file. The file
> appears to be encoded as "iso-8859-1":
>
>   $ file --mime-encoding encoding_export_issue.R
>   encoding_export_issue.R: iso-8859-1
>
> Note that for me:
>
>   > getOption("encoding")
>   [1] "native.enc"
>
> so "native.enc" is used for the "encoding" argument of source().
>
> The following two calls succeed:
>
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
>   > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")
>
> Is this file a valid "iso-8859-1" encoded file?

The one you attached is not. The one linked to in dropbox is.

 Why does source() fail
> in the case of encoding set to "native.enc"? Is it because of the
> settings to UTF-8 in my locale (see info on my system at the bottom of
> this email).

Yes.

>
> I'm guessing it would be a bad idea to put
>
>   options(encoding = "unknown")
>
> in my .Rprofile, because it is difficult to always correctly guess the
> encoding of files?

My guess is that the issue is less about the difficulty of guessing
the encoding, and more about the time it takes to do so. That's not
particularly relevant for the "source" function, but the encoding
option is used by many of the file IO functions in R and so has
implications well beyond the behavior of "source".

 Is there a reason why setting it to "unknown" would
> lead to more problems than leaving it set to "native.enc"?

It depends on what you are actually doing. If you are on a UTF-8
locale and working exclusively with UTF-8 files, setting
options(encoding = "unknown") will just slow down your file IO by
checking for the encoding every time.
>
> I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
> is my session info and locale info for my system with the 3.4.3 version:
>
>> sessionInfo()
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 16.04.3 LTS
>
> Matrix products: default
> BLAS: /usr/lib/libblas/libblas.so.3.6.0
> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.3
>
>> Sys.getlocale()
> [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
>
> Thanks for your time,
>
> Scott
>
> P.S. Note that I had posted this question to r-devel, which was the
> incorrect choice. For archival purposes, I reference the thread here:
>
> https://www.mail-archive.com/search?l=mid&q=20180501185750.445oub53vcdnyyyx%40steph
>
>
> --
> Scott Kostyshak
> Assistant Professor of Economics
> University of Florida
> https://people.clas.ufl.edu/skostyshak/
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list