[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?

Thu Dec 11 18:59:46 CET 2014

SUGGESTION:
Would it make sense for install.packages() and friends to always use an
"ascii"(*) encoding when parse():ing R package source code files?

I believe this should be safe, because R code files should be in ASCII
[http://en.wikipedia.org/wiki/ASCII] and only in source-code comments
you may use other characters.  This is from Section 'Package
subdirectories' in 'Writing R Extensions':

"Only ASCII characters (and the control characters tab, formfeed, LF
and CR) should be used in code files. Other characters are accepted in
comments, but then the comments may not be readable in e.g. a UTF-8
locale. Non-ASCII characters in object names will normally fail when
the package is installed. Any byte will be allowed in a quoted
character string but \uxxxx escapes should be used for non-ASCII
characters. However, non-ASCII character strings may not be usable in
some locales and may display incorrectly in others."

Since comments are dropped by parse(), their actual content does not
matter, and the rest of the code should be in ASCII.
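
As a minimal sketch of this point (the sample source lines are made up
for illustration), parse() returns only the expressions, so comment text
never reaches the result, and a non-ASCII character inside a string can
be written with a \uxxxx escape so that the file itself stays pure
ASCII:

```r
# Hypothetical source lines: two comments and one string with a \u escape.
src <- c(
  "# a leading comment",
  "x <- 1  # a trailing comment",
  "s <- \"\\u00e9tudiant\""
)

# The source text itself contains only ASCII code points.
is_ascii <- all(utf8ToInt(paste(src, collapse = "\n")) < 128L)

# parse() returns the expressions only; comments are not part of them.
exprs <- parse(text = src, keep.source = FALSE)
n_exprs <- length(exprs)      # two assignments, no comments
first <- deparse(exprs[[1]])  # "x <- 1"
```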

(*) It could be that the specific encoding name "ascii" is not
available on all platforms. If so, is there another way to specify a
pure ASCII encoding?
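
One way to look into this on a given machine, as a rough sketch only, is
to ask iconvlist() which encoding names the local iconv implementation
reports. The candidate names below are common aliases for ASCII; which
of them a platform lists, if any, is an assumption that has to be
checked there:

```r
# Encoding names reported by this platform's iconv (may differ by OS).
known <- toupper(iconvlist())

# Common aliases for ASCII; not every platform lists all of them.
candidates <- c("ASCII", "US-ASCII", "ANSI_X3.4-1968")
available <- candidates[candidates %in% known]
```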

BACKGROUND:
If a user/system sets the 'encoding' option at startup, it may break
package installations from source if the package has source code
comments with non-ASCII characters.  For example,

$ mkdir foo; cd foo
$ echo "options(encoding='UTF-8')" > .Rprofile
$ R --vanilla
> install.packages("R.oo", type="source")
Installing package into 'C:/Users/hb/R/win-library/3.2'
(as 'lib' is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.at.r-project.org/src/contrib/R.oo_1.18.0.tar.gz'
Content type 'application/x-gzip' length 394545 bytes (385 KB)
opened URL
downloaded 385 KB

* installing *source* package 'R.oo' ...
** package 'R.oo' successfully unpacked and MD5 sums checked
** R
Warning in parse(outFile) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
** inst
** preparing package for lazy loading
Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
** help
[...]

(This can be extremely time-consuming to troubleshoot, particularly
when it is reported to a package maintainer who does not have access to
the original system.)

FYI, setting it only in the session is alright:

> options(encoding="UTF-8")
> install.packages("R.oo", type="source")

because install.packages() launches a separate R process for the
installation, and it is only in that process that the startup code
becomes an issue.
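
The difference between the two processes can be sketched as follows.
This is an illustration only: the scratch directory is made up, and
whether the child process picks up the .Rprofile depends on the
environment (for example, R_PROFILE_USER must not point elsewhere):

```r
# Create a scratch directory holding a startup file like the one above.
proj <- tempfile("foo")
dir.create(proj)
writeLines("options(encoding = 'UTF-8')", file.path(proj, ".Rprofile"))

# Start a fresh R process from that directory and ask for its 'encoding'.
old_wd <- setwd(proj)
child_enc <- system2(
  file.path(R.home("bin"), "Rscript"),
  c("-e", shQuote("cat(getOption('encoding'))")),
  stdout = TRUE
)
setwd(old_wd)

# The current session is unaffected by that startup file.
parent_enc <- getOption("encoding")
```

Comparing child_enc with parent_enc shows whether the startup file
changed the option in the freshly started process only.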

TROUBLESHOOTING:
My understanding of the warning

Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/

is that this happens when a source-code comment contains a non-ASCII
byte that looks like the start of a multi-byte UTF-8 sequence
[http://en.wikipedia.org/wiki/UTF-8#Description] but is not followed by
valid continuation bytes.  For instance, consider a source code comment
with an acute accent, stored as the single byte 0xe9 (as in Latin-1):

> raw <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a))
> writeBin(raw, con="foo.R")
> code <- readLines("foo.R")
> code
[1] "# étudiant"

> options(encoding="UTF-8")
> parse("foo.R")
Warning message:
In readLines(file, warn = FALSE) :
  invalid input found on input connection 'foo.R'

> options(encoding="ascii")
> parse("foo.R")
expression()

Reason for the "invalid input": The bit pattern for raw[3:5] is:

> R.utils::intToBin(raw[3:5])
[1] "11101001" "01110100" "01110101"

The first byte (raw[3]) matches the special UTF-8 byte pattern
"1110xxxx", which according to UTF-8 should be followed by two more
bytes with bit patterns "10xxxxxx" and "10xxxxxx"
[http://en.wikipedia.org/wiki/UTF-8#Description].  Since raw[4:5] do
not match those, it's an invalid UTF-8 byte sequence.  So, technically
this does not happen for all comments using acute accents, but it's
very likely.  More generally, a multi-byte UTF-8 sequence is expected
when byte pattern "11xxxxxx" (>= 192 in decimal values) is encountered.
Looking at http://en.wikipedia.org/wiki/ISO/IEC_8859, there are several
characters with this bit pattern in many "Latin-N" encodings, which
I'd assume are still in widespread use among developers.
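
The same check can be sketched in code.  The bit masks below follow the
UTF-8 patterns quoted above; validUTF8() is a base R function that, as
far as I know, is only available from R 3.3.0 onwards, and the second
byte vector (the UTF-8 encoding of the same text) is added here purely
for contrast:

```r
# The bytes of the comment line from the example above.
bytes <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a))
b <- as.integer(bytes)

# 0xE9 matches the three-byte lead pattern 1110xxxx ...
is_lead3 <- bitwAnd(b[3], 0xF0L) == 0xE0L

# ... but the next two bytes are not continuation bytes (10xxxxxx).
is_cont <- bitwAnd(b[4:5], 0xC0L) == 0x80L

# validUTF8() reaches the same verdict for the whole byte sequence.
latin1_ok <- validUTF8(rawToChar(bytes))

# The same text encoded as UTF-8 uses 0xC3 0xA9 for the accented letter.
utf8_bytes <- as.raw(c(0x23, 0x20, 0xc3, 0xa9, 0x74, 0x75, 0x64, 0x69,
                       0x61, 0x6e, 0x74, 0x0a))
utf8_ok <- validUTF8(rawToChar(utf8_bytes))
```

Here is_lead3 is TRUE, both elements of is_cont are FALSE, latin1_ok is
FALSE and utf8_ok is TRUE.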

So, since options(encoding="UTF-8") was set at startup, that is also
the encoding that R tries to use when reading the file.  My suggestion
is that R should always use a pure-ASCII encoding when parsing R code
in packages, because that is what 'Writing R Extensions' says we should
use in the first place.
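
For troubleshooting, a byte-level scan sidesteps the encoding question
entirely.  The helper below is a hypothetical sketch (its name is not
from any package); it reports the line numbers that contain bytes above
127, without distinguishing comments from code.  The tools package also
provides showNonASCIIfile() for a similar purpose.

```r
# Hypothetical helper: line numbers that contain at least one byte > 127.
non_ascii_lines <- function(path) {
  bytes <- as.integer(readBin(path, what = "raw", n = file.size(path)))
  is_lf <- bytes == 10L
  # Line number of each byte = 1 + number of LF bytes strictly before it.
  line_no <- cumsum(is_lf) - is_lf + 1L
  unique(line_no[bytes > 127L])
}

# Try it on the one-line example file used above.
path <- tempfile(fileext = ".R")
writeBin(as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e,
                  0x74, 0x0a)), path)
bad <- non_ascii_lines(path)  # line 1 holds the 0xe9 byte
```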

/Henrik