[Rd] Error: invalid multibyte string

Henrik Bengtsson hb at stat.berkeley.edu
Fri Oct 27 03:38:34 CEST 2006


In Section "Package subdirectories" in "Writing R Extensions" [2.4.0
(2006-10-10)] it says:

"Only ASCII characters (and the control characters tab, formfeed, LF
and CR) should be used in code files. Other characters are accepted in
comments, but then the comments may not be readable in e.g. a UTF-8
locale. Non-ASCII characters in object names will normally [1] fail
when the package is installed. Any byte will be allowed [2] in a
quoted character string (but \uxxxx escapes should not be used), but
non-ASCII character strings may not be usable in some locales and may
display incorrectly in others.", where the footnote [2] reads "It is
good practice to encode them as octal or hex escape sequences".

(Note: ASCII here refers, correctly, to 7-bit ASCII [0-127] and not to
any of the 8-bit ASCII extensions [128-255].)

According to the sentence about quoted strings, the following R/*.R code
should still be valid:

    pads <- sapply(0:64, FUN=function(x) paste(rep("\xFF", x), collapse=""));

or as we first had:

    pads <- sapply(0:64, FUN=function(x) paste(rep("\377", x), collapse=""));
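To make the distinction explicit, here is a minimal sketch (the helper is_ascii() and the variable names are made up for illustration). It checks that both literals are pure 7-bit ASCII as source text and that both evaluate to the same single byte 0xFF. It assumes an R version whose parser accepts "\xFF" in the current locale, which is exactly what is in question in this thread.

```r
# The literals as they appear in the source file: only backslash escapes,
# so every byte of the *source text* is 7-bit ASCII.
src_hex <- '"\\xFF"'
src_oct <- '"\\377"'

# Hypothetical helper: TRUE if every byte of s is in 0..127.
is_ascii <- function(s) all(as.integer(charToRaw(s)) <= 127L)
stopifnot(is_ascii(src_hex), is_ascii(src_oct))

# The *values* of the two literals: each is the single byte 0xFF.
# (Assumes the parser accepts the escape in this locale.)
val_hex <- charToRaw("\xFF")
val_oct <- charToRaw("\377")
stopifnot(identical(val_hex, val_oct), identical(val_hex, as.raw(0xFF)))
```

So the code file itself stays within the "only ASCII characters" rule; it is only the resulting string value that contains a non-ASCII byte.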

Is R CMD check, or more precisely
tools:::.check_packages_used(dir=\"${pkgdir}\") (called from the Perl
script bin/check), too picky?  Inside check_packages_used() there is an
internal function, find_bad_exprs(), that tries to identify "bad
expressions"; the complaint is raised when it tries to deparse() the
parse()d code above.  This is exactly what Peter pointed out in his
example.
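A rough way to see which step complains, without going through R CMD check, is to wrap parse() and deparse() separately in tryCatch(). This is only a sketch of the parse-then-deparse sequence described above, not the actual code of find_bad_exprs(); the variable names are made up, and the outcome depends on the R version and the locale.

```r
# Source text of a string literal with a hex escape for a byte >= 0x80.
code <- '"\\xFF"'

# Step 1: parse the text, capturing an error instead of stopping.
parsed <- tryCatch(parse(text = code), error = function(e) e)

if (inherits(parsed, "error")) {
  cat("parse failed:", conditionMessage(parsed), "\n")
} else {
  # Step 2: deparse the parsed expression, again capturing any error.
  deparsed <- tryCatch(deparse(parsed[[1]]), error = function(e) e)
  if (inherits(deparsed, "error")) {
    cat("deparse failed:", conditionMessage(deparsed), "\n")
  } else {
    cat("deparse gave:", deparsed, "\n")
  }
}
```

If the parse step succeeds and only the deparse step reports "invalid multibyte string", that matches Peter's observation below.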

Cheers

Henrik





On 26 Oct 2006 18:43:45 +0200, Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:
> Thomas Lumley <tlumley at u.washington.edu> writes:
>
> > On Thu, 26 Oct 2006, Henrik Bengtsson wrote:
> >
> > > I'm observing the following on different platforms:
> > >
> > >> parse(text='"\\x7F"')
> > > expression("\177")
> > >> parse(text='"\\x80"')
> > > Error: invalid multibyte string
> >
> > Yes. It's an invalid multibyte string.  In UTF-8 a single byte is a valid
> > character string only if it is below x80, so x7F is fine but x80 is not.
> > In fact x80 is not the leading byte of any valid UTF-8 character.
> >
> > You have to work out what the Unicode code point is for whatever character
> > you were expecting to be x80 and convert that to UTF-8.
> >
> > I'm surprised that one of your UTF-8 machines worked -- I don't think it
> > should.
>
> Interestingly, we can parse, but not print or deparse:
>
> > x<-parse(text='"\\x80"')
> > x
> Error: invalid multibyte string
> > z <- deparse(x)
> Error in deparse(x) : invalid multibyte string
> > cat(x[[1]])
> �>
>
> (the last line has a funny little cedilla-like symbol in pos 1)
>
> --
>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
>



