[Rd] Non-ASCII chars in R code
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri May 19 12:04:15 CEST 2006
A little more digging revealed a Unix/Windows discrepancy here.
On Unix, saving images and preparing for lazy-loading/lazydata is done with
LC_ALL=C; on Windows, it is done with LC_COLLATE=C. I will change Windows to match.
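The practical difference is that LC_COLLATE only governs string ordering,
whereas LC_ALL also resets LC_CTYPE, which controls how bytes are treated as
characters. A minimal R sketch of that distinction, assuming a session in
which Sys.setlocale() is allowed to change categories (the variable names are
illustrative only):

```r
# Remember the current settings so they can be restored afterwards.
old_collate  <- Sys.getlocale("LC_COLLATE")
ctype_before <- Sys.getlocale("LC_CTYPE")

# Changing only the collation category, as the Windows build scripts did ...
Sys.setlocale("LC_COLLATE", "C")

# ... leaves the character-type category untouched.
ctype_after <- Sys.getlocale("LC_CTYPE")

# Restore the original collation setting.
Sys.setlocale("LC_COLLATE", old_collate)

unchanged <- identical(ctype_before, ctype_after)
```

So under LC_COLLATE=C alone, code is still parsed according to whatever
LC_CTYPE the installing user happened to have.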
Unfortunately, how the C locale is implemented is OS-dependent. Strictly
it should not allow bytes 0x80 to 0xff, but it does on some OSes (including
Windows). So the strict consequences of this should be that, when using
lazy-loading or a saved image:
- all names have to be ASCII alphanumeric
- \uxxxx sequences are not allowed except \u007f and lower (they are not
valid at all in a C locale prior to 2.3.1 so I would not expect to see
them in a package).
- bytes in character strings are copied byte for byte.
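The first and third points can be checked at the byte level. The following is
only a sketch, not the test R itself performs: the helper names has_non_ascii
and is_ascii_name are made up for illustration, and allowing '.' and '_' in
names alongside ASCII letters and digits is my assumption about what "ASCII
alphanumeric" should mean for R names.

```r
# TRUE for each string that contains a byte in the range 0x80 to 0xff.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Bytes permitted in a name (assumption: ASCII letters, digits, '.' and '_').
allowed_bytes <- as.integer(
  charToRaw(paste(c(LETTERS, letters, 0:9, ".", "_"), collapse = ""))
)

# TRUE for each non-empty string made up only of the permitted bytes.
is_ascii_name <- function(x) {
  vapply(x, function(s) {
    b <- as.integer(charToRaw(s))
    length(b) > 0L && all(b %in% allowed_bytes)
  }, logical(1), USE.NAMES = FALSE)
}

# "caf" followed by the Latin-1 byte for e-acute, built from raw bytes so the
# example does not depend on the encoding of this file.
accented <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

string_flags <- has_non_ascii(c("abc", accented))
name_flags   <- is_ascii_name(c("my.fun_2", accented, ""))
```

Working on raw bytes sidesteps the question of which locale the session is in,
which is the point at issue here.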
This leaves an inconsistency between packages which use lazy-loading /
save image and those which do not. We could resolve that by switching to
the C locale when loading R code in packages (or, better, R code that was
not a loader stub): I didn't think that would be worthwhile but in fact 5
of the packages listed are small enough not to be lazy-loaded.
The other consequence is that the only way we allow packages to have
object names which are not ASCII alphanumeric is to disable lazy loading.
One possibility is to allow a package to specify its required locale for
loading in the DESCRIPTION file, and make use of that.
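As a rough sketch of what that might look like: the field name LoadLocale, the
helper with_ctype and the locale string are all hypothetical, nothing of the
kind exists in R at present, and falling back to "C" when the field is absent
is an assumption.

```r
# A throwaway DESCRIPTION file with a hypothetical 'LoadLocale' field.
desc_file <- tempfile()
writeLines(c("Package: demo",
             "Version: 0.1",
             "LoadLocale: fr_FR.ISO8859-1"), desc_file)

# read.dcf() returns NA for any requested field that is missing.
desc   <- read.dcf(desc_file, fields = c("Package", "LoadLocale"))
wanted <- unname(desc[1, "LoadLocale"])
load_locale <- if (is.na(wanted)) "C" else wanted

# Evaluate 'expr' with LC_CTYPE temporarily set to 'locale', then restore it.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old))
  # Sys.setlocale() returns "" (with a warning) if the locale is unavailable.
  if (!nzchar(suppressWarnings(Sys.setlocale("LC_CTYPE", locale))))
    stop("locale not available: ", locale)
  expr  # lazy evaluation: 'expr' is only forced here, under the new locale
}

# Usage with a locale that every platform provides.
ctype_inside <- with_ctype("C", Sys.getlocale("LC_CTYPE"))
```

A real implementation would still have to decide what to do when the requested
locale is not installed on the user's machine.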
I am inclined to do nothing about these issues unless people have an
actual need to have packages tailored to a non-English locale.
On Wed, 17 May 2006, Prof Brian Ripley wrote:
> The report on R-help about problems loading package irr (in a UTF-8 locale,
> it seemed) prompted me to look a little deeper. There are quite a few
> packages with Latin-1 chars in their .R files, and a couple in UTF-8.
>
> Apart from non-ASCII chars in comments, this is a problem as the code
> concerned cannot be represented in some locales R runs in (for example
> Japanese on Windows). It happens that irr is so small that lazy-loading is
> not used, but when lazy-loading or a saved image is used, the locale in use
> when the package is installed determines how the code is parsed (and may not
> be the same as when the package is used, and indeed it is not uncommon on
> Linux/Unix systems for different users to use different locales).
>
> This means that using non-ASCII chars is not portable, and I've added code to
> R CMD check in R-devel to warn about such usage. In the examples I have
> investigated the usages have been
>
> - messages in a non-English language, typically French.
> - startup messages with people's names.
> - use of characters that I can only guess are intended to be in the
> WinAnsi encoding, e.g. a copyright symbol.
>
> The only reason I have not made this an error is that people might want to
> produce packages for a known locale, e.g. a student class, but perhaps it
> should be an error for packages submitted to CRAN.
>
> I do not believe there is much we can do about this: messages which are not
> entirely in ASCII cannot be displayed on many R platforms and it seems
> incorrect to allow French messages and not Japanese ones.
>
> The packages currently throwing warnings are
>
> FactoMineR FunCluster JointGLM LoopAnalyst Sciviews ade4 adehabitat ape
> climatol crossdes deal grasper irr lsa mvrpart pastecs sn surveillance
> truncgof
>
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595