[Rd] Non-ASCII chars in R code

Fri May 19 12:04:15 CEST 2006

A little more digging revealed a Unix/Windows discrepancy here.

On Unix, saving images and preparing for lazy-loading/lazy-data is done with
LC_ALL=C; on Windows it is done with LC_COLLATE=C.  I will change Windows to
match.
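
As a rough in-session illustration of the difference (a sketch only: the
install scripts set environment variables, whereas this uses Sys.setlocale(),
and the variable names are purely illustrative), setting LC_COLLATE alone
leaves LC_CTYPE untouched, whereas "LC_ALL" switches it as well:

```r
# Categories that Sys.setlocale("LC_ALL", ...) changes; saved for restoring.
cats <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_TIME")
old  <- vapply(cats, Sys.getlocale, character(1))

# Windows-style setting: only collation is switched to C.
Sys.setlocale("LC_COLLATE", "C")
ctype_after_collate <- Sys.getlocale("LC_CTYPE")   # same as before

# Unix-style setting: LC_CTYPE is switched to C too.
Sys.setlocale("LC_ALL", "C")
ctype_after_all <- Sys.getlocale("LC_CTYPE")

# Put the session back as it was.
for (k in cats) Sys.setlocale(k, old[[k]])
```

LC_CTYPE is the category that governs how bytes are interpreted as
characters, so it is the one that matters when code is parsed.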

Unfortunately, how the C locale is implemented is OS-dependent.  Strictly,
it should not allow bytes 0x80 to 0xff, but it does on some OSes (including
Windows).  So the strict consequences of this should be that, when using
lazy-loading or a saved image:

- all names have to be ASCII alphanumeric
- \uxxxx sequences are not allowed except \u007f and lower (they are not
   valid at all in a C locale prior to R 2.3.1, so I would not expect to see
   them in a package).
- bytes in character strings are copied byte for byte.
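
To see which lines of a file would be affected, a byte-level test is enough.
The following is a minimal sketch, not the check used by R CMD check; the
helper name has_non_ascii is made up for illustration, and it deliberately
looks only at raw bytes, so \uxxxx escapes written in ASCII are not detected:

```r
# TRUE for each element of 'x' containing a byte in the range 0x80 to 0xff.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# A single Latin-1 byte (e-acute), built from its raw value.
e_acute <- rawToChar(as.raw(0xe9))

code <- c('msg <- "hello"',                      # plain ASCII
          'msg <- "caf\\u00e9"',                 # escape spelled out in ASCII
          paste0('msg <- "caf', e_acute, '"'))   # contains the byte 0xe9

flags <- has_non_ascii(code)
```

Only the third line is flagged; the second would need a separate test for
escapes above \u007f.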

This leaves an inconsistency between packages which use lazy-loading or a
saved image and those which do not.  We could resolve that by switching to
the C locale when loading R code in packages (or, better, R code that is
not a loader stub).  I didn't think that would be worthwhile, but in fact 5
of the packages listed are small enough not to be lazy-loaded.
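
Switching to the C locale while loading code could look roughly like the
following.  This is only a sketch of the idea, not how R loads packages: the
function name source_in_C_locale is hypothetical, and it assumes that
changing LC_CTYPE alone is sufficient for the parsing step.

```r
# Evaluate the code in 'file' with LC_CTYPE temporarily set to "C".
source_in_C_locale <- function(file, envir = new.env()) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous setting even if evaluating the file fails.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  sys.source(file, envir = envir)
  invisible(envir)
}

# Usage: a throwaway ASCII-only file standing in for a package's R code.
f <- tempfile(fileext = ".R")
writeLines("x <- 1 + 1", f)
e <- source_in_C_locale(f)
```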

The other consequence is that the only way we allow packages to have 
object names which are not ASCII alphanumeric is to disable lazy loading.
One possibility is to allow a package to specify its required locale for 
loading in the DESCRIPTION file, and make use of that.

I am inclined to do nothing about these issues unless people have an
actual need for packages tailored to a non-English locale.

On Wed, 17 May 2006, Prof Brian Ripley wrote:

> The report on R-help about problems loading package irr (in a UTF-8 locale,
> it seemed) prompted me to look a little deeper.  There are quite a few 
> packages with Latin-1 chars in their .R files, and a couple in UTF-8.
>
> Apart from non-ASCII chars in comments, this is a problem as the code 
> concerned cannot be represented in some locales R runs in (for example 
> Japanese on Windows).  It happens that irr is so small that lazy-loading is 
> not used, but when lazy-loading or a saved image is used, the locale in use 
> when the package is installed determines how the code is parsed (and may not 
> be the same as when the package is used, and indeed it is not uncommon on 
> Linux/Unix systems for different users to use different locales).
>
> This means that using non-ASCII chars is not portable, and I've added code to 
> R CMD check in R-devel to warn about such usage.  In the examples I have 
> investigated the usages have been
>
> - messages in a non-English language, typically French.
> - startup messages with people's names.
> - use of characters that I can only guess are intended to be in the
>  WinAnsi encoding, e.g. a copyright symbol.
>
> The only reason I have not made this an error is that people might want to 
> produce packages for a known locale, e.g. a student class, but perhaps it 
> should be an error for packages submitted to CRAN.
>
> I do not believe there is much we can do about this: messages which are not 
> entirely in ASCII cannot be displayed on many R platforms and it seems 
> incorrect to allow French messages and not Japanese ones.
>
> The packages currently throwing warnings are
>
> FactoMineR FunCluster JointGLM LoopAnalyst Sciviews ade4 adehabitat ape 
> climatol crossdes deal grasper irr lsa mvrpart pastecs sn surveillance 
> truncgof
>
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595