[Rd] locales and readLines

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Sep 3 11:40:39 CEST 2007


I think you need to delimit a bit more what you want to do.  It is 
difficult in general to tell what encoding a text file is in, and very 
much harder if this is a data file containing only a small proportion of 
non-ASCII text, which might not even be words in a human language (but 
abbreviations or acronyms).

If you have experience with systems that do try to guess (e.g. Unix 
'file') you will know that they are pretty fallible.  There are Perl 
modules available, for example: I checked Encode::Guess which says

    ·   Because of the algorithm used, ISO-8859 series and other single-
        byte encodings do not work well unless either one of ISO-8859 is
        the only one suspect (besides ascii and utf8).

    ·   Do not mix national standard encodings and the corresponding vendor
        encodings.

    It is, after all, just a guess.  You should always be explicit when it
    comes to encodings.  But there are some environments, especially
    Japanese ones, in which guess-coding is a must.  Use this module with
    care.


I think you may have missed that the main way to specify an encoding for a 
file is

readLines(file("fn", encoding="latin2"))

and not the encoding argument to readLines: as the help page makes quite 
clear, the latter does not re-encode, and it only accepts "UTF-8" and 
"latin1".
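To make the distinction concrete (with "fn" standing in for an actual 
file name):

```r
## Re-encodes from latin2 to the current locale as the lines are read:
x <- readLines(file("fn", encoding = "latin2"))

## Does NOT re-encode: it merely marks the strings it has read as
## latin1, and only "latin1" and "UTF-8" are accepted here:
y <- readLines("fn", encoding = "latin1")
```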

The author of a package that offers facilities to read non-ASCII text does 
need to offer the user a way to specify the encoding.  I think suggesting 
that is 'an extra burden' is exceedingly negative: you could rather be 
thankful that R provides the facilities these days to do so.  And if the 
package or its examples contain non-ASCII character strings, it is de 
rigueur for the author to consider how they might work on other people's 
systems.

Notice that source() already has some of the 'smarts' you are asking about 
if 'file' is a file and not a connection, and you could provide a similar 
wrapper for readLines.  That is useful either when the user can specify a 
small set of possible encodings or when such a set can be deduced from the 
locale.  If the concern is that the file might be UTF-8 or latin1, this 
is often a good guess (latin1 files can be valid UTF-8, but rarely are). 
However, if you have Russian text which might be in one of several 8-bit 
encodings, the only way I know to decide between them is to see whether 
the results make sense (and if they are acronyms, they may make sense in 
all of the possible encodings).
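A minimal sketch of such a wrapper, assuming the only candidate 
encodings are UTF-8 and latin1 (the function name and fallback policy 
are mine, not anything in R itself):

```r
## Sketch: read a file, guessing between UTF-8 and latin1.
## iconv() returns NA for strings that are not valid in the 'from'
## encoding, which gives a cheap validity test for UTF-8.
readLinesGuess <- function(fn) {
    lines <- readLines(fn, warn = FALSE)
    if (any(is.na(iconv(lines, "UTF-8", "UTF-8")))) {
        ## Not valid UTF-8: assume latin1 and re-read through a
        ## connection that re-encodes to the current locale.
        con <- file(fn, encoding = "latin1")
        on.exit(close(con))
        lines <- readLines(con, warn = FALSE)
    }
    lines
}
```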

BTW, to guess an encoding you really need to process all the input, so 
this is not appropriate for general connections, and for large files it 
might be better to do it outside R, e.g. via Perl.
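One quick external check along those lines is to attempt a no-op 
conversion with iconv, which fails on invalid byte sequences; the sample 
file below is just for illustration:

```shell
# Write a latin1-encoded sample: \351 is latin1 e-acute, invalid as UTF-8.
printf 'caf\351\n' > /tmp/latin1_sample.txt

# iconv exits non-zero if the input is not valid in the source encoding,
# so a UTF-8 -> UTF-8 "conversion" doubles as a validity test.
if iconv -f UTF-8 -t UTF-8 /tmp/latin1_sample.txt > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```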

I would say minimal good practice would be to

- allow the user to specify the encoding of text files.
- ensure you have specified the encoding of all non-ASCII data in your
   package (which includes documentation, for example).
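The first point can be as simple as passing an encoding argument through 
to the connection, which does the re-encoding; a sketch (the function 
name is hypothetical):

```r
## Expose the file's encoding to the user and hand it to file(),
## so the connection re-encodes to the current locale as it reads.
readMyData <- function(fn, encoding = "") {
    con <- file(fn, encoding = encoding)
    on.exit(close(con))
    readLines(con, warn = FALSE)
}

## e.g. readMyData("data.txt", encoding = "latin2")
```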

I'd leave guessing to others: as
http://www.cs.tut.fi/~jkorpela/chars.html says,

   It is hopefully obvious from the preceding discussion that a sequence of
   octets can be interpreted in a multitude of ways when processed as
   character data. By looking at the octet sequence only, you cannot even
   know whether each octet presents one character or just part of a
   two-octet presentation of a character, or something more complicated.
   Sometimes one can guess the encoding, but data processing and transfer
   shouldn't be guesswork.



On Fri, 31 Aug 2007, Martin Morgan wrote:

> R-developers,
>
> I'm looking for some 'best practices', or perhaps an upstream solution
> (I have a deja vu about this, so sorry if it's already been asked).
> Problems occur when a file is encoded as latin1, but the user has a
> UTF-8 locale (or I guess more generally when the input locale does not
> match R's).  Here are two examples from the Bioconductor help list:
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-August/018947.html
>
> (the relevant command is library(GEOquery); gse <- getGEO('GSE94'))
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-July/018204.html
>
> I think solutions are:
>
> * Specify the encoding in readLines.
>
> * Convert the input using iconv.
>
> * Tell the user to set their locale to match the input file (!)
>
> Unfortunately, these (1 & 2, anyway) place extra burden on the package
> author, to become educated about locales, the encoding conventions of
> the files they read, and to know how R deals with encodings.
>
> Are there other / better solutions? Any chance for some (additional)
> 'smarts' when reading files?
>
> Martin
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

