[R] Purpose of readLines(..., encoding=)?

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Apr 5 15:16:51 CEST 2014


UTF-8 is treated specially by readLines(), originally to allow for UTF-8 
strings on Windows.  See the NEWS for 2.12.0.

That is not the case for encoding = "latin1".

If you have a Latin-1 file in a UTF-8 locale, then

readLines(x, encoding = "latin1")

stores the strings in Latin-1 and marks them, and

readLines(file(x, encoding = "latin1"))

translates the strings to UTF-8 and marks them as such.
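
The difference can be seen by inspecting the bytes. The following is only a
small sketch, not taken from the R sources: it assumes a UTF-8 locale, and
the temporary file and the variable names (path, marked, converted) are
made up for illustration.

```r
# Write a tiny Latin-1 file: "caf" + e-acute as the single byte 0xE9 + newline.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), path)

# First form: the bytes are kept as read and the string is marked as latin1.
marked <- readLines(path, encoding = "latin1")
print(Encoding(marked))
print(charToRaw(marked))     # still ends in the single byte e9

# Second form: the connection re-encodes while reading.
# Assumption: in a UTF-8 locale the result is UTF-8 (e-acute becomes c3 a9).
con <- file(path, encoding = "latin1")
converted <- readLines(con)
close(con)                   # the connection is ours to clean up
print(Encoding(converted))
print(charToRaw(converted))
```

Both strings should print the same way in a UTF-8 locale; only the stored
bytes and the encoding mark differ.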

There can be advantages to the first, including speed and less storage 
space.  There can also be advantages to the second (e.g. translating once 
may be better if the strings are to be manipulated by character-level 
functions).

Prior to 2.12.0 there were differences for UTF-8 files, and even now
readLines(x, encoding="UTF-8") is more convenient (no connection left 
open, as your first example will leave).
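
If the file(x, encoding=) form is wanted anyway, the connection can be
closed explicitly. A possible sketch follows; read_reencoded is a
hypothetical helper name, not a base R function, and the temporary file is
only there to make the example self-contained.

```r
# Hypothetical helper (not part of base R): read through a re-encoding
# connection and always close that connection afterwards.
read_reencoded <- function(path, file_encoding) {
  con <- file(path, encoding = file_encoding)
  on.exit(close(con), add = TRUE)   # runs even if readLines() fails
  readLines(con)
}

path <- tempfile(fileext = ".txt")
writeLines("plain ASCII line", path)

n_before <- nrow(showConnections(all = TRUE))
lines <- read_reencoded(path, "UTF-8")
n_after <- nrow(showConnections(all = TRUE))

print(lines)
print(n_after == n_before)   # no extra connection is left behind
```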

On 05/04/2014 11:54, Milan Bouchet-Valat wrote:
> Hi!
>
> I'm wondering what's the use of the 'encoding' argument to readLines(x),
> as opposed to readLines(file(x, encoding=)). The same question applies
> to read.table()'s 'encoding' vs 'fileEncoding' arguments. AFAIK only the
> latter is able to re-encode the read text into the internal
> representation used by R (let's say when reading files in encodings
> other than latin1 and UTF-8). But then what's the purpose of the former?
>
> ?readLines says:
> encoding: encoding to be assumed for input strings.  It is used to mark
>            character strings as known to be in Latin-1 or UTF-8: it is
>            not used to re-encode the input.  To do the latter, specify
>            the encoding as part of the connection ‘con’ or via
>            ‘options(encoding=)’: see the example under ‘file’.
>
> But if I have a UTF-8 text file to read, couldn't I use
> readLines(file(x, encoding="UTF-8"))
> instead of
> readLines(x, encoding="UTF-8")
>
> In my experience resulting character strings are marked as UTF-8 where
> needed as well.
>
> The reason I'm asking this is because I need to decide whether I should
> allow users of a tm source plug-in to pass both (à la 'encoding' vs
> 'fileEncoding') or whether I could safely skip the first one.
>
>
> Thanks for your help
>


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



