[R] Purpose of readLines(..., encoding=)?

Milan Bouchet-Valat nalimilan at club.fr
Sat Apr 5 17:15:36 CEST 2014


On Saturday 5 April 2014 at 14:16 +0100, Prof Brian Ripley wrote:
> UTF-8 is treated specially by readLines(), originally to allow for UTF-8 
> strings on Windows.  See the NEWS for 2.12.0.
> 
> That is not the case for encoding = "latin1".
> 
> If you have a Latin-1 file in a UTF-8 locale, then
> 
> readLines(x, encoding = "latin1")
> 
> stores the strings in Latin-1 and marks them, and
> 
> readLines(file(x, encoding = "latin1"))
> 
> translates the strings to UTF-8 and marks them as such.
> 
> There can be advantages to the first, including speed and less storage 
> space.  Also to the second (e.g. translating once may be better if the 
> strings are to be manipulated by character-level functions).
Thanks for the detailed explanation. So in the case at hand I will only
retain 'fileEncoding', as tm corpora are expected to be converted to
UTF-8 (anything other than that is asking for trouble anyway).
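
For the archives, the two approaches described above can be sketched as
follows. This is only a minimal illustration: the temporary file and its
contents are made up for the example, and how the converted strings end
up marked depends on the locale the session runs in.

```r
# Write a one-line Latin-1 file (made-up sample data): "caf\xe9" + newline.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), tmp)

# 1) 'encoding' on readLines(): bytes are stored unchanged and the
#    strings are only marked as Latin-1.
marked <- readLines(tmp, encoding = "latin1")
print(Encoding(marked))
print(charToRaw(marked))

# 2) 'encoding' on the connection: the input is re-encoded while it is
#    read, and the connection has to be closed by hand afterwards.
con <- file(tmp, encoding = "latin1")
converted <- readLines(con)
close(con)
print(Encoding(converted))

# A marked string can still be translated explicitly later on.
utf8 <- enc2utf8(marked)

unlink(tmp)
```

In a UTF-8 locale the second form gives the same result as calling
enc2utf8() on the first, which is why it is the one that matters for tm.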

> Prior to 2.12.0 there were differences for UTF-8 files, and even now
> readLines(x, encoding="UTF-8") is more convenient (no encoding left open 
> as your first example will).
I guess you meant "no connection left open"?

Indeed. I think it would be nice to add a 'fileEncoding' argument to
readLines(), just like the one passed to read.table(). This would be
more convenient than creating the connection and closing it after you're
done, and it would reduce the confusion for newcomers who try using
'encoding' when they really need the other solution. I could prepare a
patch to do this if you want.
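
To make the idea concrete, here is a rough sketch of the behaviour I have
in mind, written as a wrapper rather than as a patch. The name
readLinesEnc() is hypothetical and nothing like it exists in base R; the
handling of an empty 'fileEncoding' is modelled on what read.table() does,
as an assumption about how such an argument would be expected to behave.

```r
# Hypothetical helper (not part of base R): readLines() with a
# read.table()-style 'fileEncoding' argument.
readLinesEnc <- function(path, fileEncoding = "", ...) {
  # Only request re-encoding when an encoding was actually given.
  con <- if (nzchar(fileEncoding)) {
    file(path, open = "rt", encoding = fileEncoding)
  } else {
    file(path, open = "rt")
  }
  # Close the connection even if readLines() fails.
  on.exit(close(con))
  readLines(con, ...)
}

# Usage on a small made-up file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta"), tmp)
all_lines <- readLinesEnc(tmp)
first_line <- readLinesEnc(tmp, fileEncoding = "latin1", n = 1L)
unlink(tmp)
```

The caller never sees the connection, so there is nothing left open to
forget about.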


Regards



