[R] issue with "strange" characters (locale settings)
R.T.A.J.Leenders
r.t.a.j.leenders at rug.nl
Wed May 4 11:57:46 CEST 2011
WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows
and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my
code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
Island Nickname
Location
1 Hawaiûi[7] The Big Island 19ð34′N 155ð30′W / 19.567
ðN 155.5ðW / 19.567; -155.5
2 Maui[8] The Valley Isle 20ð48′N 156ð20′W / 20.8ðN
156.333ðW / 20.8; -156.333
3 Kahoûolawe[9] The Target Isle 20ð33′N 156ð36′W / 20.55
ðN 156.6ðW / 20.55; -156.6
4 LÃnaûi[10] The Pineapple Isle 20ð50′N 156ð56′W / 20.833ðN 15
6.933ðW / 20.833; -156.933
5 Molokaûi[11] The Friendly Isle 21ð08′N 157ð02′W / 21.133ðN 1
57.033ðW / 21.133; -157.033
6 Oûahu[12] The Gathering Place 21ð28′N 157ð59′W / 21.467ðN 1
57.983ðW / 21.467; -157.983
7 Kauaûi[13] The Garden Isle 22ð05′N 159ð30′W / 22.083
ðN 159.5ðW / 22.083; -159.5
8 Niûihau[14] The Forbidden Isle 21ð54′N 160ð10′W / 21.9ðN
160.167ðW / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried
readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding =
"UTF-8")
but that didn't help.
It seems to me that the problem may lie in how R interacts with the
Windows character-set settings.
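To illustrate the kind of mismatch I suspect, here is a minimal sketch in base R. It is not a reproduction of the exact output above: it only shows how the UTF-8 bytes of one accented character become two "weird" characters when they are read as Latin-1, which is close to the code page 1252 my machines use. The variable names are mine and purely illustrative.

```r
# "é" (U+00E9) is stored as the two bytes c3 a9 in UTF-8.
utf8_bytes <- charToRaw(enc2utf8("\u00e9"))

# rawToChar() returns a string with no declared encoding.
undeclared <- rawToChar(utf8_bytes)

# Misreading those bytes as Latin-1 turns one character into two.
misread <- iconv(undeclared, from = "latin1", to = "UTF-8")

# Declaring the correct encoding recovers the original character.
declared <- undeclared
Encoding(declared) <- "UTF-8"

nchar(misread, type = "chars")   # number of characters after misreading
nchar(declared, type = "chars")  # number of characters after declaring UTF-8
```

If something like this is what happens to the scraped text, the bytes themselves would be intact and only their declared encoding would be wrong.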
sessionInfo() gives
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
>
I have also attempted to let R use another setting by entering:
Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
>
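For reference, a small sketch of how one could check whether the OS accepts a given locale name without leaving the session in a different locale. Note that locale_available is a hypothetical helper name of mine, not part of R. It relies on Sys.setlocale() returning an empty string when the request cannot be honored, as in the response above, and it only probes LC_CTYPE rather than LC_ALL.

```r
# Hypothetical helper: TRUE if the OS accepts `name` for LC_CTYPE.
locale_available <- function(name) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the original setting when the function exits.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the request fails.
  result <- suppressWarnings(Sys.setlocale("LC_CTYPE", name))
  nzchar(result)
}

locale_available("en_US.UTF-8")  # the name that was rejected above
l10n_info()                      # reports whether the session is UTF-8 or Latin-1
```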
In addition, I have attempted to make the change directly from the Windows
command prompt, using: "chcp 65001" and variations of that, but that didn't
change anything.
I have searched the list and the web and have found others raising
similar issues, but have not been able to find a solution. It looks like this
is an issue of how Windows and R interact. Unfortunately, all three
computers at my disposal have this problem. It occurs both under WinXP-x32
and under Win7-x86.
Is there a way to make R override the windows settings or can the issue be
solved otherwise?
I have also tried other websites, and the issue occurs every time when there
is an é, Ì, À, î, et cetera in the text-to-be-scraped.
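To narrow down which entries are affected, a check along these lines can flag strings that contain non-ASCII characters. This is only a sketch in base R: the sample vector is made up for illustration, and it assumes the strings are valid UTF-8.

```r
# Illustrative sample, written with \u escapes so the script itself is ASCII.
sample_text <- c("Maui", "caf\u00e9", "na\u00efve", "Oahu")

# iconv() returns NA for elements that cannot be represented in ASCII.
has_non_ascii <- is.na(iconv(enc2utf8(sample_text), from = "UTF-8", to = "ASCII"))

# Show only the entries that contain non-ASCII characters.
sample_text[has_non_ascii]
```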
Thank you,
Roger