[R] issue with "strange" characters (readHTMLTable)

R.T.A.J.Leenders r.t.a.j.leenders at rug.nl
Thu May 5 11:33:45 CEST 2011


   Thank you. The line of code you give certainly resolves several of the
   issues.
   I didn't realize that font support is such a tough matter to implement. Let
   me express my gratitude to those who provide this for us in R.
   On 04-05-11, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:

   Oh, please!
   This is about the contributed package XML, not R and not Windows.
   Some of us have worked very hard to provide reasonable font support in R,
   including on Windows.  We are given exceedingly little credit, just
   the brickbats for things for which we are not responsible.  (We even work
   hard to port XML to Windows for you, again with almost zero credit.)
   That URL is a page in UTF-8, as its header says.  We have provided many ways
   to work with UTF-8 on Windows, but it seems readHTMLTable() is not making
   use of them.
   You need to run iconv() on the strings in your object (which as it has
   factors, are the levels).  When you do so, you will discover that page
   contains characters not in your native charset (I presume, not having your
   locale).
   What you can do, in Rgui only, is
   for (n in names(Islands)) Encoding(levels(Islands[[n]])) <- "UTF-8"
   but likely there are still characters it will not know how to display.
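
   For illustration, the iconv() step described above could be sketched as
   follows. This is only a minimal sketch under stated assumptions, not the
   XML package's own method: convert_levels() is a hypothetical helper, the
   two-row data frame is a stand-in for the scraped table, and the factor
   levels are assumed to hold UTF-8 text. The sub argument makes iconv()
   substitute bytes the target charset cannot represent rather than return NA.

```r
# Hypothetical helper (not part of R or the XML package): re-encode the
# levels of every factor column from UTF-8 into the encoding 'to'.
# to = "" means the native encoding of the current locale.
convert_levels <- function(df, to = "", sub = "?") {
  for (n in names(df)) {
    if (is.factor(df[[n]])) {
      # sub replaces each non-convertible byte; with sub = NA, iconv()
      # would return NA for any string it cannot convert
      levels(df[[n]]) <- iconv(levels(df[[n]]), from = "UTF-8",
                               to = to, sub = sub)
    }
  }
  df
}

# Stand-in for the scraped table (assumed data, two rows only).
# U+02BB (the 'okina) has no equivalent in ASCII or Windows-1252.
Islands <- data.frame(Island = c("Hawai\u02bbi", "Maui"),
                      stringsAsFactors = TRUE)

converted <- convert_levels(Islands, to = "ASCII")
print(levels(converted$Island))
```

   With to = "" the levels are converted to the native charset of the session
   (Windows-1252 in the locale quoted below), so any substituted characters
   show which parts of the page that charset cannot represent.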
   On Wed, 4 May 2011, R.T.A.J.Leenders wrote:
   >
   >  WinXP-x32, R-2.13.0
   >  Dear list,
   >   I have a problem that (I think) relates to the interaction between
   >  Windows and R.
   >  I am trying to scrape a table with data on the Hawai'ian Islands. This is
   >  my code:
   >  library(XML)
   >  u <- "http://en.wikipedia.org/wiki/Hawaii"
   >  tables <- readHTMLTable(u)
   >  Islands <- tables[[5]]
   >  The output is (first set of columns):
   >  > Islands
   >         Island            Nickname                      Location
   >1       HawaiÃ?»i[7]        The   Big   Island       19Ã?°34′N
   155Ã?°30′W / 19.567
   >�°N 155.5�°W / 19.567; -155.5
   >2           Maui[8]        The    Valley   Isle       20Ã?°48′N
   156Ã?°20′W / 20.8Ã?°N
   >156.333�°W / 20.8; -156.333
   >3   KahoÃ?»olawe[9]       The   Target   Isle        20Ã?°33′N
   156Ã?°36′W / 20.55
   >�°N 156.6�°W / 20.55; -156.6
   >4      LÃ?naÃ?»i[10]     The    Pineapple    Isle    20Ã?°50′N
   156Ã?°56′W / 20.833Ã?°N 15
   >6.933�°W / 20.833; -156.933
   >5     MolokaÃ?»i[11]      The    Friendly    Isle    21Ã?°08′N
   157Ã?°02′W / 21.133Ã?°N 1
   >57.033�°W / 21.133; -157.033
   >6        OÃ?»ahu[12]    The    Gathering    Place    21Ã?°28′N
   157Ã?°59′W / 21.467Ã?°N 1
   >57.983�°W / 21.467; -157.983
   >7       KauaÃ?»i[13]       The   Garden   Isle       22Ã?°05′N
   159Ã?°30′W / 22.083
   >�°N 159.5�°W / 22.083; -159.5
   >8      NiÃ?»ihau[14]    The   Forbidden   Isle       21Ã?°54′N
   160Ã?°10′W / 21.9Ã?°N
   >160.167�°W / 21.9; -160.167
   >
   >  As you can see, there are "weird" characters in there. I have also tried
   >  readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding =
   >  "UTF-8"), but that didn't help.
   >  It seems to me that there may be an issue with how the Windows
   >  character set settings interact with R.
   >  sessionInfo() gives
   >  > sessionInfo()
   >  R version 2.13.0 (2011-04-13)
   >  Platform: i386-pc-mingw32/i386 (32-bit)
   >  locale:
   >  [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
   >  LC_MONETARY=Dutch_Netherlands.1252
   >  [4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252
   >  attached base packages:
   >  [1] stats     graphics  grDevices utils     datasets  methods   base
   >  other attached packages:
   >  [1] XML_3.2-0.2
   >  >
   >  I have also attempted to let R use another setting by entering:
   >  Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
   >  > Sys.setlocale("LC_ALL", "en_US.UTF-8")
   >  [1] ""
   >  Warning message:
   >  In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
   >    OS reports request to set locale to "en_US.UTF-8" cannot be honored
   >  >
   >   In addition, I have attempted to make the change directly from the
   >  Windows command prompt, using: "chcp 65001" and variations of that, but
   >  that didn't change anything.
   >  I have searched the list and the web and have found others raising
   >  similar issues, but have not been able to find a solution. It looks like
   >  this is an issue of how Windows and R interact. Unfortunately, all three
   >  computers at my disposal have this problem. It occurs both under
   >  WinXP-x32 and under Win7-x86.
   >  Is there a way to make R override the Windows settings, or can the issue
   >  be solved otherwise?
   >  I have also tried other websites, and the issue occurs every time
   >  there is an é, Ì, À, î, et cetera in the text to be scraped.
   >  Thank you,
   >  Roger
   >______________________________________________
   >R-help at r-project.org mailing list
   >[2]https://stat.ethz.ch/mailman/listinfo/r-help
   >PLEASE do read the posting guide
   [3]http://www.R-project.org/posting-guide.html
   >and provide commented, minimal, self-contained, reproducible code.
   >
   --
   Brian D. Ripley,                  ripley at stats.ox.ac.uk
   Professor of Applied Statistics,  [4]http://www.stats.ox.ac.uk/~ripley/
   University of Oxford,             Tel:  +44 1865 272861 (self)
   1 South Parks Road,                     +44 1865 272866 (PA)
   Oxford OX1 3TG, UK                Fax:  +44 1865 272595

References

   1. http://en.wikipedia.org/wiki/Hawaii
   2. https://stat.ethz.ch/mailman/listinfo/r-help
   3. http://www.R-project.org/posting-guide.html
   4. http://www.stats.ox.ac.uk/%7Eripley/

