[R] Getting htmlParse to work with Hebrew? (on windows)

Milan Bouchet-Valat nalimilan at club.fr
Fri Feb 22 17:04:10 CET 2013


On Thursday 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
> I had already tried iconv in various ways: same issue, and the result
> has Encoding = "unknown".
> Now I tried sub(): same issue.

This procedure works on Linux, but not on Windows:

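library(RCurl)
library(XML)
u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
a <- getURL(u, .encoding="UTF-8")
# the page declares windows-1251, so convert its contents to UTF-8
a <- iconv(a, "windows-1251", "UTF-8")
# make the charset declared in the HTML match before parsing
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
a2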
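
The two conversion steps in that procedure can be tried without a network connection. What follows is only a sketch in base R: the tiny HTML string is a hypothetical stand-in for the real page, and "CP1251" is assumed to be an encoding name that iconv() accepts on the platform at hand (see iconvlist()).

```r
# Hypothetical stand-in for the downloaded page: a small HTML document that
# declares windows-1251 (this is NOT the real content of the site).
html_utf8 <- paste0(
  "<html><head><meta http-equiv=\"Content-Type\" ",
  "content=\"text/html; charset=windows-1251\">",
  "<title>\u041a\u0443\u043f\u0438\u0442\u044c</title></head></html>"
)

# Simulate what arrives over the wire: the same text as windows-1251 bytes.
page_raw  <- iconv(html_utf8, from = "UTF-8", to = "CP1251", toRaw = TRUE)[[1]]
page_1251 <- rawToChar(page_raw)

# Step 1: convert the contents from windows-1251 to UTF-8.
page <- iconv(page_1251, from = "CP1251", to = "UTF-8")

# Step 2: make the charset declared inside the HTML agree with the contents,
# so that a parser reading the <meta> tag is not misled.
page <- sub("windows-1251", "UTF-8", page, fixed = TRUE)

# Inspect the encoding mark R attached to the result.
Encoding(page)

# With the XML package loaded, the string could then be handed to the parser,
# for example: htmlParse(page, asText = TRUE, encoding = "UTF-8")
```

If step 1 is skipped, the bytes stay in windows-1251 while the declaration says UTF-8, which is one way to end up with the kind of mess described in this thread.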

But maybe the problem is more general, and related to conversion between
encodings on Windows. What looks weird to me is that on Windows, I'm not
able to save a character string to a file in UTF-8, despite what ?file
says:
x <- "Все права защищены"
Encoding(x)
# UTF-8
cat(x, file = con <- file("foo", "w", encoding="UTF-8")); close(con)
x2 <- readLines(con <- file("foo", "r", encoding="UTF-8")); close(con)
Encoding(x2)
# unknown
x2
# [1] "<U+041A><U+0443>..."

I know the problem happens on write because the file cannot be read
correctly on Linux either.

This Windows machine uses Windows Server 2008 with French_France.1252
locale.
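
A possible workaround for the file-writing problem, offered as a sketch rather than a verified fix: instead of asking the connection to re-encode the text, convert the string to UTF-8 explicitly and write its bytes unchanged. The variable names and the temporary file path below are made up for illustration.

```r
# Example string built from \u escapes, so R marks it as UTF-8 in any locale.
x <- "\u0412\u0441\u0435 \u043f\u0440\u0430\u0432\u0430"

# Hypothetical output location.
path <- tempfile(fileext = ".txt")

# Open the connection in binary mode with no 'encoding' argument, then write
# the UTF-8 bytes as they are (useBytes = TRUE suppresses re-encoding).
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the file back, telling R to mark the strings as UTF-8 rather than
# assuming the native encoding of the locale.
x2 <- readLines(path, encoding = "UTF-8")
Encoding(x2)
```

The design choice here is to keep every conversion explicit (enc2utf8() on the way out, an encoding mark on the way in), so the locale's native encoding does not take part in either step.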

> 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
>         On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
>         > Hi Milan,
>         >
>         > a <- getURL(con, .encoding = "UTF-8")
>         > Encoding(a)
>         > > [1] "UTF-8"
>         > a # Here the UTF-8 text looks fine.
>         > htmlParse(a, encoding = "UTF-8") ### again the same encoding issue
>         
>         And what if you try this:
>         a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
>         
>         or this:
>         a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
>         
>         
>         Cheers
>         
>         
>         > >> why didn't getURL() detect and set a's encoding correctly?
>         > I think the problem is with this page, because other sites work fine.
>         >
>         > 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
>         >         On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
>         >         > Hi Milan!
>         >         >
>         >         >
>         >         > > Encoding(a)
>         >         > [1] "unknown"
>         >
>         >         Hm, here I get "UTF-8", which is my locale encoding.
>         >
>         >         I've tried a little more, and I discovered that using
>         >         a <- getURL(u, .encoding="UTF-8")
>         >         ensures that a is in the correct encoding here. I know this
>         >         is not your problem, but it might help: check whether
>         >         Encoding(a) is set to "UTF-8" or not in that case, and
>         >         whether this fixes things.
>         >
>         >         I'm not sure how htmlParse() detects the encoding when you
>         >         pass it a character vector, but it probably uses
>         >         Encoding(a), since that's the only reliable information; if
>         >         it is missing, maybe it falls back to what the contents of
>         >         the file say (maybe even before what the "encoding" argument
>         >         says), which is windows-1251, and may not be the encoding in
>         >         which getURL() saved the character vector. The question
>         >         would then be: why didn't getURL() detect and set a's
>         >         encoding correctly?
>         >
>         >
>         >         My two cents
>         >
>         >
>         >         > 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
>         >         >         On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
>         >         >         > Hello dear R-help mailing list.
>         >         >         >
>         >         >         >
>         >         >         > Looks like the same issue in Russian:
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         > library(RCurl)
>         >         >         > library(XML)
>         >         >         > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
>         >         >         > a = getURL(u)
>         >         >         > a # Here - the Russian is fine.
>         >         >         > a2 <- htmlParse(a)
>         >         >         > a2 # Here it is a mess...
>         >         >         >
>         >         >         > None of these seem to fix it:
>         >         >         >
>         >         >         > htmlParse(a, encoding = "windows-1251")
>         >         >         > htmlParse(a, encoding = "CP1251")
>         >         >         > htmlParse(a, encoding = "cp1251")
>         >         >         > htmlParse(a, encoding = "iso8859-5")
>         >         >         >
>         >         >         > This is my locale:
>         >         >         >
>         >         >         > Sys.getlocale()
>         >         >         > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >         > Any suggestions?
>         >         >
>         >         >         What does Encoding(a) say?
>         >         >
>         >         >
>         >         >         (FWIW, here on Linux even a is not in the correct encoding:
>         >         >         <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>         >         >         "http://www.w3.org/TR/REC-html40/loose.dtd">
>         >         >         <html><head>
>         >         >         <title>Êóïèòü îäíîêîìíà òí
>         ГіГѕ ГЄГўГ
>         >         ðòèð
>         >         >         Гі Гў ГЊГ®Г
>         >         >         ±ГЄГўГҐ В— 11430 îáúÿâëåíèé Г®
>         ïðîäГ
>         >         æå îäí
>         >         >         îêîìí
>         >         >         à òíûõ êâà ðòèð</title>
>         >         >         [...])
>         >         >
>         >         >
>         >         >         Regards
>         >         >
>         >         >
>         >         >         > Thank you very much in advance,
>         >         >         >
>         >         >         >     Lavrentiy Eskin
>         >         >
>         >         >         >  <http://www.eng.nvg.ru>
>         >         >         >