[R] Getting htmlParse to work with Hebrew? (on windows)

Milan Bouchet-Valat nalimilan at club.fr
Thu Feb 21 11:08:24 CET 2013


On Thursday, 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> Hello dear R-help mailing list.
>
> Looks like the same issue in Russian:
>
> library(RCurl)
> library(XML)
>
> u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> a = getURL(u)
> a # Here - the Russian is fine.
>
> a2 <- htmlParse(a)
> a2 # Here it is a mess...
>
> None of these seem to fix it:
>
> htmlParse(a, encoding = "windows-1251")
> htmlParse(a, encoding = "CP1251")
> htmlParse(a, encoding = "cp1251")
> htmlParse(a, encoding = "iso8859-5")
>
> This is my locale:
>
> Sys.getlocale()
> "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
>
> Any suggestions?
What does Encoding(a) say?
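
A minimal sketch of what that check involves, for anyone following along. The byte vector is a made-up stand-in for the page body (it is not taken from the site), and treating the bytes as CP1251 is an assumption based on the locale quoted above:

```r
# Hypothetical stand-in for the getURL() result: the word "Привет"
# stored as raw CP1251 (windows-1251) bytes.
x <- rawToChar(as.raw(c(0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2)))

# Encoding() reports the mark R keeps alongside the string, not the
# bytes themselves; a string built from raw bytes carries no mark.
enc_before <- Encoding(x)

# Assumption: the bytes really are CP1251. Convert explicitly to UTF-8;
# iconv() marks its result when the target is UTF-8.
y <- iconv(x, from = "CP1251", to = "UTF-8")
enc_after <- Encoding(y)

print(c(before = enc_before, after = enc_after))
```

If the real a turns out to be unmarked CP1251 as well, the converted string could then be handed on with something like htmlParse(y, asText = TRUE, encoding = "UTF-8"). Whether that cures the output on Windows is not verified here.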


(FWIW, here on Linux even a itself does not come back in the correct encoding:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head>
<title>ГЉГіГЇГЁГІГј îäíîêîìíà òíóþ êâà ðòèðó Гў ГЊГ®Г
±ГЄГўГҐ В— 11430 îáúÿâëåíèé Г® ïðîäà æå îäíîêîìí
à òíûõ êâà ðòèð</title>
[...])


Regards


> Thank you very much in advance,
> 
>     Lavrentiy Eskin
>  <http://www.eng.nvg.ru>


