I had already tried iconv in various ways: same issue, and the result comes
back with encoding = unknown.
Now I have tried sub as well: same issue.

2013/2/21 Milan Bouchet-Valat

> On Thursday, 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> > Hi Milan,
> >
> > a <- getURL(con, .encoding = "UTF-8")
> > Encoding(a)
> > [1] "UTF-8"
> > a  # Here the UTF-8 text looks fine.
> > htmlParse(a, encoding = "UTF-8")  # again the same encoding issue
> And what if you try this:
> a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
>
> or this:
> a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
>
> Cheers
>
> > >> why didn't getURL() detect and set a's encoding correctly?
> > I think it is an issue with the page, because other sites work fine.
> >
> > 2013/2/21 Milan Bouchet-Valat
> > On Thursday, 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > > Hi Milan!
> > >
> > > > Encoding(a)
> > > [1] "unknown"
> >
> > Hm, here I get "UTF-8", which is my locale encoding.
> >
> > I've tried a little more, and I discovered that using
> > a <- getURL(u, .encoding="UTF-8")
> > ensures that a is in the correct encoding here. I know this is not your
> > problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> > or not in that case, and whether this fixes things.
> >
> > I'm not sure how htmlParse() detects the encoding when you pass it a
> > character vector, but it probably uses Encoding(a), since that's the
> > only reliable information; if it is missing, maybe it falls back to what
> > the contents of the file say (maybe even before what the "encoding"
> > argument says), which is windows-1251, and may not be the encoding in
> > which getURL() saved the character vector. The question would then be:
> > why didn't getURL() detect and set a's encoding correctly?
> >
> > My two cents
> >
> > > 2013/2/21 Milan Bouchet-Valat
> > > On Thursday, 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > > Hello dear R-help mailing list.
> > > >
> > > > Looks like the same issue in Russian:
> > > >
> > > > library(RCurl)
> > > > library(XML)
> > > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > > > a = getURL(u)
> > > > a  # Here - the Russian is fine.
> > > > a2 <- htmlParse(a)
> > > > a2  # Here it is a mess...
> > > >
> > > > None of these seem to fix it:
> > > >
> > > > htmlParse(a, encoding = "windows-1251")
> > > > htmlParse(a, encoding = "CP1251")
> > > > htmlParse(a, encoding = "cp1251")
> > > > htmlParse(a, encoding = "iso8859-5")
> > > >
> > > > This is my locale:
> > > >
> > > > Sys.getlocale()
> > > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > > >
> > > > Any suggestions?
> > >
> > > What does Encoding(a) say?
> > >
> > > (FWIW, here on Linux even a is not in the correct encoding:
> > > Transitional//EN"
> > > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > > ГЉГіГЇГЁГІГј îäíîêîìíà òíóþ ГЄГўГ ðòèð Гі Гў Ìîà ±ГЄГўГҐ В— 11430
> > > îáúÿâëåíèé Г® ïðîäà æå îäí îêîìí à òíûõ êâà ðòèð
> > > [...])
> > >
> > > Regards
> > >
> > > > Thank you very much in advance,
> > > >
> > > > Lavrentiy Eskin
> > > >
> > > > [[alternative HTML version deleted]]
> > > >
> > > > ______________________________________________
> > > > R-help@r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
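
The kind of mismatch discussed in this thread, where the bytes of a string are in one encoding while they are read as another, can be reproduced in base R without RCurl or XML. What follows is only a minimal sketch under stated assumptions: the word and its bytes are made up for illustration (they are not fetched from the page in the thread), and it assumes the platform's iconv knows the name "CP1251". It says nothing about what htmlParse() or getURL() do internally.

```r
# The Russian word for "buy", written with \u escapes so the script does
# not depend on the encoding of the source file.
expected <- "\u041a\u0443\u043f\u0438\u0442\u044c"

# The same word as windows-1251 bytes: a hypothetical stand-in for what a
# windows-1251 page would deliver.
bytes <- as.raw(c(0xCA, 0xF3, 0xEF, 0xE8, 0xF2, 0xFC))
s <- rawToChar(bytes)

# No encoding is declared on the string at this point.
print(Encoding(s))

# Reading the bytes as latin1 yields accented Latin letters (mojibake).
wrong <- iconv(s, from = "latin1", to = "UTF-8")

# Reading them as CP1251 recovers the Cyrillic text, marked as UTF-8.
fixed <- iconv(s, from = "CP1251", to = "UTF-8")

print(wrong)
print(fixed)
print(Encoding(fixed))
```

The point of the sketch is that iconv() only helps when its "from" argument names the encoding the bytes are really in; converting from the wrong source encoding still yields a valid string, just with the wrong characters.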