[R] how to read a website with Chinese Character
Hui Du
Hui.Du at dataventures.com
Thu Jan 24 04:56:54 CET 2013
Thanks a lot.
y <- iconv(x, "gb2312", "utf-8") does not work but
y <- iconv(x, "gb2312", "UTF8") works on my machine. Thank you for pointing me in the right direction.
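Since the accepted spelling of an encoding name seems to differ between platforms, one way to avoid hard-coding it is to probe a few spellings and keep the first one that iconv() accepts. This is only a sketch: pick_encoding_name() is a hypothetical helper, not part of R, and it assumes that an unsupported name shows up either as an error or as an NA result. iconvlist() lists the names a given build reports as known.

```r
# Hypothetical helper (not part of base R): return the first spelling in
# 'candidates' that iconv() accepts as a target encoding on this machine.
pick_encoding_name <- function(candidates, probe = "abc", from = "latin1") {
  for (nm in candidates) {
    # An unsupported conversion raises an error; map that to NA.
    out <- tryCatch(iconv(probe, from, nm),
                    error = function(e) NA_character_)
    if (!is.na(out)) return(nm)
  }
  NA_character_
}

# The three spellings are just the ones mentioned in this thread.
candidates <- c("utf-8", "UTF-8", "UTF8")
to_name <- pick_encoding_name(candidates)
to_name

# Names this build of R reports as known (platform dependent):
# head(iconvlist())
```

The chosen name can then be passed as the third argument, e.g. iconv(x, "gb2312", to_name).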
-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com]
Sent: Wednesday, January 23, 2013 6:16 PM
To: Hui Du
Cc: r-help at r-project.org
Subject: Re: [R] how to read a website with Chinese Character
On 13-01-23 8:19 PM, Hui Du wrote:
> Hi all,
>
> I am planning to parse some information on a website which includes lots of Chinese characters. Does someone know how to read/display Chinese in R? Thanks.
>
>
> url = "http://www.teec.org.cn/html/renwujieshao/"
> x = readLines(url)
If you look at the first few lines of x you'll see this:
> head(x)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\t\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
[2] "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
[3] "<head>"
[4] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />"
At the end of line 4 it shows "charset=gb2312". I didn't think that was
an encoding, but this seems to do the conversion:
y <- iconv(x, "gb2312", "utf-8")
y
(I don't know if that will display properly on your Windows machine; it
doesn't work on mine, because I don't have the fonts installed. But it
does work on my Mac.)
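The two steps above (find the declared charset, then convert) could be combined roughly as follows. This is a sketch under assumptions: detect_charset() is a hypothetical helper, the sample vector stands in for the result of readLines(url) so that nothing depends on the network, and whether "gb2312" is accepted as a source encoding depends on the platform's iconv.

```r
# Stand-in for readLines(url): a meta line plus one line holding the
# GB2312 bytes D6 D0 CE C4 (the characters U+4E2D U+6587).
x <- c(
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />",
  rawToChar(as.raw(c(0xd6, 0xd0, 0xce, 0xc4)))
)

# Hypothetical helper: return the value of the first charset=... found,
# or NA if no line declares one. useBytes = TRUE keeps the search from
# failing on lines that are not valid in the current locale.
detect_charset <- function(lines) {
  hit <- grep("charset=", lines, fixed = TRUE, useBytes = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub(".*charset=([A-Za-z0-9_-]+).*", "\\1", hit[1], useBytes = TRUE)
}

cs <- detect_charset(x)                 # declared encoding of the page
y  <- iconv(x, from = cs, to = "UTF-8") # re-encode every line
utf8ToInt(y[2])                         # code points of the converted text
```

Checking the code points with utf8ToInt() sidesteps the font problem: it confirms the conversion even when the console cannot draw the characters.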
Duncan Murdoch
>
> I tried encoding = 'UTF-8' already but it didn't help.
>
> My R version is
> $platform
> [1] "i386-pc-mingw32"
>
> $arch
> [1] "i386"
>
> $os
> [1] "mingw32"
>
> $system
> [1] "i386, mingw32"
>
> $status
> [1] ""
>
> $major
> [1] "2"
>
> $minor
> [1] "15.0"
>
>
> HXD
>