[R] how to read a website with Chinese Character

Hui Du Hui.Du at dataventures.com
Thu Jan 24 04:56:54 CET 2013


Thanks a lot. 

y <- iconv(x, "gb2312", "utf-8") does not work but

y <- iconv(x, "gb2312", "UTF8") works on my machine. Thank you for pointing me in the right direction.
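
The encoding names that iconv() accepts vary between platforms, which would explain why "utf-8" fails here while "UTF8" works. Below is a minimal sketch in base R of trying several spellings in turn. pick_encoding is a hypothetical helper name (not part of R), and it assumes that "latin1" is accepted as a source encoding on the machine in question.

```r
# Hypothetical helper: return the first candidate target-encoding name that
# the local iconv() accepts, or NA if none of them is usable.
pick_encoding <- function(candidates) {
  for (nm in candidates) {
    # iconv() signals an error for an unsupported conversion; treat that as NA.
    res <- tryCatch(
      iconv("a", from = "latin1", to = nm),
      error = function(e) NA_character_
    )
    if (!is.na(res)) return(nm)
  }
  NA_character_
}

# Try the common spellings of UTF-8 and keep whichever one works locally.
to <- pick_encoding(c("UTF-8", "UTF8", "utf-8"))
# With x read from the site as in the original post:
# y <- iconv(x, "gb2312", to)
```

iconvlist() reports the names the local build claims to know, but probing with a real conversion as above avoids relying on that list being complete.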


-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com] 
Sent: Wednesday, January 23, 2013 6:16 PM
To: Hui Du
Cc: r-help at r-project.org
Subject: Re: [R] how to read a website with Chinese Character

On 13-01-23 8:19 PM, Hui Du wrote:
> Hi all,
>
> I am planning to parse some information on a website which includes lots of Chinese characters. Does someone know how to read/display Chinese in R? Thanks.
>
>
> url = "http://www.teec.org.cn/html/renwujieshao/"
> x = readLines(url)

If you look at the first few lines of x you'll see this:

 > head(x)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\t\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
[2] "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
[3] "<head>"
[4] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />"

At the end of line 4 it shows "charset=gb2312".  I didn't think that was 
an encoding, but this seems to do the conversion:

y <- iconv(x, "gb2312", "utf-8")
y

(I don't know if that will display properly on your Windows machine; it 
doesn't work on mine, because I don't have the fonts installed.  But it 
does work on my Mac.)
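
The same steps (find the declared charset, then convert) can be sketched in base R as follows. detect_charset and to_utf8 are hypothetical helper names, the regular expression only covers the simple charset=... form shown above, and the target name "UTF-8" is an assumption that may need a different spelling on some platforms.

```r
# Hypothetical helper: return the charset declared in the page, or NA if none.
detect_charset <- function(lines) {
  pat <- "charset=[\"']?[A-Za-z0-9_-]+"
  # useBytes = TRUE avoids errors on lines that are not valid in the locale.
  m <- regexpr(pat, lines, ignore.case = TRUE, useBytes = TRUE)
  hits <- regmatches(lines, m)
  if (length(hits) == 0) return(NA_character_)
  sub("charset=[\"']?", "", hits[1], ignore.case = TRUE, useBytes = TRUE)
}

# Hypothetical helper: convert the lines to UTF-8 using the declared charset,
# falling back to `fallback` when the page declares none.
to_utf8 <- function(lines, fallback = "latin1", to = "UTF-8") {
  cs <- detect_charset(lines)
  if (is.na(cs)) cs <- fallback
  iconv(lines, from = cs, to = to)
}

# Usage with the page from the original post:
# x <- readLines("http://www.teec.org.cn/html/renwujieshao/")
# y <- to_utf8(x)
```

Whether y then displays correctly still depends on the locale and on the fonts installed, as noted above.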

Duncan Murdoch
>
> I tried encoding = 'UTF-8' already but it didn't help.
>
> My R version is
> $platform
> [1] "i386-pc-mingw32"
>
> $arch
> [1] "i386"
>
> $os
> [1] "mingw32"
>
> $system
> [1] "i386, mingw32"
>
> $status
> [1] ""
>
> $major
> [1] "2"
>
> $minor
> [1] "15.0"
>
>
> HXD


