[R] Chinese characters in HTML source captured by download.file() are garbled; how to convert them to readable text

Henrik Bengtsson hb at biostat.ucsf.edu
Mon Jul 29 18:03:52 CEST 2013


Try adding mode="wb" to download.file() so that the file is written in
binary mode, or just use downloadFile() from the R.utils package.
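
A rough sketch of the first suggestion follows. The helper name
fetch_lines() and its 'from' argument are made up for illustration; they
are not part of base R or R.utils. It assumes the page's encoding is known
(UTF-8 here, as in the readLines() call quoted below). If the page is in
another encoding, pass its name via 'from'; the names iconv() accepts
depend on the platform.

```r
# Hypothetical helper: download a page in binary mode, then convert the
# lines from the page's encoding to UTF-8.
fetch_lines <- function(url, destfile = tempfile(fileext = ".html"),
                        from = "UTF-8") {
  # mode = "wb" writes the bytes exactly as received, with no
  # text-mode translation on the way to disk.
  download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)

  # Read the lines as-is, then convert them explicitly from the
  # (assumed) source encoding to UTF-8.
  lines <- readLines(destfile, warn = FALSE)
  iconv(lines, from = from, to = "UTF-8")
}

# Usage (needs network access, so left commented out):
# test <- fetch_lines(
#   "https://www.google.com.hk/finance/company_news?q=SHA:601857&gl=cn&num=200")
```

If the text is still garbled after this, the 'from' value is the first
thing to check, since the download itself no longer alters the bytes.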

/Henrik

On Sun, Jul 28, 2013 at 8:32 PM, Yong Wang <wangyong1 at gmail.com> wrote:
> Dear list,
> I am using R to download a large number of HTML pages, from which data
> will be extracted and processed further.
> The problem is that the Chinese characters in the HTML source are all
> garbled, and I cannot find a way to convert them into something readable.
> The problem persists on Ubuntu 10 and Windows 7, both in an English
> environment. I have not yet tried an operating system set to Chinese.
> I know next to nothing about encodings, and an extensive online search
> has not resolved the problem.
>
> # the code
> download.file("https://www.google.com.hk/finance/company_news?q=SHA:601857&gl=cn&num=200", destfile = "tmp.txt")
> test <- readLines("tmp.txt", encoding = "UTF-8")
>
>     #the garbled code in "tmp.txt" and "test" is like below
>     #��国�۪o�ѵM�a�ѥ��������q�]�
>
>
> Any help is highly appreciated.
>
> yong