[R] Extracting text from html code using the RCurl package.

Gabor Grothendieck ggrothendieck at gmail.com
Tue Oct 7 18:52:04 CEST 2008


I gather you are using Windows and in that case you could
use RDCOMClient or rcom to get it via Internet Explorer, e.g.

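library(RDCOMClient)
# start Internet Explorer as a COM automation object
ie <- COMCreate("InternetExplorer.Application")
URL <- "https://stat.ethz.ch/mailman/listinfo/r-help"
ie$Navigate(URL)
# wait until IE reports that it is no longer busy loading the page
while(ie[["Busy"]]) Sys.sleep(1)
# innerText is the rendered text of the page body, without the markup
txt <- ie[["document"]][["body"]][["innerText"]]
ie$Quit()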

You may need to run this in elevated mode if you are on Vista.
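
If Internet Explorer is not an option, something along the following lines
with the XML package you already have loaded might also work. This is only a
rough sketch: html_to_text is a hypothetical helper name, the XPath
expression (all text nodes under body, skipping script and style) is my
assumption about what counts as "the text", and the inline HTML string stands
in for the html.file you obtained from RCurl.

```r
library(XML)

# Hypothetical helper: return the text nodes of an HTML string,
# one per line, ignoring anything inside <script> or <style>.
html_to_text <- function(html) {
  doc <- htmlParse(html, asText = TRUE)   # parse the string, not a file or URL
  on.exit(free(doc))                      # release the parsed document on exit
  xp <- "//body//text()[not(ancestor::script)][not(ancestor::style)]"
  pieces <- unlist(xpathSApply(doc, xp, xmlValue))
  pieces <- trimws(pieces)                # drop surrounding whitespace
  paste(pieces[nzchar(pieces)], collapse = "\n")
}

# Small stand-in for the downloaded page
html <- paste0("<html><head><title>t</title></head><body>",
               "<h1>Hello</h1><script>var x = 1;</script>",
               "<p>World</p></body></html>")
txt <- html_to_text(html)
cat(txt, "\n")
```

With your own data you would call html_to_text(html.file) instead. Unlike the
IE approach this does not run any JavaScript on the page, so text that is
generated by scripts will be missing.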

On Mon, Oct 6, 2008 at 11:45 AM, Tony Breyal <tony.breyal at googlemail.com> wrote:
> Dear R-help,
>
> I want to download the text from a web page, however what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this? This is the
> code I am using:
>
>> library(RCurl)
>> my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
>> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
>> print(html.file)
>
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it:
>
>> library(XML)
>> htmlTreeParse(html.file)
>
> Many thanks for any help you can provide,
> Tony Breyal
>
>
>> sessionInfo()
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] XML_1.94-0  RCurl_0.9-4
>


