[R] Extracting text from html code using the RCurl package.

Tony Breyal tony.breyal at googlemail.com
Mon Oct 6 17:45:55 CEST 2008


Dear R-help,

I want to download the text from a web page; however, what I end up
with is the HTML code. Is there some option that I am missing in the
RCurl package? Or is there another way to achieve this? This is the
code I am using:

> library(RCurl)
> my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
> print(html.file)

I thought perhaps the htmlTreeParse() function from the XML package
might help, but I just don't know what to do next with it:

> library(XML)
> htmlTreeParse(html.file)
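
One possible next step, offered as a minimal sketch rather than a definitive answer: parse the HTML into an internal document and pull out the text nodes with an XPath query. The names html.text and page.text, the inline sample page, and the XPath expression are assumptions made for illustration; in practice the html.file string returned by getURI() above would be passed in place of html.text.

```r
library(XML)

# Hypothetical stand-in for the downloaded page (html.file in the code above).
html.text <- "<html><head><title>Demo</title><script>var x = 1;</script></head><body><h1>R-help</h1><p>Main <b>discussion</b> list.</p></body></html>"

# Parse the string into an internal document so XPath queries can be run on it.
doc <- htmlTreeParse(html.text, asText = TRUE, useInternalNodes = TRUE)

# Collect the text nodes under <body>, skipping script and style contents.
txt <- xpathSApply(doc,
                   "//body//text()[not(ancestor::script)][not(ancestor::style)]",
                   xmlValue)

# Trim surrounding whitespace and drop fragments that end up empty.
txt <- gsub("^\\s+|\\s+$", "", txt)
txt <- txt[nzchar(txt)]

# Join the remaining fragments into a single string of visible text.
page.text <- paste(txt, collapse = " ")
print(page.text)
```

Joining the fragments with a single space is a simplification; a real page may need paragraph breaks or further cleanup depending on its markup.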

Many thanks for any help you can provide,
Tony Breyal


> sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] XML_1.94-0  RCurl_0.9-4


