[R] Extracting text from html code using the RCurl package.
Tony Breyal
tony.breyal at googlemail.com
Mon Oct 6 17:45:55 CEST 2008
Dear R-help,
I want to download the text of a web page, but what I end up with is
the raw HTML source. Is there an option in the RCurl package that I am
missing, or is there another way to achieve this? This is the code I
am using:
> library(RCurl)
> my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
> print(html.file)
I thought perhaps the htmlTreeParse() function from the XML package
might help, but I just don't know what to do next with it:
> library(XML)
> htmlTreeParse(html.file)
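For reference, here is a minimal sketch of one direction that might work: parse the HTML into an internal document and pull out the text nodes with XPath. It is not a confirmed solution. The inline HTML string and the names html.text, doc, txt and plain.text are made up for illustration and stand in for the string returned by getURI(). It also assumes the installed version of the XML package supports useInternalNodes = TRUE and xpathSApply().

```r
library(XML)

# Illustrative stand-in for the HTML string returned by getURI() (assumption).
html.text <- paste0(
  "<html><head><title>Demo</title><script>var x = 1;</script></head>",
  "<body><h1>R-help</h1><p>Main R mailing list.</p></body></html>"
)

# asText = TRUE marks the argument as HTML content rather than a file name;
# useInternalNodes = TRUE returns a document that XPath queries can run on.
doc <- htmlTreeParse(html.text, asText = TRUE, useInternalNodes = TRUE)

# Collect every text node except those inside <script> or <style> elements.
txt <- xpathSApply(
  doc,
  "//text()[not(ancestor::script)][not(ancestor::style)]",
  xmlValue
)

# Drop whitespace-only pieces and join the rest into a single string.
txt <- txt[nzchar(gsub("[[:space:]]+", "", txt))]
plain.text <- paste(txt, collapse = "\n")
cat(plain.text, "\n")
```

With a real page, html.file from the getURI() call above would take the place of html.text. The XPath filter is only a rough way to skip script and style content, and it does nothing about layout or navigation text.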
Many thanks for any help you can provide,
Tony Breyal
> sessionInfo()
R version 2.7.2 (2008-08-25)
i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_1.94-0 RCurl_0.9-4