[R] Extracting text from html code using the RCurl package.

Martin Morgan mtmorgan at fhcrc.org
Tue Oct 7 17:57:19 CEST 2008


Hi Tony --

Tony Breyal <tony.breyal at googlemail.com> writes:

> Dear R-help,
>
> I want to download the text from a web page, but what I end up
> with is the HTML code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this? This is the
> code I am using:
>
>> library(RCurl)
>> 
>> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE)
>> print(html.file)
>
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it:
>
>> library(XML)
>> htmlTreeParse(html.file)
>
> Many thanks for any help you can provide,

Sounds like you're on the right track. One way is to parse the html
file into its 'internal' representation, and then use xpathApply to
extract relevant information (e.g., the third 'p' (paragraph) element
within its parent, from the mark-up):

> html = htmlTreeParse(getURL(my.url), useInternalNodes=TRUE)
Opening and ending tag mismatch: td and font
Unexpected end tag : p
Unexpected end tag : form
> xpathApply(html, "//p[3]", xmlValue)
[[1]]
[1] "You can subscribe to the list, or change your existing\r\n\t    subscription, in the sections below.\r\n\t"
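
The same two calls can be tried without downloading anything. This is
only a rough sketch: the HTML string and the variable names (txt, doc,
third, all_p) are made up for illustration, and asText=TRUE is used so
the parser treats the string as content rather than a file name.

```r
library(XML)

# Made-up page standing in for the real one, so no download is needed
txt <- "<html><body>
  <div><p>one</p><p>two</p><p>three</p></div>
  <div><p>four</p></div>
</body></html>"

# asText = TRUE: 'txt' holds the HTML itself, not a file name or URL
doc <- htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# every 'p' that is the third 'p' child of its parent element
third <- xpathSApply(doc, "//p[3]", xmlValue)

# every 'p' in the document, in document order
all_p <- xpathSApply(doc, "//p", xmlValue)
```

xpathSApply is the simplifying variant of xpathApply, so the results
come back as character vectors rather than lists.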

The 'xpath' is the path from the root of the document through various
nested tags to tags of the specified type. "//p" says 'start at the
root (the first '/') and look in all sub-nodes (that is what '//'
means) for a 'p' tag'; the "[3]" then keeps a 'p' only if it is the
third 'p' child of its parent. ?xpathApply is a good starting place,
as is http://www.w3.org/TR/xpath, especially

http://www.w3.org/TR/xpath#path-abbrev 
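
If what you want is all of the readable text rather than one
paragraph, one possible approach is to select the text nodes under
'body' and paste them together. This is a sketch under assumptions,
not the only way: the HTML string and the names txt, doc, pieces and
page.text are invented for the example, and skipping 'script' and
'style' content is my guess at what counts as "text" for you.

```r
library(XML)

# Made-up page; in practice this string would come from getURL(my.url)
txt <- "<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p>
<script>var x = 1;</script><p>Second paragraph.</p></body></html>"

doc <- htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# text nodes under <body>, skipping anything inside <script> or <style>
pieces <- xpathSApply(doc,
    "//body//text()[not(ancestor::script)][not(ancestor::style)]",
    xmlValue)

# drop whitespace-only pieces, join the rest, and tidy runs of whitespace
pieces <- pieces[grepl("[^[:space:]]", pieces)]
page.text <- gsub("[[:space:]]+", " ", paste(pieces, collapse = " "))
```

Changing "//body" to a narrower path (a particular 'div' or 'td', say)
restricts the text to one part of the page.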

Martin

> Tony Breyal
>
>
>> sessionInfo()
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] XML_1.94-0  RCurl_0.9-4

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


