[R] Scraping a web page.
J Toll
jctoll at gmail.com
Tue May 15 01:18:14 CEST 2012
On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub <kw1958 at gmail.com> wrote:
> Folks,
> I want to scrape a series of web-page sources for strings like the following:
>
> "/en/Ships/A-8605507.html"
> "/en/Ships/Aalborg-8122830.html"
>
> which appear in an href inside an <a> tag inside a <div> tag inside a table.
>
> In fact all I want is the (exactly) 7-digit number before ".html".
>
> The good news is that, as far as I can tell, the <a> tag is always on its own line, so some kind of line-by-line grep should suffice once I figure out the following:
>
> What is the best package/command to use to get the source of a web page? I tried using something like:
> if(url.exists("http://www.omegahat.org/RCurl")) {
> h = basicTextGatherer()
> curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
> # Now read the text that was cumulated during the query response.
> h$value()
> }
>
> which works except that I get one long streamed html doc without the line breaks.
You could use:
h <- readLines("http://www.omegahat.org/RCurl")
-- or --
download.file(url = "http://www.omegahat.org/RCurl", destfile = "tmp.html")
h <- scan("tmp.html", what = "", sep = "\n")
and then use grep or the XML package for processing.
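For the extraction step, something along these lines should work once the page has been read line by line (a sketch: the sample vector below stands in for the downloaded page, and the regex assumes the 7-digit number always sits between a hyphen and ".html" as in the examples):

```r
# Hypothetical sample lines standing in for readLines() output
h <- c('<a href="/en/Ships/A-8605507.html">A</a>',
       '<a href="/en/Ships/Aalborg-8122830.html">Aalborg</a>',
       '<td>no link here</td>')

# Keep only the lines containing the target hrefs
hits <- grep('/en/Ships/.*\\.html', h, value = TRUE)

# Pull out exactly the 7 digits before ".html"
ids <- sub('.*-([0-9]{7})\\.html.*', '\\1', hits)
ids
# [1] "8605507" "8122830"
```

On a real page you would replace the sample vector with `h <- readLines(url)` and keep the rest unchanged.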
HTH
James
More information about the R-help mailing list