[R] Extract Data from a Webpage
Duncan Temple Lang
duncan at wald.ucdavis.edu
Wed Dec 17 04:39:42 CET 2008
Hi Chuck.
Well, here is one way:
theURL = "http://oasasapps.oasas.state.ny.us/portal/pls/portal/oasasrep.providersearch.take_to_rpt?P1=3489&P2=11490"
doc = htmlParse(theURL, useInternalNodes = TRUE,
                error = function(...) {})  # discard any error messages
# Find the nodes in the table that are of interest.
x = xpathSApply(doc, "//table//td|//table//th", xmlValue)
Now, depending on the regularity of the page, we can do something like:
i = seq(1, by = 2, length = 3)
structure(x[i + 1], names = x[i])
And we end up with a named character vector with the fields of interest.
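The pairing step can be sketched end-to-end on a small stand-in table. The HTML snippet and its labels below are made up for illustration (the live page may differ or disappear); parsing a literal string with asText = TRUE keeps the example self-contained:

```r
library(XML)

# Hypothetical stand-in for the provider page's table
html <- '<html><body><table>
  <tr><th>Provider Name</th><td>Example Clinic</td></tr>
  <tr><th>Address</th><td>123 Main St</td></tr>
  <tr><th>Phone</th><td>555-0100</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Cells come back in document order: label, value, label, value, ...
x <- xpathSApply(doc, "//table//td|//table//th", xmlValue)

# Odd positions are labels, even positions are values; pair them up
i <- seq(1, by = 2, length = 3)
result <- structure(x[i + 1], names = x[i])
```

Here `result` is a named character vector, e.g. `result["Phone"]` gives the phone number.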
Setting useInternalNodes = TRUE is vital so that we can use XPath. The
XPath language is very convenient for navigating to subsets of the
resulting XML tree.
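For instance, an XPath predicate can look a value up by its label directly, rather than relying on cell positions. The snippet and the "Phone" label here are hypothetical, assuming labels sit in th cells next to their td values:

```r
library(XML)

# Hypothetical fragment; labels in <th>, values in the sibling <td>
html <- '<table><tr><th>Phone</th><td>555-0100</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)

# Select the <td> whose row has a <th> equal to "Phone"
phone <- xpathSApply(doc, "//tr[th = 'Phone']/td", xmlValue)
```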
D.
Chuck Cleland wrote:
> Hi All:
> I would like to extract the provider name, address, and phone number
> from multiple webpages like this:
>
> http://oasasapps.oasas.state.ny.us/portal/pls/portal/oasasrep.providersearch.take_to_rpt?P1=3489&P2=11490
>
> Based on searching R-help archives, it seems like the XML package
> might have something useful for this task. I can load the XML package
> and supply the url as an argument to htmlTreeParse(), but I don't know
> how to go from there.
>
> thanks,
>
> Chuck Cleland
>
>> sessionInfo()
> R version 2.8.0 Patched (2008-12-04 r47066)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;
> LC_MONETARY=English_United States.1252;LC_NUMERIC=C;
> LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] XML_1.98-1
>