[R] import HTML tables
Duncan Temple Lang
duncan at wald.ucdavis.edu
Wed May 13 15:55:24 CEST 2009
Dieter Menne wrote:
>
> Dimitri Szerman-2 wrote:
>> Hello,
>> I was wondering if there is a function in R that imports tables directly
>> from a HTML document.
>>
>
> The XML package can do this:
>
> http://markmail.org/message/cyicoa3htme4gei2
>
> Duncan Temple Lang:
>
> The htmlParse() and htmlTreeParse() functions in the XML package use the
> non-strict HTML parser in libxml2 and so the HTML document can be malformed.
Indeed. Thanks Dieter.
htmlParse() reads the document, and getNodeSet() lets us
easily find the table or tables of interest with an XPath expression.
We can also find the th and td entries easily using XPath.
The less automated part is how to meaningfully process the content.
That is where a human should be involved, deciding whether to trim
white space, how to convert text to values, and how to deal with missing cells.
We can do a lot by default, but ...
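As a rough sketch of that workflow (the HTML fragment, the column names
and the cell_text() helper are all made up for illustration; the
decisions about trimming, conversion and empty cells are just one
possible choice):

```r
library(XML)

# Hypothetical HTML fragment, used only for illustration.
html <- paste(
  "<html><body><table>",
  "<tr><th>name</th><th>value</th></tr>",
  "<tr><td> a </td><td>1.5</td></tr>",
  "<tr><td> b </td><td></td></tr>",
  "</table></body></html>",
  sep = ""
)

# Parse with the non-strict HTML parser.
doc <- htmlParse(html, asText = TRUE)

# Find the table(s) of interest with XPath and take the first one.
tables <- getNodeSet(doc, "//table")
tbl <- tables[[1]]

# Hypothetical helper: extract a cell's text, trim white space,
# and treat an empty cell as missing.
cell_text <- function(node) {
  txt <- xmlValue(node)
  if (length(txt) == 0) return(NA_character_)
  txt <- gsub("^[[:space:]]+|[[:space:]]+$", "", txt)
  if (txt == "") NA_character_ else txt
}

# Header cells (th) and data cells (td), again via XPath.
header <- xpathSApply(tbl, ".//th", cell_text)
rows <- getNodeSet(tbl, ".//tr[td]")
cells <- t(sapply(rows, function(r) xpathSApply(r, "./td", cell_text)))

# Converting text to values is the part a human has to decide on.
df <- data.frame(cells[, 1], as.numeric(cells[, 2]),
                 stringsAsFactors = FALSE)
names(df) <- header
print(df)
```

Here the second column is assumed to be numeric and an empty cell is
assumed to mean a missing value; a real page may need different choices.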
There is a relatively simple function at
http://www.omegahat.org/ParseXML/readHTMLTable.R
that provides something resembling read.table.
It is not well tested because, in the past, I have just used XPath
directly: once you know XPath, extracting content from HTML/XML is
very straightforward.
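For comparison, a minimal sketch of the read.table-like route. It
assumes a readHTMLTable() function is available, either by sourcing the
file mentioned above or from a release of the XML package that ships
one, and that it accepts a parsed document and a header argument; the
HTML fragment is again invented for illustration:

```r
library(XML)

# Hypothetical two-column table, used only for illustration.
html <- paste(
  "<html><body><table>",
  "<tr><th>name</th><th>value</th></tr>",
  "<tr><td>a</td><td>1.5</td></tr>",
  "<tr><td>b</td><td>2.5</td></tr>",
  "</table></body></html>",
  sep = ""
)

doc <- htmlParse(html, asText = TRUE)

# Assumption: readHTMLTable() returns a list with one data frame
# per table found in the document.
tabs <- readHTMLTable(doc, header = TRUE)
tab <- tabs[[1]]
str(tab)
```

The defaults may not convert columns the way you want, so it is worth
checking the result with str() before relying on it.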
D.
>
>
> Dieter