[R] import HTML tables

Duncan Temple Lang duncan at wald.ucdavis.edu
Wed May 13 15:55:24 CEST 2009


Dieter Menne wrote:
> 
> Dimitri Szerman-2 wrote:
>> Hello,
>> I was wondering if there is a function in R that imports tables directly
>> from a HTML document.
>>
> 
> The XML package can do this:
> 
> http://markmail.org/message/cyicoa3htme4gei2
> 
> Duncan Temple Lang:
> 
> The htmlParse() and htmlTreeParse() functions in the XML package use the
> non-strict HTML parser in libxml2 and so the HTML document can be malformed. 

Indeed. Thanks Dieter.

htmlParse() reads the document, and getNodeSet() lets us
easily find the table or tables of interest.
The th and td entries can be found just as easily with XPath.
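
As a minimal sketch of those steps (the inline HTML and the variable
names are made up for illustration; a real call would pass a file name
or URL to htmlParse() instead):

```r
library(XML)

# A small inline document stands in for a real page (hypothetical content).
html <- '<html><body>
<table id="scores">
<tr><th>name</th><th>score</th></tr>
<tr><td>alice</td><td>10</td></tr>
<tr><td>bob</td><td>12</td></tr>
</table>
</body></html>'

# asText = TRUE tells htmlParse() the string is the content, not a file name.
doc <- htmlParse(html, asText = TRUE)

# All <table> nodes anywhere in the document.
tables <- getNodeSet(doc, "//table")

# Header and data cells of the first table. The leading "." keeps the
# XPath query relative to that table rather than the whole document.
headers <- xpathSApply(tables[[1]], ".//th", xmlValue)
cells   <- xpathSApply(tables[[1]], ".//td", xmlValue)
```

Here cells comes back as one flat character vector in document order,
so reshaping it into rows and columns is still left to the caller.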

The less automated part is how to process the content meaningfully.
That is where a human should be involved: deciding whether to trim
white space, how to convert text to values, and how to deal with
missing cells. We can do a lot by default, but ...
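
To make those decisions concrete, here is a rough sketch of one set of
choices. Everything in it is an assumption for illustration: the sample
table, the "n/a" marker for a missing value, the column types, and the
cellText() helper. It also assumes every data row has the same number
of td cells, which real pages do not guarantee.

```r
library(XML)

# Hypothetical table with padded cells and one missing value ("n/a").
html <- '<table>
<tr><th> id </th><th> value </th></tr>
<tr><td> 1 </td><td> 3.5 </td></tr>
<tr><td> 2 </td><td> n/a </td></tr>
</table>'

doc  <- htmlParse(html, asText = TRUE)

# Keep only rows that contain td cells, i.e. skip the header row.
rows <- getNodeSet(doc, "//table//tr[td]")

# Decision 1: trim leading and trailing white space from each cell.
cellText <- function(row)
    gsub("^\\s+|\\s+$", "", xpathSApply(row, "./td", xmlValue))

# One character vector per row, stacked into a character matrix.
m <- do.call(rbind, lapply(rows, cellText))

# Decision 2: treat empty cells and the "n/a" marker as missing.
m[m %in% c("", "n/a")] <- NA

# Decision 3: convert each column of text to the type we expect.
d <- data.frame(id = as.integer(m[, 1]), value = as.numeric(m[, 2]))
```

A different page might call for other answers to each of the three
decisions, which is why they are hard to automate fully.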


There is a relatively simple function at

   http://www.omegahat.org/ParseXML/readHTMLTable.R

that provides something resembling read.table.
It is not well tested because, in the past, I have just used XPath
directly: once you know XPath, extracting content from HTML/XML is
very straightforward.
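
For comparison with the XPath route, a call might look roughly like
the sketch below. It assumes a readHTMLTable() is available in the
session; later releases of the XML package export a function of that
name, and the arguments shown (header, which, stringsAsFactors) are
those of the packaged version, which may differ from the file at the
URL above. The inline HTML is again made up.

```r
library(XML)

# Hypothetical page content with a single table.
html <- '<html><body>
<table>
<tr><th>name</th><th>score</th></tr>
<tr><td>alice</td><td>10</td></tr>
<tr><td>bob</td><td>12</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# which = 1 asks for the first table only, returned as a data frame;
# header = TRUE treats the first row as column names.
tab <- readHTMLTable(doc, which = 1, header = TRUE,
                     stringsAsFactors = FALSE)
```

The columns come back as text unless told otherwise, so the conversion
questions above still apply.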

   D.


> 
> 
> Dieter



