[R] Extracting a data.frame from HTML code

Paul Smith phhs80 at gmail.com
Sun Apr 13 10:29:37 CEST 2008


On Sun, Apr 13, 2008 at 12:37 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Ethan --
>
>  Use the XML library
>
>  > library(XML)
>  > url <- 'http://www.nascar.com/races/cup/2007/1/data/standings_official.html'
>  > xml <- htmlTreeParse(url, useInternal=TRUE)
>
>  The previous line retrieves the HTML and stores it in an internal
>  representation. There are warnings, but I think these are about
>  ill-formed HTML at nascar.com.
>
>  A little looking suggests that the data you're after are table data
>  (element 'td') inside table rows ('tr') inside a 'tbody' element. A
>  little bit more looking shows that there's a blank line in the table,
>  at unlucky row 13, I guess.
>
>  So what we'd like to do is to extract all 'td' elements from all the
>  rows but unlucky 13. We do this with an 'xpath' query, which specifies
>  the path, from the root of the document through the relevant nodes, to
>  the data that we want. Here's the query and data extraction:
>
>  > q <- "//tbody/tr[position()!=13]/td"
>  > dat <- unlist(xpathApply(xml, q, xmlValue))
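>
>  As a quick sanity check -- the number of values extracted should be
>  a multiple of the column count, and the first few should look like
>  table cells:
>
>  > length(dat)
>  > head(dat)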
>
>  The '//tbody' says 'find any tbody node somewhere below the current
>  (i.e., root, at this point in the query) node', '/' says 'immediately
>  below the current node', and the [position()!=13] predicate gives us
>  some basic logical testing to subset the nodes we're after. xmlValue
>  extracts the 'value' (text content, roughly) of the nodes that we've
>  described the path to. This is a nice weekend hack, relying on the
>  overall structure of the table and assuming, for instance, that there
>  is only one tbody on the page. We'd have to work harder during the
>  week.
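>
>  One quick way to test that single-tbody assumption is with
>  getNodeSet, from the same XML package; if the count comes back as
>  anything other than 1, the query would need a more specific anchor:
>
>  > length(getNodeSet(xml, "//tbody"))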
>
>  And then some R to make it into a data frame:
>
>  > df <- as.data.frame(t(matrix(dat, 11)))
>
>  (11 because we've counted how many columns there are in the table; we
>  could have discovered this from the document, e.g.,
>  "count(//tbody/tr[1]/td)" as the xpath). The columns are all
>  character, whereas you'd like some to be numeric.
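>
>  Putting those refinements together, one sketch -- counting the
>  columns with getNodeSet and letting type.convert guess the column
>  types:
>
>  > ncols <- length(getNodeSet(xml, "//tbody/tr[1]/td"))
>  > df <- as.data.frame(matrix(dat, ncol=ncols, byrow=TRUE),
>  +                     stringsAsFactors=FALSE)
>  > df[] <- lapply(df, type.convert, as.is=TRUE)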
>
>  The page at http://www.w3.org/TR/xpath is very helpful for xpath,
>  especially section 2.5.
>
>  Hope that helps,
>
>  Martin
>
>
>
>  "Ethan Pew" <ethanpew+rlist at gmail.com> writes:
>
>  > Dear all,
>  >
>  > I'd like to use R to read in data from the web. I need some help finding an
>  > efficient way to strip the HTML tags and reformat the data as a data.frame
>  > to analyze in R.
>  >
>  > I'm currently using readLines() to read in the HTML code and then grep() to
>  > isolate the block of HTML code I want from each page, but this may not be
>  > the best approach.
>  >
>  > A short example:
>  > x1 <- readLines("http://www.nascar.com/races/cup/2007/1/data/standings_official.html", n=-1)
>  >
>  > grep1 <- grep("<table",x1,value=FALSE)
>  > grep2 <- grep("</table>",x1,value=FALSE)
>  >
>  > block1 <- x1[grep1:grep2]
>  >
>  >
>  > It seems like there should be a straightforward solution to extract a
>  > data.frame from the HTML code (especially since the data is already
>  > formatted as a table) but I haven't had any luck in my searches so far.
>  > Ultimately I'd like to compile several datasets from multiple webpages and
>  > websites, and I'm optimistic that I can use R to automate the process.  If
>  > someone could point me in the right direction, that would be fantastic.

Perhaps the best way would be to copy the table into a spreadsheet and
then save it as a text file. Afterwards, to read the data into R, use
read.table().
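
For example, if the table were saved from the spreadsheet as
tab-separated text (the file name here is just a placeholder):

standings <- read.table("standings.txt", header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)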

Paul


