[R] Extracting a data.frame from HTML code
Paul Smith
phhs80 at gmail.com
Sun Apr 13 10:29:37 CEST 2008
On Sun, Apr 13, 2008 at 12:37 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Ethan --
>
> Use the XML library
>
> > library(XML)
> > url <- 'http://www.nascar.com/races/cup/2007/1/data/standings_official.html'
> > xml <- htmlTreeParse(url, useInternalNodes=TRUE)
>
> The previous line retrieves the HTML and stores it in an internal
> representation. There are warnings, but I think these are about
> ill-formed HTML at nascar.com.
>
> A little looking suggests that the data you're after are table data
> (element 'td') inside table rows ('tr') inside a 'tbody' element. A
> little bit more looking shows that there's a blank line in the table,
> at unlucky row 13, I guess.
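> (One quick way to confirm that, as a sketch -- look at row 13's text
> directly:
>
> > xpathApply(xml, "//tbody/tr[13]", xmlValue)
>
> which should come back essentially empty.)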
>
> So what we'd like to do is extract all 'td' elements from all the
> rows except unlucky 13. We do this with an XPath query, which
> specifies the path, from the root of the document through the
> relevant nodes, to the data we want. Here's the query and the data
> extraction:
>
> > q <- "//tbody/tr[position()!=13]/td"
> > dat <- unlist(xpathApply(xml, q, xmlValue))
>
> The '//tbody' says 'find any tbody node somewhere below the current
> node' (the current node being the root at this point in the query),
> '/' says 'immediately below the current node', and the predicate
> '[position()!=13]' gives us a basic logical test to subset the nodes
> we're after. xmlValue extracts the 'value' (text content, roughly) of
> the nodes the path leads to. This is a nice weekend hack, relying on
> the overall structure of the table and assuming, for instance, that
> there is only one tbody on the page. We'd have to work harder during
> the week.
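> (That single-tbody assumption is easy to check -- a sketch, counting
> the tbody nodes with getNodeSet:
>
> > length(getNodeSet(xml, "//tbody"))
>
> Anything other than 1 would call for a more specific path.)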
>
> And then some R to make it into a data frame
>
> > df <- as.data.frame(t(matrix(dat, 11)))
>
> (11 because we've counted how many columns there are in the table; we
> could have discovered this from the document, e.g., with
> "count(//tbody/tr[1]/td)" as the XPath expression). The columns are
> all character, whereas you'd like some to be numeric.
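> Along those lines, a sketch (untested) that computes the column count
> from the document instead of hard-coding 11, and lets type.convert
> turn the numeric-looking columns into numbers:
>
> > ncols <- length(getNodeSet(xml, "//tbody/tr[1]/td"))
> > df <- as.data.frame(t(matrix(dat, ncols)), stringsAsFactors=FALSE)
> > df[] <- lapply(df, type.convert, as.is=TRUE)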
>
> The page at http://www.w3.org/TR/xpath is very helpful for XPath,
> especially section 2.5.
>
> Hope that helps,
>
> Martin
>
>
>
> "Ethan Pew" <ethanpew+rlist at gmail.com> writes:
>
> > Dear all,
> >
> > I'd like to use R to read in data from the web. I need some help finding an
> > efficient way to strip the HTML tags and reformat the data as a data.frame
> > to analyze in R.
> >
> > I'm currently using readLines() to read in the HTML code and then grep() to
> > isolate the block of HTML code I want from each page, but this may not be
> > the best approach.
> >
> > A short example:
> > x1 <- readLines("http://www.nascar.com/races/cup/2007/1/data/standings_official.html", n=-1)
> >
> > grep1 <- grep("<table", x1, value=FALSE)[1]   # first opening tag
> > grep2 <- grep("</table>", x1, value=FALSE)[1] # first closing tag
> >
> > block1 <- x1[grep1:grep2]
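> >
> > (I could presumably strip the remaining tags from block1 with a
> > regex, e.g.
> >
> > gsub("<[^>]*>", "", block1)
> >
> > but getting from there to a properly typed data.frame still seems
> > clunky.)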
> >
> >
> > It seems like there should be a straightforward solution to extract a
> > data.frame from the HTML code (especially since the data is already
> > formatted as a table) but I haven't had any luck in my searches so far.
> > Ultimately I'd like to compile several datasets from multiple webpages and
> > websites, and I'm optimistic that I can use R to automate the process. If
> > someone could point me in the right direction, that would be fantastic.
Perhaps the simplest way would be to copy the table into a spreadsheet
and save it as a text file. Afterwards, read the data into R with
read.table().
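For example, a minimal sketch (the file name and tab delimiter are
assumptions -- use whatever the spreadsheet actually exports):

standings <- read.table("standings.txt", header=TRUE, sep="\t",
                        stringsAsFactors=FALSE)
str(standings)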
Paul