[R] Converting scraped data

Ethan Brown ethancbrown at gmail.com
Wed Oct 6 22:22:41 CEST 2010


Hi Simon,

You'll notice the "test" data.frame has a whole mix of characters in
the columns you're interested in, including a "-" for missing values, and
that those columns are in fact factors.

as.numeric(factor) returns the factor's internal integer codes (the
position of each value in levels(factor)), not the values themselves
(see ?levels and ?factor). That's why it's giving you those
irrelevant integers. I always end up using something like this handy
code snippet to deal with the situation:

unfactor <- function(factors)
# From http://psychlab2.ucr.edu/rwiki/index.php/R_Code_Snippets#unfactor
# Transform a factor back into a character vector of its level labels
{
   return(levels(factors)[factors])
}
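
To see the difference on a small example, here is a sketch using a
made-up factor (the values are invented, not taken from the scraped
table). It compares as.numeric() on the factor itself with conversion
via the level labels, which is what unfactor() does:

```r
# Made-up factor standing in for one scraped column
f <- factor(c("12", "7", "-", "30"))

codes  <- as.numeric(f)   # internal integer codes, one per element
labels <- levels(f)[f]    # level labels as character, same as unfactor(f)

# "-" is not a number, so it becomes NA; suppressWarnings() hides the
# "NAs introduced by coercion" warning as.numeric() would otherwise give
values <- suppressWarnings(as.numeric(labels))

print(codes)
print(labels)
print(values)
```

as.numeric(as.character(f)) should give the same values as the last step.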

Then, to get your data to where you want it, I'd do this:

require(XML)
theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
class(tables)
test<-data.frame(tables, stringsAsFactors=FALSE)


result <- test[11:42, 1:5] #Extract the actual data we want
names(result) <- c("Response", "Q1", "Q2","Q3","Q4")
for(i in 2:5) {
  # Convert factor columns to numeric; "-" entries become NA (with a warning)
  result[,i] <- as.numeric(unfactor(result[,i]))
}
result
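
If you'd rather avoid the loop and the coercion warnings, the same
conversion can be sketched with lapply(). This runs on a small invented
data frame rather than the scraped one, and to_numeric() is a
hypothetical helper name of my own, not something from the XML package:

```r
# Toy stand-in for the scraped table; these values are invented
result <- data.frame(
  Response = c("Health care", "Economy", "Environment"),
  Q1 = factor(c("34", "12", "-")),
  Q2 = factor(c("30", "-", "9")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: factor -> character -> numeric, mapping "-" to NA
# explicitly so as.numeric() has nothing to warn about
to_numeric <- function(x) {
  x <- as.character(x)
  x[x %in% "-"] <- NA
  as.numeric(x)
}

# Convert only the quarterly columns; Response stays character
num_cols <- c("Q1", "Q2")
result[num_cols] <- lapply(result[num_cols], to_numeric)
str(result)
```

Marking "-" as NA before converting makes the treatment of missing
values explicit instead of relying on a coercion side effect.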

From here you should be able to plot or do whatever else you want.

Hope this helps,
Ethan Brown


On Wed, Oct 6, 2010 at 9:52 AM, Simon Kiss <sjkiss at gmail.com> wrote:
> Dear Colleagues,
> I used this code to scrape data from the URL contained within.  This code
> should be reproducible.
>
> require("XML")
> library(XML)
> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
> tables <- readHTMLTable(theurl)
> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
> class(tables)
> test<-data.frame(tables, stringsAsFactors=FALSE)
> test[16,c(2:5)]
> as.numeric(test[16,c(2:5)])
> quartz()
> plot(c(1:4), test[15, c(2:5)])
>
> Calling the values from the row of interest using test[16, c(2:5)] brings
> them up as represented on the screen, but plotting them or coercing them to
> numeric changes the values in a way that doesn't make sense to me. My
> intuition is that there is something going on with the way the characters
> are coded or classed when they're scraped into R.  I've looked around the
> help files for converting from character to numeric but can't find a
> solution.
>
> I also tried this:
>
> as.numeric(as.character(test[16,c(2:5)])) and that also changed the values
> from what they originally were.
>
> I'm grateful for any suggestions.
> Yours, Simon Kiss
>
>
>


