[R] Converting scraped data
Brian Diggs
diggsb at ohsu.edu
Wed Oct 6 22:32:24 CEST 2010
On 10/6/2010 8:52 AM, Simon Kiss wrote:
> Dear Colleagues,
> I used this code to scrape data from the URL contained within. This code
> should be reproducible.
>
> require("XML")
> library(XML)
> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
> tables <- readHTMLTable(theurl)
> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
> class(tables)
> test<-data.frame(tables, stringsAsFactors=FALSE)
> test[16,c(2:5)]
> as.numeric(test[16,c(2:5)])
> quartz()
> plot(c(1:4), test[15, c(2:5)])
>
> calling the values from the row of interest using test[16, c(2:5)] can
> bring them up as represented on the screen, plotting them or coercing
> them to numeric changes the values and in a way that doesn't make sense
> to me. My intuition is that there is something going on with the way the
> characters are coded or classed when they're scraped into R. I've looked
> around the help files for converting from character to numeric but can't
> find a solution.
>
> I also tried this:
>
> as.numeric(as.character(test[16,c(2:5)])) and that also changed the values
> from what they originally were.
>
> I'm grateful for any suggestions.
> Yours, Simon Kiss
str() gives you an indication of how things are stored and can help in
these situations.
> str(test)
'data.frame': 45 obs. of 10 variables:
$ NULL.V1 : Factor w/ 41 levels "","2006","Afghanistan/Military",..: 1
1 35 1 1 1 23 18 2 32 ...
$ NULL.V2 : Factor w/ 32 levels "","-","%","0",..: 28 1 27 30 1 1 1 1
32 3 ...
$ NULL.V3 : Factor w/ 30 levels "","-","0.2","0.4",..: 1 1 1 1 1 1 NA
NA 30 1 ...
$ NULL.V4 : Factor w/ 30 levels "","0.1","0.2",..: NA 1 NA NA 1 1 NA
NA 30 NA ...
$ NULL.V5 : Factor w/ 29 levels "","0","0.2","0.3",..: NA 1 NA NA 1 1
NA NA 29 NA ...
$ NULL.V6 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA 1 NA ...
$ NULL.V7 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
$ NULL.V8 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
$ NULL.V9 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
$ NULL.V10: Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
So columns 2-5 are factors, despite the stringsAsFactors=FALSE in the
data.frame call. That is because they were already factors in tables:
> str(tables)
List of 1
$ NULL:'data.frame': 45 obs. of 10 variables:
..$ V1 : Factor w/ 41 levels "","2006","Afghanistan/Military",..: 1 1
35 1 1 1 23 18 2 32 ...
..$ V2 : Factor w/ 32 levels "","-","%","0",..: 28 1 27 30 1 1 1 1 32
3 ...
..$ V3 : Factor w/ 30 levels "","-","0.2","0.4",..: 1 1 1 1 1 1 NA NA
30 1 ...
..$ V4 : Factor w/ 30 levels "","0.1","0.2",..: NA 1 NA NA 1 1 NA NA
30 NA ...
..$ V5 : Factor w/ 29 levels "","0","0.2","0.3",..: NA 1 NA NA 1 1 NA
NA 29 NA ...
..$ V6 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA 1 NA ...
..$ V7 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
..$ V8 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
..$ V9 : Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
..$ V10: Factor w/ 1 level "": NA 1 NA NA 1 1 NA NA NA NA ...
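To see this without the scraped page, here is a minimal sketch on a made-up stand-in for one element of tables. The object names (scraped, wrapped, as_text) and the values are assumptions for illustration only; the point is that stringsAsFactors=FALSE only applies to character vectors, so columns that are already factors pass through unchanged.

```r
# Toy stand-in for one element of `tables`; the values are made up.
scraped <- data.frame(V2 = factor(c("7.2", "-", "28.4")),
                      V3 = factor(c("9.1", "", "19.4")))

# stringsAsFactors = FALSE only affects character vectors being added;
# columns that are already factors are left alone.
wrapped <- data.frame(scraped, stringsAsFactors = FALSE)
col_classes <- sapply(wrapped, class)
print(col_classes)    # both columns are still "factor"

# Converting explicitly, column by column, gives plain character columns.
as_text <- data.frame(lapply(scraped, as.character), stringsAsFactors = FALSE)
text_classes <- sapply(as_text, class)
print(text_classes)   # both columns are now "character"
```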
So your idea that the "numbers" you see are really character
representations and not actually numbers is right. And you are almost
there with the as.numeric(as.character()) construct. That would work
for a single factor, but doesn't work for a data.frame.
> test[16,c(2:5)]
NULL.V2 NULL.V3 NULL.V4 NULL.V5
16 7.2 9.1 7.7 15.2
> as.character(test[16,c(2:5)])
[1] "25" "27" "26" "14"
You get a string representation of the underlying integer codes of each
factor, not the level labels. If you do this column-by-column, it does
work. Since data.frames are special types of lists, you can use lapply:
> test[16,c(2:5)]
NULL.V2 NULL.V3 NULL.V4 NULL.V5
16 7.2 9.1 7.7 15.2
> lapply(test[16,c(2:5)], as.character)
$NULL.V2
[1] "7.2"
$NULL.V3
[1] "9.1"
$NULL.V4
[1] "7.7"
$NULL.V5
[1] "15.2"
> as.numeric(lapply(test[16,c(2:5)], as.character))
[1] 7.2 9.1 7.7 15.2
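The difference between a factor's integer codes and its labels is easier to see on a small made-up factor. This is only a sketch; f, row and the values in them are invented for illustration and are not taken from the scraped table.

```r
# A made-up factor whose labels look like numbers.
f <- factor(c("7.2", "15.2", "9.1"))
levels(f)                                # "15.2" "7.2" "9.1" (sorted as strings)

codes  <- as.numeric(f)                  # 2 1 3: positions within levels(f)
values <- as.numeric(as.character(f))    # 7.2 15.2 9.1: the numbers you wanted

# An equivalent idiom: convert the levels once, then index by the factor.
values2 <- as.numeric(levels(f))[f]

# For a one-row data.frame of factors, convert each column separately.
row <- data.frame(a = factor("7.2"), b = factor("9.1"))
by_column <- sapply(row, function(x) as.numeric(as.character(x)))
print(by_column)
```

The second idiom converts each distinct level only once, which may matter for long factors with few levels; for a table of this size either form should do.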
That said, I'd extract the responses part of the data out, clean it all,
and then do whatever you planned with it:
responses <- test[11:42,1:5]
responses[,1] <- factor(responses[,1])
responses[,2:5] <- lapply(responses[,2:5],
  function(x) as.numeric(as.character(x)))
names(responses) <- c("Response", "Q1", "Q2", "Q3", "Q4")
> str(responses)
'data.frame': 32 obs. of 5 variables:
$ Response: Factor w/ 32 levels "Afghanistan/Military",..: 5 6 4 8 9
10 11 12 14 15 ...
$ Q1 : num 2.4 2.1 NA 5.6 2.3 7.2 1 1.8 28.4 0.6 ...
$ Q2 : num 3.3 1.6 NA 5.6 1.8 9.1 0.4 2.4 19.4 2.1 ...
$ Q3 : num 3.4 1.3 0.3 5.3 2.6 7.7 0.3 1.3 21 1.7 ...
$ Q4 : num 2.7 1.5 0.6 5.1 1.3 15.2 0.2 0.7 16.7 2 ...
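Entries such as "" and "-" cannot be converted, so as.numeric() turns them into NA and warns about it ("NAs introduced by coercion"). If those warnings are a nuisance, one option is to map the placeholders to NA first. The sketch below uses a made-up data frame, and factor_to_numeric is a hypothetical helper name, not something from the XML package or base R.

```r
# Toy stand-in for the scraped table; the values are made up.
raw <- data.frame(V1 = factor(c("Economy", "Health care", "Environment")),
                  V2 = factor(c("7.2", "-", "28.4")),
                  V3 = factor(c("9.1", "", "19.4")))

# Hypothetical helper: go through the labels (not the codes), and treat
# the placeholder strings as missing before converting to numeric.
factor_to_numeric <- function(x, missing = c("", "-")) {
  x <- as.character(x)
  x[x %in% missing] <- NA
  as.numeric(x)
}

clean <- raw
clean[, 2:3] <- lapply(clean[, 2:3], factor_to_numeric)
names(clean) <- c("Response", "Q1", "Q2")
str(clean)
```

With the real responses data frame from above, the row you were trying to plot is the sixth one (row name "16"), so something along the lines of plot(1:4, unlist(responses[6, 2:5])) should show the values you expected.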
> *********************************
> Simon J. Kiss, PhD
> Assistant Professor, Wilfrid Laurier University
> 73 George Street
> Brantford, Ontario, Canada
> N3T 2C9
> Cell: +1 519 761 7606
>
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University