[R] ScrapeR Unanticipated XML objects

Sparks, John James jspark4 at uic.edu
Sun Aug 1 21:05:33 CEST 2010


Dear All,

I have come across a very surprising result as I have started to learn how
to use R to pull data from the web for analysis.

I am trying to isolate the table headers for the quarterly income
statement (qtrinc) that I pulled from Google Finance.  I executed the
following commands after installing the scrapeR package.

require(scrapeR)
htmlfile<-scrape(url="http://www.google.com/finance?q=NASDAQ:MSFT&fstype=ii",headers=TRUE,parse=TRUE)

tables<-xpathSApply(htmlfile[[1]],"//table")
qtrinc<-tables[[1]]
xpathSApply(qtrinc,"//thead",xmlValue)


I receive the result:

[1] "\nIn Millions of USD (except for per share items)\n\n\n3 months
ending 2010-06-30\n\n\n3 months ending 2010-03-31\n\n\n3 months ending
2009-12-31\n\n\n3 months ending 2009-09-30\n\n\n3 months ending
2009-06-30\n\n"
[2] "\nIn Millions of USD (except for per share items)\n\n\n12 months
ending 2010-06-30\n\n\n12 months ending 2009-06-30\n\n\n12 months ending
2008-06-30\n\n\n12 months ending 2007-06-30\n\n"
[3] "\nIn Millions of USD (except for per share items)\n\n\nAs of
2010-06-30\n\n\nAs of 2010-03-31\n\n\nAs of 2009-12-31\n\n\nAs of
2009-09-30\n\n\nAs of 2009-06-30\n\n"
[4] "\nIn Millions of USD (except for per share items)\n\n\nAs of
2010-06-30\n\n\nAs of 2009-06-30\n\n\nAs of 2008-06-30\n\n\nAs of
2007-06-30\n\n"
[5] "\nIn Millions of USD (except for per share items)\n\n\n12 months
ending 2010-06-30\n\n\n9 months ending 2010-03-31\n\n\n6 months ending
2009-12-31\n\n\n3 months ending 2009-09-30\n\n"
[6] "\nIn Millions of USD (except for per share items)\n\n\n12 months
ending 2010-06-30\n\n\n12 months ending 2009-06-30\n\n\n12 months ending
2008-06-30\n\n\n12 months ending 2007-06-30\n\n"


Surprisingly, only the first of these table headers actually belongs to
qtrinc (if you print qtrinc you will see what I mean).  The result contains
the table headers for all the tables in the object htmlfile, even though I
ran the query on qtrinc alone.

Can someone please help me isolate the table headers for only the object
qtrinc?
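
One possible direction, sketched here without access to the live page: in
XPath, a path that starts with "//" is evaluated from the document root,
even when the query is run on a single node, whereas a path that starts
with ".//" is evaluated relative to that node. The example below assumes
the XML package (which scrapeR relies on) and uses a made-up two-table HTML
string in place of the Google Finance page; the variable names are
hypothetical, and the relative path is a guess at the fix, not a confirmed
one.

```r
library(XML)

# Hypothetical two-table page standing in for the real Google Finance HTML
html <- paste0(
  "<html><body>",
  "<table><thead><tr><th>Q1</th><th>Q2</th></tr></thead>",
  "<tbody><tr><td>1</td><td>2</td></tr></tbody></table>",
  "<table><thead><tr><th>FY1</th><th>FY2</th></tr></thead>",
  "<tbody><tr><td>3</td><td>4</td></tr></tbody></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

tables <- getNodeSet(doc, "//table")
first  <- tables[[1]]

# Absolute path: "//" starts at the document root, so the thead of
# every table in the document is matched, not just the one in `first`
all_heads <- xpathSApply(first, "//thead", xmlValue)

# Relative path: the leading "." starts the search at `first` itself
own_heads <- xpathSApply(first, ".//thead", xmlValue)

# One value per header cell instead of one string per thead
own_cells <- xpathSApply(first, ".//thead//th", xmlValue)

print(all_heads)
print(own_heads)
print(own_cells)
```

If this carries over to the real page, the equivalent call would be
xpathSApply(qtrinc, ".//thead", xmlValue).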

As long as I am at it, I also don't know how to remove the \n characters
from the strings that are returned.
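
A minimal sketch of one way this might be done with base R string
functions, assuming the headers arrive as strings shaped like the output
above. The sample string is a shortened copy of the first result, and the
variable names are hypothetical.

```r
# Shortened copy of the first header string from the output above
raw <- paste0(
  "\nIn Millions of USD (except for per share items)\n\n\n",
  "3 months ending 2010-06-30\n\n\n3 months ending 2010-03-31\n\n"
)

# Option 1: split on runs of newlines to get one element per header cell,
# then drop empty pieces and strip leading/trailing whitespace
parts <- strsplit(raw, "\n+")[[1]]
parts <- parts[nzchar(parts)]
parts <- gsub("^\\s+|\\s+$", "", parts)

# Option 2: collapse every run of whitespace into a single space,
# giving one cleaned string instead of a vector
flat <- gsub("^\\s+|\\s+$", "", gsub("\\s+", " ", raw))

print(parts)
print(flat)
```

The first option keeps the column labels separate, which is probably more
useful if they are going to become the names of a data frame.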

Help would be much appreciated.

--John Sparks, Ph.D.


