[R] Downloading data from internet

Duncan Temple Lang duncan at wald.ucdavis.edu
Thu Sep 24 17:56:06 CEST 2009


Thanks for explaining this, Charlie.

Just for completeness and to make things a little easier,
the XML package has a function named readHTMLTable()
and you can call it with a URL and it will attempt
to read all the tables in the page.

 tbls = readHTMLTable('http://www.rateinflation.com/consumer-price-index/usa-cpi.php')

yields a list with 10 elements, and the table of interest with the data is the 10th one.

 tbls[[10]]

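The function does the XPath voodoo and sapply() work for you and uses some
heuristics (for example, to decide whether the first row of a table is a header).
There are various controls one can specify (header, colClasses, skip.rows, which),
and there are also methods for working with sub-parts of the HTML document directly.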
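
As a rough sketch of those controls (not run against the live page), the
example below uses a small made-up HTML snippet in place of the CPI page. The
numbers in it are placeholders, not real CPI values, and the variable names
are only for illustration. It shows picking one table with which=, keeping
the cells as character strings with stringsAsFactors=, and handing a single
table node found via XPath to readHTMLTable().

```r
library(XML)

# Hypothetical stand-in for the real page: one layout table and one data
# table inside a div with class "dynamicContent". Values are placeholders.
html <- paste0(
  "<html><body>",
  "<table><tr><td>navigation</td></tr></table>",
  "<div class='dynamicContent'><table>",
  "<tr><th>Year</th><th>Jan</th><th>Feb</th></tr>",
  "<tr><td>2008</td><td>100.0</td><td>100.5</td></tr>",
  "<tr><td>2009</td><td>101.0</td><td>101.5</td></tr>",
  "</table></div>",
  "</body></html>"
)

# Parse the text into an internal document (no network access needed).
doc <- htmlParse(html, asText = TRUE)

# Read every table in the document: returns a list, one element per <table>.
tbls <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Read only the second table; 'which' selects tables by position.
cpi <- readHTMLTable(doc, which = 2, stringsAsFactors = FALSE)

# Cells arrive as character strings, so convert the value columns by hand.
cpi$Jan <- as.numeric(cpi$Jan)
cpi$Feb <- as.numeric(cpi$Feb)

# Work with a sub-part directly: find the table node via XPath and pass
# just that node to readHTMLTable().
node <- getNodeSet(doc, "//div[@class='dynamicContent']/table")[[1]]
cpi2 <- readHTMLTable(node, stringsAsFactors = FALSE)
```

For the real page, the same calls should work with the URL in place of doc,
though the position of the table of interest depends on the page's current
layout.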

  D.



cls59 wrote:
> 
> 
> Bogaso wrote:
>> Hi all,
>>
>> I want to download data from those two different sources, directly into R
>> :
>>
>> http://www.rateinflation.com/consumer-price-index/usa-cpi.php
>> http://eaindustry.nic.in/asp2/list_d.asp
>>
>> First one is CPI of US and 2nd one is WPI of India. Can anyone please give
>> any clue how to download them directly into R. I want to make them zoo
>> object for further analysis.
>>
>> Thanks,
>>
> 
> The following site did not load for me:
> 
> http://eaindustry.nic.in/asp2/list_d.asp
> 
> But I was able to extract the table from the US CPI site using Duncan Temple
> Lang's XML package:
> 
>   library(XML)
> 
> 
> First, download the website into R:
> 
>   html.raw <- readLines( 'http://www.rateinflation.com/consumer-price-index/usa-cpi.php' )
> 
> Then, convert to an HTML object using the XML package:
> 
>   html.data <- htmlTreeParse( html.raw, asText = TRUE, useInternalNodes = TRUE )
> 
> A quick scan of the page source in the browser reveals that the table you
> want is enclosed in a div with a class of "dynamicContent", so we can use an
> XPath expression[1] to retrieve all rows in that table:
> 
>   table.html <- getNodeSet( html.data, '//div[@class="dynamicContent"]/table/tr' )
> 
> Now, the data values can be extracted from the cells in the rows using a
> little sapply and xpathSApply voodoo:
> 
>   table.data <- t( sapply( table.html, function( row ){
> 
>     row.data <-  xpathSApply( row, './td', xmlValue )
>     return( row.data)
> 
>   }))
> 
> 
> Good luck!
> 
> -Charlie
>  
>   [1]:  http://www.w3schools.com/XPath/xpath_syntax.asp
> 
> -----
> Charlie Sharpsteen
> Undergraduate
> Environmental Resources Engineering
> Humboldt State University



