[R] Downloading data from the internet
Bogaso
bogaso.christofer at gmail.com
Sat Sep 26 07:45:29 CEST 2009
Thanks, Duncan, for your input. However, I could not install the package
"RHTMLForms"; R reports that it is not available:
> install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")
Warning in install.packages("RHTMLForms", repos =
"http://www.omegahat.org/R") :
argument 'lib' is missing: using
'C:\Users\Arrun's\Documents/R/win-library/2.9'
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
package ‘RHTMLForms’ is not available
I found this package on the web at http://www.omegahat.org/RHTMLForms/. However, it
is distributed as a .gz file, which I could not use as I am a Windows user. Could you
please point me to an alternative source?
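
A possible workaround, offered as an untested sketch: install.packages() has a type argument, and type = "source" asks for the source tarball rather than a Windows binary. This assumes the repository carries RHTMLForms only as a .tar.gz and that the package needs no compilation; neither point is confirmed here.

```r
# Assumption: the repository offers RHTMLForms only as a source tarball
# (.tar.gz), so the default search for a Windows binary finds nothing.
repo <- "http://www.omegahat.org/R"
pkg  <- "RHTMLForms"

# Guarded so that sourcing this file non-interactively never touches the network.
if (interactive()) {
  # type = "source" requests the .tar.gz instead of a Windows binary build.
  install.packages(pkg, repos = repo, type = "source")
}
```

Whether this succeeds depends on the R version and on whether the tools needed to install source packages are present on the Windows machine.
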
Thanks,
Duncan Temple Lang wrote:
>
>
>
> Bogaso wrote:
>> Thank you so much for the help. However, I need a little more. On the site
>> "http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php",
>> if I scroll down there is an option "Historical CPI Index For USA".
>> If I then click on "Get Data", another table pops up, without
>> any significant change in the address bar. This table holds more data,
>> starting from 1999. Can you please help me get the values of this table?
>>
>
>
> Hi again
>
> Well, this is a little bit more involved, as this is an HTML form
> and so we need to be able to emulate submitting a form with
> values for the different parameters the form expects, along with
> ensuring they are correct inputs. Ordinarily, this would involve
> looking at the source of the HTML document, finding the relevant
> <form> element, getting its action attribute, and all its inputs
> and figuring out the possible inputs. This is "straightforward"
> but involved. But we have an R package that does this reasonably
> well in an automated form. This is the RHTMLForms package from the
> www.omegahat.org/R repository.
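
The manual route described above (find the <form> element, read its action attribute, list its inputs) might look roughly like the sketch below, using only the XML package. It works on a small inline HTML string rather than the real page; the form layout and the field names from_year, to_year and submit are made up for illustration and are not taken from rateinflation.com.

```r
library(XML)

# Hypothetical page with two forms; all field names are illustrative only.
html <- paste0(
  '<html><body>',
  '<form action="/search.php" method="get">',
  '<input type="text" name="q"/>',
  '</form>',
  '<form action="/cpi.php" method="post">',
  '<select name="from_year"><option>1999</option><option>2000</option></select>',
  '<select name="to_year"><option>1999</option><option>2000</option></select>',
  '<input type="submit" name="submit" value="Get Data"/>',
  '</form>',
  '</body></html>'
)

doc <- htmlParse(html, asText = TRUE)

# Locate every <form> element in the document.
forms <- getNodeSet(doc, "//form")

# The action attribute says where each form sends its values.
actions <- sapply(forms, xmlGetAttr, "action")

# Collect the names of the input and select fields inside each form.
fields <- lapply(forms, function(form) {
  xpathSApply(form, ".//input | .//select", xmlGetAttr, "name")
})

# Allowed values for one field: the text of its <option> elements.
from_choices <- xpathSApply(forms[[2]],
                            ".//select[@name='from_year']/option", xmlValue)

print(actions)
print(fields[[2]])
print(from_choices)
```

Comparing the field names across forms is one way to decide which form is the one of interest before submitting any values to it.
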
>
> We can use this with
> install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")
>
> Then
>
> library(RHTMLForms)
>
> ff =
> getHTMLFormDescription("http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php")
>
> # The form we want is the third one. We can determine this
> # from the names of the parameters.
> # So we request that this form description be turned into an R function
>
> g = createFunction(ff[[3]])
>
> # Now we call this.
> xx = g("2001", "2008")
>
>
> # This returns the content of an HTML document
> # so we parse it and then pass this to readHTMLTable()
> # This is why we have methods for
>
> library(XML)
> doc = htmlParse(xx, asText = TRUE)
> tbls = readHTMLTable(doc)
>
> # we want the last of the tables.
> tbls[[length(tbls)]]
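
The data frames that readHTMLTable() returns usually hold the cell values as text (or factors), so a conversion step is typically needed before any analysis. The sketch below shows one way to do that on a tiny inline table; the column names and the numbers are placeholders, not real CPI figures.

```r
library(XML)

# Placeholder table; the values are dummies, not actual CPI data.
html <- paste0(
  '<html><body><table>',
  '<tr><th>Year</th><th>Jan</th><th>Feb</th></tr>',
  '<tr><td>2001</td><td>100.0</td><td>100.5</td></tr>',
  '<tr><td>2002</td><td>101.0</td><td>101.5</td></tr>',
  '</table></body></html>'
)

doc  <- htmlParse(html, asText = TRUE)

# header = TRUE treats the first row as column names.
tbls <- readHTMLTable(doc, header = TRUE)

# Take the last table, as in the example above.
cpi <- tbls[[length(tbls)]]

# Convert every column from text (or factor) to numeric.
cpi[] <- lapply(cpi, function(col) as.numeric(as.character(col)))

str(cpi)
```

Going through as.character() first matters when the columns come back as factors, since as.numeric() on a factor returns the internal level codes rather than the displayed values.
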
>
>
> So hopefully that helps solve your problem and introduces another Omegahat
> package that we hope people find through Google. The RHTMLForms package is
> an approach to the poor man's Web services, HTML forms, rather than the REST
> and SOAP services that are becoming more relevant each day. The RCurl and
> SSOAP packages address the latter.
>
> D.
>
>
>
>
>
>> Thanks
>>
>>
>> Duncan Temple Lang wrote:
>>>
>>> Thanks for explaining this, Charlie.
>>>
>>> Just for completeness and to make things a little easier,
>>> the XML package has a function named readHTMLTable()
>>> and you can call it with a URL and it will attempt
>>> to read all the tables in the page.
>>>
>>> tbls =
>>> readHTMLTable('http://www.rateinflation.com/consumer-price-index/usa-cpi.php')
>>>
>>> yields a list with 10 elements, and the table of interest with the data
>>> is the 10th one.
>>>
>>> tbls[[10]]
>>>
>>> The function does the XPath voodoo and sapply() work for you and uses
>>> some heuristics. There are various controls one can specify and also
>>> various methods for working with sub-parts of the HTML document directly.
>>>
>>> D.
>>>
>>>
>>>
>>> cls59 wrote:
>>>>
>>>> Bogaso wrote:
>>>>> Hi all,
>>>>>
>>>>> I want to download data from these two different sources directly
>>>>> into R:
>>>>>
>>>>> http://www.rateinflation.com/consumer-price-index/usa-cpi.php
>>>>> http://eaindustry.nic.in/asp2/list_d.asp
>>>>>
>>>>> The first one is the CPI of the US and the second is the WPI of India. Can
>>>>> anyone please give me a clue how to download them directly into R? I want
>>>>> to turn them into zoo objects for further analysis.
>>>>>
>>>>> Thanks,
>>>>>
>>>> The following site did not load for me:
>>>>
>>>> http://eaindustry.nic.in/asp2/list_d.asp
>>>>
>>>> But I was able to extract the table from the US CPI site using Duncan
>>>> Temple Lang's XML package:
>>>>
>>>> library(XML)
>>>>
>>>>
>>>> First, download the website into R:
>>>>
>>>> html.raw <- readLines(
>>>> 'http://www.rateinflation.com/consumer-price-index/usa-cpi.php' )
>>>>
>>>> Then, convert to an HTML object using the XML package:
>>>>
>>>> html.data <- htmlTreeParse( html.raw, asText = TRUE, useInternalNodes = TRUE )
>>>>
>>>> A quick scan of the page source in the browser reveals that the table
>>>> you want is encased in a div with a class of "dynamicContent". We will
>>>> use an XPath specification[1] to retrieve all rows in that table:
>>>>
>>>> table.html <- getNodeSet( html.data,
>>>> '//div[@class="dynamicContent"]/table/tr' )
>>>>
>>>> Now, the data values can be extracted from the cells in the rows using
>>>> a little sapply and xpathSApply voodoo:
>>>>
>>>> table.data <- t( sapply( table.html, function( row ){
>>>>
>>>> row.data <- xpathSApply( row, './td', xmlValue )
>>>> return( row.data)
>>>>
>>>> }))
>>>>
>>>>
>>>> Good luck!
>>>>
>>>> -Charlie
>>>>
>>>> [1]: http://www.w3schools.com/XPath/xpath_syntax.asp
>>>>
>>>> -----
>>>> Charlie Sharpsteen
>>>> Undergraduate
>>>> Environmental Resources Engineering
>>>> Humboldt State University
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>