[R] Downloading data from the internet
Duncan Temple Lang
duncan at wald.ucdavis.edu
Sat Sep 26 17:31:50 CEST 2009
Bogaso wrote:
> Thanks, Duncan, for your input. However, I could not install the package
> "RHTMLForms"; it says it is not available:
>
>> install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")
> Warning in install.packages("RHTMLForms", repos =
> "http://www.omegahat.org/R") :
> argument 'lib' is missing: using
> 'C:\Users\Arrun's\Documents/R/win-library/2.9'
> Warning message:
> In getDependencies(pkgs, dependencies, available, lib) :
> package ‘RHTMLForms’ is not available
>
> I found this package on the net: http://www.omegahat.org/RHTMLForms/ However,
> it is a .gz file, which I could not use as I am a Windows user. Can you please
> provide an alternate source?
Hi Bogaso.
Yes, I made the package available in source form with the expectation
that people who were interested in using it would find out how to build it
for themselves.
I have made a binary version of the package available for R-2.9.*,
so install.packages() will work for you on Windows.
However, you can use the source form of the package as a Windows
user; you just have to install it. That involves finding out how to do so
(either with Uwe's Windows package-building service or by installing the tools
that Brian Ripley and Duncan Murdoch have spent time making available and easy to use).
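For reference, once those build tools are installed and on the PATH, the
source package can be installed in a single call. This is only a sketch of the
standard install.packages() mechanism; it assumes a working Rtools setup:

# Build and install the source package from the Omegahat repository
# (assumes Rtools is installed and on the PATH).
install.packages("RHTMLForms", repos = "http://www.omegahat.org/R",
                 type = "source")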
Generally (i.e. not pointing fingers at anyone in particular), I do wish Windows
users would learn how to do these things for themselves rather than putting a
further burden on the people who provide them with "free" software and "free"
advice by also expecting binary versions of easily installed packages. It does
take time for us to maintain different operating systems and to create binaries.
Running Windows and not being able to install R packages from source is a
choice, not a technical limitation.
D.
>
> Thanks,
>
>
>
> Duncan Temple Lang wrote:
>>
>>
>> Bogaso wrote:
>>> Thank you so much for that help. However, I need a little more help. On the
>>> site
>>> "http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php",
>>> if I scroll down there is an option "Historical CPI Index For USA".
>>> If I click on "Get Data", another table pops up, without any significant
>>> change in the address bar. This table holds more data, starting from 1999.
>>> Can you please help me get the values of this table?
>>>
>>
>> Hi again
>>
>> Well, this is a little bit more involved, as this is an HTML form
>> and so we need to be able to emulate submitting a form with
>> values for the different parameters the form expects, along with
>> ensuring they are correct inputs. Ordinarily, this would involve
>> looking at the source of the HTML document, finding the relevant
>> <form> element, getting its action attribute and all its inputs,
>> and figuring out the possible values for those inputs. This is "straightforward"
>> but involved. But we have an R package that does this reasonably
>> well in an automated form. This is the RHTMLForms package from the
>> www.omegahat.org/R repository.
>>
>> We can use this with
>> install.packages("RHTMLForms", repos = "http://www.omegahat.org/R")
>>
>> Then
>>
>> library(RHTMLForms)
>>
>> ff =
>> getHTMLFormDescription("http://www.rateinflation.com/consumer-price-index/usa-historical-cpi.php")
>>
>> # The form we want is the third one. We can determine this
>> # from the names of the parameters.
>> # So we request that this form description be turned into an R function
>>
>> g = createFunction(ff[[3]])
>>
>> # Now we call this.
>> xx = g("2001", "2008")
>>
>>
>> # This returns the content of an HTML document,
>> # so we parse it and then pass it to readHTMLTable().
>> # This is why readHTMLTable() has methods for already-parsed documents.
>>
>> library(XML)
>> doc = htmlParse(xx, asText = TRUE)
>> tbls = readHTMLTable(doc)
>>
>> # we want the last of the tables.
>> tbls[[length(tbls)]]
>>
>>
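>> # The original question asked for a zoo object. A rough sketch of one way to
>> # get there -- the layout assumed below (year in column 1, the twelve monthly
>> # figures in columns 2:13) is an assumption; inspect the table first.
>>
>> library(zoo)
>> cpi   <- tbls[[length(tbls)]]
>> years <- as.numeric(as.character(cpi[[1]]))
>> # Coerce the monthly columns to numeric, stripping any thousands separators
>> vals  <- apply(cpi[, 2:13], 1, function(x) as.numeric(gsub(",", "", x)))
>> # vals is a 12 x n matrix: flatten it month-by-month within each year
>> z <- zoo(as.vector(vals), as.yearmon(rep(years, each = 12) + (0:11)/12))
>>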
>> So hopefully that helps solve your problem and introduces another Omegahat
>> package that we hope people find through Google. The RHTMLForms package is an
>> approach to the poor man's Web services (HTML forms), rather than to REST and
>> SOAP, which are becoming more relevant each day. The RCurl and SSOAP packages
>> address the latter.
>>
>> D.
>>
>>
>>
>>
>>
>>> Thanks
>>>
>>>
>>> Duncan Temple Lang wrote:
>>>> Thanks for explaining this, Charlie.
>>>>
>>>> Just for completeness and to make things a little easier,
>>>> the XML package has a function named readHTMLTable()
>>>> and you can call it with a URL and it will attempt
>>>> to read all the tables in the page.
>>>>
>>>> tbls =
>>>> readHTMLTable('http://www.rateinflation.com/consumer-price-index/usa-cpi.php')
>>>>
>>>> yields a list with 10 elements, and the table of interest, with the data,
>>>> is the 10th one.
>>>>
>>>> tbls[[10]]
>>>>
>>>> The function does the XPath voodoo and sapply() work for you and uses some
>>>> heuristics. There are various controls one can specify, and also various
>>>> methods for working with sub-parts of the HTML document directly, as
>>>> sketched below.
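>>>>
>>>> For example, a sketch of a couple of those controls (argument names as in
>>>> the XML package; the particular choices below are just an illustration):
>>>>
>>>> library(XML)
>>>> doc <- htmlParse("http://www.rateinflation.com/consumer-price-index/usa-cpi.php")
>>>> # Read only the 10th table and treat its first row as the header
>>>> cpi <- readHTMLTable(doc, which = 10, header = TRUE)
>>>> # readHTMLTable() also has methods for individual nodes, so a single
>>>> # <table> element located by XPath can be passed directly
>>>> node <- getNodeSet(doc, "//table")[[10]]
>>>> cpi2 <- readHTMLTable(node, header = TRUE)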
>>>>
>>>> D.
>>>>
>>>>
>>>>
>>>> cls59 wrote:
>>>>> Bogaso wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I want to download data from these two different sources directly into R:
>>>>>>
>>>>>> http://www.rateinflation.com/consumer-price-index/usa-cpi.php
>>>>>> http://eaindustry.nic.in/asp2/list_d.asp
>>>>>>
>>>>>> The first one is the CPI of the US and the second one is the WPI of India.
>>>>>> Can anyone please give me a clue how to download them directly into R? I
>>>>>> want to make them zoo objects for further analysis.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>> The following site did not load for me:
>>>>>
>>>>> http://eaindustry.nic.in/asp2/list_d.asp
>>>>>
>>>>> But I was able to extract the table from the US CPI site using Duncan
>>>>> Temple Lang's XML package:
>>>>>
>>>>> library(XML)
>>>>>
>>>>>
>>>>> First, download the website into R:
>>>>>
>>>>> html.raw <- readLines(
>>>>> 'http://www.rateinflation.com/consumer-price-index/usa-cpi.php' )
>>>>>
>>>>> Then, convert to an HTML object using the XML package:
>>>>>
>>>>> html.data <- htmlTreeParse( html.raw, asText = TRUE, useInternalNodes = TRUE )
>>>>>
>>>>> A quick scan of the page source in the browser reveals that the table you
>>>>> want is encased in a div with a class of "dynamicContent" -- we will use an
>>>>> XPath specification[1] to retrieve all rows in that table:
>>>>>
>>>>> table.html <- getNodeSet( html.data,
>>>>> '//div[@class="dynamicContent"]/table/tr' )
>>>>>
>>>>> Now, the data values can be extracted from the cells in the rows using a
>>>>> little sapply() and xpathSApply() voodoo:
>>>>>
>>>>> table.data <- t( sapply( table.html, function( row ){
>>>>>
>>>>> row.data <- xpathSApply( row, './td', xmlValue )
>>>>> return( row.data)
>>>>>
>>>>> }))
>>>>>
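>>>>> The result is a character matrix. A possible final step (a sketch -- it
>>>>> assumes the first extracted row holds the column headings; check
>>>>> table.data[1, ] first) is to promote that row to names and drop it:
>>>>>
>>>>> # Turn the character matrix into a data frame with proper column names
>>>>> cpi <- as.data.frame( table.data[-1, ], stringsAsFactors = FALSE )
>>>>> names( cpi ) <- table.data[1, ]
>>>>>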
>>>>>
>>>>> Good luck!
>>>>>
>>>>> -Charlie
>>>>>
>>>>> [1]: http://www.w3schools.com/XPath/xpath_syntax.asp
>>>>>
>>>>> -----
>>>>> Charlie Sharpsteen
>>>>> Undergraduate
>>>>> Environmental Resources Engineering
>>>>> Humboldt State University
>