[R] The Future of R | API to Public Databases

Sun Jan 15 16:23:07 CET 2012

At first, I thought that RDF and SDMX were two competing standards and 
was disheartened, but it appears that there is collaboration between 
them, yay!

http://groups.google.com/group/publishing-statistical-data/browse_thread/thread/531b1b5a73397c1c?pli=1

On 01/15/2012 08:31 AM, Benjamin Weber wrote:
> Yes, R-devel would be the right mailing list for this discussion.
> As some people pointed out, the problem definition is vague. This was
> to encourage people to share their *different* perceptions about the
> problem and to get to some extent a consensus.
>
> My starting point has come from my mind, consequently I must be an
> egocentric person. I agree on that.
> There are a lot of other egocentric persons who download R and just
> want to have their result ASAP. That's reality.
> The same is given with each and every special interest group (where
> each and every member has a special interest).
> Everyone cares only about his needs. That is the systematic issue we
> have to overcome by working together to simplify everyone's individual
> situation. Finally we should reach a win-win situation for all. That
> is my notion.
>
> What I wanted to point out was more or less about the process of a
> statistical research:
>
> 1. Set up your research objective
> 2. Find the right data (time intensive)
> 3. Download the right format
> 4. Import it, make it compatible, clean it up
> 5. Work with it
> 6. Get your results
>
> The more integrative your research objective is set up, the more time
> you spend on points 1 to 3. And points 1 to 3 make up most of the time
> in most cases. Some people will resign due to lack of time or just due
> to lack of accessibility of data.
>
> I highly appreciate that a lot of people participated in this
> discussion, that the publishers themselves address the problem nowadays
> (just take a look at [1]) and that some people are working on it in the
> R world (e.g. TSdbi).
>
> Reality is better than I initially perceived it. But it is not as it should be.
>
> Benjamin
>
>
> [1] http://sdmx.org/wp-content/uploads/2011/10/SDMX-Action-Plan-2011_2015.pdf
>
> On 15 January 2012 13:15, Prof Brian Ripley<ripley at stats.ox.ac.uk>  wrote:
>> On 14/01/2012 18:51, Joshua Wiley wrote:
>>> I have been following this thread, but there are many aspects of it
>>> which are unclear to me.  Who are the publishers?  Who are the users?
>>> What is the problem?  I have a vague sense for some of these, but it
>>> seems to me like one valuable starting place would be creating a
>>> document that clarifies everything.  It is easier to tackle a concrete
>>> problem (e.g., agree on a standard numerical representation of dates
>>> and times a la ISO 8601) than something diffuse (e.g., information
>>> overload).
>>
>> Let alone something as vague as 'the future of R' (for which the R-devel
>> list is the appropriate one).  I believe the original poster is being
>> egocentric: as someone said earlier, she has never had need of this concept,
>> and I believe that is true of the vast majority of R users.
>>
>> The development of R per se is primarily driven by the needs of the core
>> developers and those around them.  Other R communities have set up their
>> own special-interest groups and sets of packages, and that would seem the
>> way forward here.
>>
>>
>>> Good luck,
>>>
>>> Josh
>>>
>>> On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber<mail at bwe.im>    wrote:
>>>> Mike
>>>>
>>>> We see that the publishers are aware of the problem. They don't think
>>>> that the raw data is usable for the user. Consequently they are
>>>> recognizing this fact with the proprietary formats. Yes, they resign
>>>> in the face of the information overload. That's pathetic.
>>>>
>>>> It is not a question of *which* data format, it is a question about
>>>> the general concept. Where do publisher and user meet? There has to be
>>>> one *defined* point which all parties agree on. I disagree with your
>>>> statement that the publisher should just publish csv or cook his own
>>>> API. That leads to fragmentation and inaccessibility of data. We want
>>>> data to be accessible.
>>>>
>>>> A more pragmatic approach is needed to revolutionize the way we go
>>>> about raw data.
>>>>
>>>> Benjamin
>>>>
>>>> On 14 January 2012 22:17, Mike Marchywka<marchywka at hotmail.com>    wrote:
>>>>>
>>>>> LOL, I remember posting about this in the past. The US gov agencies vary
>>>>> but most are quite good. The big problem appears to be people who push
>>>>> proprietary or commercial "standards" for which only one effective source
>>>>> exists. Some formats, like Excel and PDF, come to mind and there is a
>>>>> disturbing trend towards their adoption in some places where raw data is
>>>>> needed by many. The best thing to do is contact the information provider and
>>>>> let them know you want raw data, not images or stuff that works in limited
>>>>> commercial software packages. Often data sources are valuable and the revenue
>>>>> model impacts availability.
>>>>>
>>>>> If you are just arguing over different open formats, it is usually easy
>>>>> for someone to write some conversion code and publish it. CSV to JSON would
>>>>> not be a problem, for example. Data of course are quite variable and there is
>>>>> nothing wrong with giving the provider his choice.
>>>>>
>>>>> ----------------------------------------
>>>>>> Date: Sat, 14 Jan 2012 10:21:23 -0500
>>>>>> From: jason at rampaginggeek.com
>>>>>> To: r-help at r-project.org
>>>>>> Subject: Re: [R] The Future of R | API to Public Databases
>>>>>>
>>>>>> Web services are only part of the problem. In essence, there are at
>>>>>> least two facets:
>>>>>> 1. downloading the data using some protocol
>>>>>> 2. mapping the data to a common model
>>>>>>
>>>>>> Having #1 makes the import/download easier, but it really becomes useful
>>>>>> when both are included. I think #2 is the harder problem to address.
>>>>>> Software can usually be written to handle #1 by making a useful
>>>>>> abstraction layer. #2 means that data has consistent names and meanings,
>>>>>> and this requires people to agree on common definitions and a common
>>>>>> naming convention.
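
A minimal sketch of facet #2 in base R, assuming two made-up providers whose tables use different column names for the same concepts. The provider names, the column names and the target model (country, year, value) are hypothetical and only illustrate the idea of an agreed naming convention; they are not taken from any real API.

```r
# Hypothetical per-provider mappings from source column names to the
# names of a shared model. None of these reflect a real provider.
field_map <- list(
  provider_a = c(geo = "country", time = "year", values = "value"),
  provider_b = c(LOCATION = "country", TIME = "year", Value = "value")
)

# Rename and coerce a provider-specific data frame into the shared model.
to_common_model <- function(df, provider) {
  map <- field_map[[provider]]
  if (is.null(map)) stop("no mapping defined for provider: ", provider)
  missing_cols <- setdiff(names(map), names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  out <- df[, names(map), drop = FALSE]  # keep only the mapped columns
  names(out) <- unname(map)              # apply the common names
  out$year <- as.integer(out$year)       # agree on types as well as names
  out$value <- as.numeric(out$value)
  out
}

# Two toy tables standing in for downloads from different sources.
a <- data.frame(geo = c("DE", "FR"), time = c("2010", "2010"),
                values = c("1.5", "2.5"), stringsAsFactors = FALSE)
b <- data.frame(LOCATION = "DE", TIME = 2011L, Value = 3, Flag = "e",
                stringsAsFactors = FALSE)

combined <- rbind(to_common_model(a, "provider_a"),
                  to_common_model(b, "provider_b"))
print(combined)
```

The code itself is short; the hard part, as noted above, is getting people to agree on what goes into the mapping table.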
>>>>>>
>>>>>> RDF (Resource Description Framework) and its related technologies
>>>>>> (SPARQL, OWL, etc) are one of the many attempts to try to address this.
>>>>>> While this effort would benefit R, I think it's best if it's part of a
>>>>>> larger effort.
>>>>>>
>>>>>> Services such as DBpedia and Freebase are trying to unify many data
>>>>>> sets using RDF.
>>>>>>
>>>>>> The task view and package ideas are great ideas. I'm just adding another
>>>>>> perspective.
>>>>>>
>>>>>> Jason
>>>>>>
>>>>>> On 01/13/2012 05:18 PM, Roy Mendelssohn wrote:
>>>>>>> Hi Benjamin:
>>>>>>>
>>>>>>> What would make this easier is if these sites used standardized web
>>>>>>> services, so client code would only need to be written once. data.gov
>>>>>>> is the worst example, they spun their own, weak service.
>>>>>>>
>>>>>>> There is a lot of environmental data available through OPeNDAP, and
>>>>>>> that is supported in the ncdf4 package. My own group has a service called
>>>>>>> ERDDAP that is entirely RESTful, see:
>>>>>>>
>>>>>>> http://coastwatch.pfel.noaa.gov/erddap
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> http://upwell.pfeg.noaa.gov/erddap
>>>>>>>
>>>>>>> We provide R (and matlab) scripts that automate the extract for
>>>>>>> certain cases, see:
>>>>>>>
>>>>>>> http://coastwatch.pfeg.noaa.gov/xtracto/
>>>>>>>
>>>>>>> We also have a tool called the Environmental Data Connector (EDC) that
>>>>>>> provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to
>>>>>>> subset data that is served by OPeNDAP, ERDDAP, certain Sensor Observation
>>>>>>> Service (SOS) servers, and have it read directly into R. It is freely
>>>>>>> available at:
>>>>>>>
>>>>>>> http://www.pfeg.noaa.gov/products/EDC/
>>>>>>>
>>>>>>> We can write such tools because the service is either standardized
>>>>>>> (OPeNDAP, SOS) or is easy to implement (ERDDAP).
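
To illustrate why a RESTful service is easy to script against, here is a rough base R sketch that only builds a request URL and does not contact any server. It assumes the ERDDAP tabledap URL pattern of base/tabledap/datasetID.fileType?variables&constraints. The dataset ID "exampleDataset" and the variable names are hypothetical placeholders, and erddap_url is an illustrative helper, not part of any package.

```r
# Build an ERDDAP tabledap request URL (assumed pattern:
# <base>/tabledap/<datasetID>.<fileType>?<var1,var2,...>&<constraint>...).
erddap_url <- function(base, dataset_id, variables,
                       constraints = character(0), file_type = "csv") {
  # Percent-encode unsafe characters such as ">" and "<" while leaving
  # reserved ones like "=", "," and ":" untouched.
  encode <- function(x) {
    vapply(x, utils::URLencode, character(1), reserved = FALSE,
           USE.NAMES = FALSE)
  }
  query <- paste(c(paste(variables, collapse = ","), encode(constraints)),
                 collapse = "&")
  sprintf("%s/tabledap/%s.%s?%s", base, dataset_id, file_type, query)
}

# "exampleDataset" and the variable names are placeholders, not a real dataset.
url <- erddap_url("http://upwell.pfeg.noaa.gov/erddap",
                  dataset_id = "exampleDataset",
                  variables = c("time", "latitude", "longitude"),
                  constraints = "time>=2012-01-01T00:00:00Z")
print(url)
```

With a real dataset ID, the resulting URL could be passed to read.csv() to pull the subset straight into a data frame. Because the whole request is a single URL, the same helper would work for any dataset the server exposes.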
>>>>>>>
>>>>>>> -Roy
>>>>>>>
>>>>>>>
>>>>>>> On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote:
>>>>>>>
>>>>>>>> Dear R Users -
>>>>>>>>
>>>>>>>> R is a wonderful software package. CRAN provides a variety of tools to
>>>>>>>> work on your data. But R is not apt to utilize all the public
>>>>>>>> databases in an efficient manner.
>>>>>>>> I observed the most tedious part with R is searching and downloading
>>>>>>>> the data from public databases and putting it into the right format. I
>>>>>>>> could not find a package on CRAN which offers exactly this fundamental
>>>>>>>> capability.
>>>>>>>> Imagine R is the unified interface to access (and analyze) all public
>>>>>>>> data in the easiest way possible. That would create a real impact,
>>>>>>>> would put R a big leap forward and would enable us to see the world
>>>>>>>> with different eyes.
>>>>>>>>
>>>>>>>> There is a lack of a direct connection to the API of these databases,
>>>>>>>> to name a few:
>>>>>>>>
>>>>>>>> - Eurostat
>>>>>>>> - OECD
>>>>>>>> - IMF
>>>>>>>> - Worldbank
>>>>>>>> - UN
>>>>>>>> - FAO
>>>>>>>> - data.gov
>>>>>>>> - ...
>>>>>>>>
>>>>>>>> The ease of access to the data is the key to information processing
>>>>>>>> with R.
>>>>>>>>
>>>>>>>> How can we handle the flow of information noise? R has to give an
>>>>>>>> answer to that with an extensive API to public databases.
>>>>>>>>
>>>>>>>> I would love your comments and ideas as a contribution in a vital
>>>>>>>> discussion.
>>>>>>>>
>>>>>>>> Benjamin
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide
>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>> **********************
>>>>>>> "The contents of this message do not reflect any position of the U.S.
>>>>>>> Government or NOAA."
>>>>>>> **********************
>>>>>>> Roy Mendelssohn
>>>>>>> Supervisory Operations Research Analyst
>>>>>>> NOAA/NMFS
>>>>>>> Environmental Research Division
>>>>>>> Southwest Fisheries Science Center
>>>>>>> 1352 Lighthouse Avenue
>>>>>>> Pacific Grove, CA 93950-2097
>>>>>>>
>>>>>>> e-mail: Roy.Mendelssohn at noaa.gov (Note new e-mail address)
>>>>>>> voice: (831)-648-9029
>>>>>>> fax: (831)-648-8440
>>>>>>> www: http://www.pfeg.noaa.gov/
>>>>>>>
>>>>>>> "Old age and treachery will overcome youth and skill."
>>>>>>> "From those who have been given much, much will be expected"
>>>>>>> "the arc of the moral universe is long, but it bends toward justice"
>>>>>>> -MLK Jr.
>>>>>>>
>>
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595