[R] The Future of R | API to Public Databases

Sun Jan 15 03:22:19 CET 2012

Hi,

Happy to oblige:

Background: All data is "dirty" to some degree. A large portion of the
time spent on data analysis goes to data cleaning (removing invalid
data, transforming data columns into something useful). When data from
multiple sources are used, additional time must be spent making the
data sets mergeable.

Publishers: anyone who provides data in large quantities, usually 
governments and public organizations.

Users: anyone who wants to use the data. This could be journalists,
scientists, concerned citizens, other organizations, etc.

Problems:
1. Users of data have a hard time finding data, if they can find it at
all. This is the rendezvous point: there should be a common service or
place to publicize the data and allow people to find it. Data markets
such as Infochimps can help with this.

2. Data is often published using different protocols. Some data sets are
so big that the data is accessed using a custom API. Many of these
services use web services, but method names vary. This is a technical 
problem that can be worked around using libraries to translate from one 
protocol to another. A 3rd party may also help here by aggregating data 
sets. Publisher-specific libraries have been proposed to help address 
this, but I think those are also a compromise.
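
To make the "translate from one protocol to another" idea concrete, here
is a minimal sketch in Python. Everything in it is an assumption for
illustration: the DataSource interface, the two adapter classes, and the
in-memory payloads (which stand in for real downloads) are hypothetical
and do not correspond to any publisher's actual API.

```python
from abc import ABC, abstractmethod
import csv
import io
import json


class DataSource(ABC):
    """Hypothetical common interface: one adapter per publisher protocol."""

    @abstractmethod
    def fetch(self, dataset_id):
        """Return the data set as a list of dicts, one dict per row."""


class CsvSource(DataSource):
    """Adapter for a publisher that hands out CSV text."""

    def __init__(self, payloads):
        # dataset_id -> CSV text; stands in for an HTTP download
        self.payloads = payloads

    def fetch(self, dataset_id):
        reader = csv.DictReader(io.StringIO(self.payloads[dataset_id]))
        return [dict(row) for row in reader]


class JsonSource(DataSource):
    """Adapter for a publisher that wraps rows in a JSON 'records' key."""

    def __init__(self, payloads):
        # dataset_id -> JSON text; stands in for a web-service response
        self.payloads = payloads

    def fetch(self, dataset_id):
        return json.loads(self.payloads[dataset_id])["records"]


# Made-up payloads, only to show that callers see one interface.
csv_source = CsvSource({"enrollment": "year,students\n2010,100\n2011,120\n"})
json_source = JsonSource(
    {"enrollment": '{"records": [{"year": "2010", "students": "95"}]}'}
)

for source in (csv_source, json_source):
    print(type(source).__name__, source.fetch("enrollment"))
```

The calling code never needs to know which protocol sits behind a
source. Note that this only addresses the download side; the rows from
the two sources still carry whatever names and meanings each publisher
chose.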

3. Data sets rarely use common data fields/columns, and what they measure
may vary slightly. Having common names and definitions for often-used
columns allows for confidence in merging the data and more accurate 
insights may be made.
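
As a rough sketch of what agreed-upon column names buy you, the Python
below renames each publisher's columns to a shared vocabulary and then
merges on the shared keys. The publisher names, column names, and numbers
are all invented for illustration; COLUMN_MAP, to_common, and merge_on
are hypothetical helpers, not part of any existing library.

```python
# Hypothetical mapping from each publisher's column names to shared names.
COLUMN_MAP = {
    "publisher_a": {"yr": "year", "inst": "institution", "ug_fte": "undergrad_fte"},
    "publisher_b": {"Year": "year", "School": "institution", "grad_total": "grad_fte"},
}


def to_common(rows, publisher):
    """Rename known columns to the shared names; drop unmapped columns."""
    mapping = COLUMN_MAP[publisher]
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]


def merge_on(left, right, keys):
    """Inner join of two lists of dicts on the given key columns."""
    index = {tuple(r[k] for k in keys): r for r in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


# Made-up rows from two publishers describing the same institution.
a_rows = [{"yr": 2011, "inst": "State U", "ug_fte": 8200.0}]
b_rows = [
    {"Year": 2011, "School": "State U", "grad_total": 1500.0},
    {"Year": 2011, "School": "Tech U", "grad_total": 900.0},
]

merged = merge_on(
    to_common(a_rows, "publisher_a"),
    to_common(b_rows, "publisher_b"),
    keys=("year", "institution"),
)
print(merged)
```

The renaming step is the easy part. The merge is only trustworthy if
both publishers actually measure the renamed columns the same way, which
is the definitional agreement described above.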

If these issues can be solved, then a large amount of data analysts' time
can be freed up by reducing the data cleansing phase. On top of that, if 
the data can be merged in an automated way, then even laymen can do 
their own analysis. This problem is similar, if not identical, to the 
one being addressed by the "semantic web" movement.

These problems can't be solved just by using ISO-formatted dates; part
of the problem is getting people to use common meanings for the fields.

Here is an example to illustrate: Public universities publish data such 
as the number of students enrolled. This number is often broken down by 
undergraduate and graduate students, but you have to know how that is 
measured. Are post-baccalaureates counted as graduate students? Were the
students counted by head count or by full-time equivalent (FTE) (sum of
total enrolled credit hours / credit hours for a full-time student)?
Even the definition of FTE varies by university or by university system.
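
To show how much the answer depends on the definition, here is a small
worked example in Python. The credit-hour figures are made up, and the
two full-time loads (15 and 12 credit hours) are assumptions chosen only
to illustrate that different systems may use different divisors.

```python
def head_count(credit_hours):
    """Number of students enrolled in at least one credit hour."""
    return sum(1 for hours in credit_hours if hours > 0)


def fte(credit_hours, full_time_load):
    """Total enrolled credit hours divided by the full-time credit load."""
    return sum(credit_hours) / full_time_load


# Made-up enrolled credit hours for five students.
enrolled = [15, 12, 6, 3, 9]

print("head count:      ", head_count(enrolled))  # 5
print("FTE (load of 15):", fte(enrolled, 15))     # 3.0
print("FTE (load of 12):", fte(enrolled, 12))     # 3.75
```

The same five students come out as 5, 3.0, or 3.75 depending on which
measure and which full-time load the publisher uses, so two "enrollment"
columns cannot be merged safely without knowing the definition behind
each one.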

Jason

On 01/14/2012 01:51 PM, Joshua Wiley wrote:
> I have been following this thread, but there are many aspects of it
> which are unclear to me.  Who are the publishers?  Who are the users?
> What is the problem?  I have a vague sense for some of these, but it
> seems to me like one valuable starting place would be creating a
> document that clarifies everything.  It is easier to tackle a concrete
> problem (e.g., agree on a standard numerical representation of dates
> and times a la ISO 8601) than something diffuse (e.g., information
> overload).
>
> Good luck,
>
> Josh
>
> On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber<mail at bwe.im>  wrote:
>> Mike
>>
>> We see that the publishers are aware of the problem. They don't think
>> that the raw data is usable for the user. Consequently they
>> recognize this fact with the proprietary formats. Yes, they resign
>> in the information overload. That's pathetic.
>>
>> It is not a question of *which* data format, it is a question about
>> the general concept. Where do publisher and user meet? There has to be
>> one *defined* point which all parties agree on. I disagree with your
>> statement that the publisher should just publish csv or cook his own
>> API. That leads to fragmentation and inaccessibility of data. We want
>> data to be accessible.
>>
>> A more pragmatic approach is needed to revolutionize the way we go
>> about raw data.
>>
>> Benjamin
>>
>> On 14 January 2012 22:17, Mike Marchywka<marchywka at hotmail.com>  wrote:
>>>
>>> LOL, I remember posting about this in the past. The US gov agencies vary but most are quite good. The big problem appears to be people who push proprietary or commercial "standards" for which only one effective source exists. Some formats, like Excel and PDF, come to mind, and there is a disturbing trend towards their adoption in some places where raw data is needed by many. The best thing to do is contact the information provider and let them know you want raw data, not images or stuff that works in limited commercial software packages. Often data sources are valuable and the revenue model impacts availability.
>>>
>>> If you are just arguing over different open formats, it is usually easy for someone to write some conversion code and publish it; CSV to JSON would not be a problem, for example. Data of course are quite variable and there is nothing wrong with giving the provider his choice.
>>>
>>> ----------------------------------------
>>>> Date: Sat, 14 Jan 2012 10:21:23 -0500
>>>> From: jason at rampaginggeek.com
>>>> To: r-help at r-project.org
>>>> Subject: Re: [R] The Future of R | API to Public Databases
>>>>
>>>> Web services are only part of the problem. In essence, there are at
>>>> least two facets:
>>>> 1. downloading the data using some protocol
>>>> 2. mapping the data to a common model
>>>>
>>>> Having #1 makes the import/download easier, but it really becomes useful
>>>> when both are included. I think #2 is the harder problem to address.
>>>> Software can usually be written to handle #1 by making a useful
>>>> abstraction layer. #2 means that data has consistent names and meanings,
>>>> and this requires people to agree on common definitions and a common
>>>> naming convention.
>>>>
>>>> RDF (Resource Description Framework) and its related technologies
>>>> (SPARQL, OWL, etc) are one of the many attempts to try to address this.
>>>> While this effort would benefit R, I think it's best if it's part of a
>>>> larger effort.
>>>>
>>>> Services such as DBpedia and Freebase are trying to unify many data sets
>>>> using RDF.
>>>>
>>>> The task view and package ideas are great ideas. I'm just adding another
>>>> perspective.
>>>>
>>>> Jason
>>>>
>>>> On 01/13/2012 05:18 PM, Roy Mendelssohn wrote:
>>>>> Hi Benjamin:
>>>>>
>>>>> What would make this easier is if these sites used standardized web services, so it would only require writing once. data.gov is the worst example; they spun their own, weak service.
>>>>>
>>>>> There is a lot of environmental data available through OPeNDAP, and that is supported in the ncdf4 package. My own group has a service called ERDDAP that is entirely RESTful, see:
>>>>>
>>>>> http://coastwatch.pfel.noaa.gov/erddap
>>>>>
>>>>> and
>>>>>
>>>>> http://upwell.pfeg.noaa.gov/erddap
>>>>>
>>>>> We provide R (and matlab) scripts that automate the extract for certain cases, see:
>>>>>
>>>>> http://coastwatch.pfeg.noaa.gov/xtracto/
>>>>>
>>>>> We also have a tool called the Environmental Data Connector (EDC) that provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to subset data that is served by OPeNDAP, ERDDAP, and certain Sensor Observation Service (SOS) servers, and have it read directly into R. It is freely available at:
>>>>>
>>>>> http://www.pfeg.noaa.gov/products/EDC/
>>>>>
>>>>> We can write such tools because the service is either standardized (OPeNDAP, SOS) or is easy to implement (ERDDAP).
>>>>>
>>>>> -Roy
>>>>>
>>>>>
>>>>> On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote:
>>>>>
>>>>>> Dear R Users -
>>>>>>
>>>>>> R is a wonderful software package. CRAN provides a variety of tools to
>>>>>> work on your data. But R is not apt to utilize all the public
>>>>>> databases in an efficient manner.
>>>>>> I observed the most tedious part with R is searching and downloading
>>>>>> the data from public databases and putting it into the right format. I
>>>>>> could not find a package on CRAN which offers exactly this fundamental
>>>>>> capability.
>>>>>> Imagine R is the unified interface to access (and analyze) all public
>>>>>> data in the easiest way possible. That would create a real impact,
>>>>>> would put R a big leap forward and would enable us to see the world
>>>>>> with different eyes.
>>>>>>
>>>>>> There is a lack of a direct connection to the API of these databases,
>>>>>> to name a few:
>>>>>>
>>>>>> - Eurostat
>>>>>> - OECD
>>>>>> - IMF
>>>>>> - Worldbank
>>>>>> - UN
>>>>>> - FAO
>>>>>> - data.gov
>>>>>> - ...
>>>>>>
>>>>>> The ease of access to the data is the key of information processing with R.
>>>>>>
>>>>>> How can we handle the flow of information noise? R has to give an
>>>>>> answer to that with an extensive API to public databases.
>>>>>>
>>>>>> I would love your comments and ideas as a contribution in a vital discussion.
>>>>>>
>>>>>> Benjamin
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>> **********************
>>>>> "The contents of this message do not reflect any position of the U.S. Government or NOAA."
>>>>> **********************
>>>>> Roy Mendelssohn
>>>>> Supervisory Operations Research Analyst
>>>>> NOAA/NMFS
>>>>> Environmental Research Division
>>>>> Southwest Fisheries Science Center
>>>>> 1352 Lighthouse Avenue
>>>>> Pacific Grove, CA 93950-2097
>>>>>
>>>>> e-mail: Roy.Mendelssohn at noaa.gov (Note new e-mail address)
>>>>> voice: (831)-648-9029
>>>>> fax: (831)-648-8440
>>>>> www: http://www.pfeg.noaa.gov/
>>>>>
>>>>> "Old age and treachery will overcome youth and skill."
>>>>> "From those who have been given much, much will be expected"
>>>>> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.
>>>>>
>
>