[R-sig-DB] help on deciding which open-source database to use with R

Wed May 16 15:51:19 CEST 2007

Hello,
I have been using postgreSQL with the postGIS extension for a few years
now and I'm really happy with its performance and features. I still use
mySQL for other applications but what really determined my choice was the
fact that the OGC simple spatial features were fully implemented in
postGIS whereas mySQL only had limited support of spatial features.

As far as connecting from R goes, I have been using RODBC successfully and
found it to be in more active development than other R packages. The other
advantage of using RODBC is that you will setup ODBC connectivity to your
database, hence making it accessible to other software that provide ODBC
bindings. In my case, I connect to the same database server from R, SAS,
Perl, Matlab, Python and PHP through ODBC.

I run all this on a Linux box and it's been a great learning experience.
Both postgreSQL and postGIS are now available as packages and are easily
installed in Ubuntu. That's quite different from a few years ago where one
needed to built from source, a task that can be intimidating if you don't
have access to a good sysadmin/DBadmin person.

All the best.

Daniel Ricard
Dalhousie University, Halifax, NS, Canada

> I've not seen any response, so here are some thoughts.
>
> On Fri, 11 May 2007, Mark W Kimpel wrote:
>
>> I need to store and access increasingly large numbers of microarray
>> datasets, which I analyze with R and BioC packages, and have decided to
>> delve into the world of relational databases. For those of you
>> unfamiliar with microarray datasets, they consist of unique indentifiers
>> with associated raw and summary data. The unique identifiers map to
>> established gene annotations that are updated regularly, a key reason I
>> would like to use a relational database.
>>
>> In addition to just storing results, I would also like the database to
>> perform SQL queries as well as use R within the database itself, so
>> that, for example, an FDR calculation could be done on a geneset that
>> was selected using various criteria from a web front-end without
>> explicitly invoking R (I think postgreSQL can do that). Finally, I would
>> like the database to be open-source and run on Linux.
>
> I think you are ruling out SQLite: its internal scripting capabilities are
> much less.  Our experience is that PostgreSQL is a more powerful DBMS, but
> harder to manage than MySQL.  If SQLite were in play, it can be very slow
> compared to the other two.
>
>> Here is what I have gathered from perusing reviews of databases and the
>> R mailing lists and the cran and BioC repositories:
>> 1. my top 3 choices would be MySQL, postgreSQL, and SQLite
>> 2. postgreSQL is probably the most powerful and ideologically "pure"
>> 3. MySQL has the largest user community and the most available books
>> 4. SQLite is the easiest to set up and R from within R
>> 5. there are several R packages for SQLite that assist with very routine
>> things like storing dataframes
>> 6. DBI and RMySQL seem to offer the most combined active development and
>> power from cran and RdbiPgSQL and postgreSQL would be an analogous
>> offering from BioC
>
> Where do you get that from?  For example, RMySQL still does not work out
> of the box on standard 64-bit systems (despite detailed reports and
> promises to incorporate them), and it and DBI are two of a handful of CRAN
> packages still using the deprecated SaveImage (and were the last two to
> make use of install.R).  That's not 'active development' in my book.
>
> RODBC has more access to advanced features (e.g. bulk updates, fast
> transfers to the database) and several people have reported it to be much
> faster than RMySQL (and that is also my experience).
>
>> 7. RODBC would allow me to use just about any of the databases as well
>> as Excel
>
> Excel only on Windows.
>
>> For all that "understanding", there is so much I can't figure out just
>> from reading disparate sources. In particular, in am concerned about: 1.
>> level of documentation so that I can learn, 2. likelihood of continued
>> support and development, 3. ability to satisfy my present (as outlined
>> in the first para) and unanticipated future needs, and 4. ease of use.
>>
>> Would someone who is familiar with these databases and how they "relate"
>> (pun intended) to the R and BioC communities compare and contrast them
>> for me?
>
> --
> Brian D. Ripley,                  ripley using stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
> _______________________________________________
> R-sig-DB mailing list -- R Special Interest Group
> R-sig-DB using stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-db
>