[R-sig-DB] help on deciding which open-source database to use with R

Prof Brian Ripley r|p|ey @end|ng |rom @t@t@@ox@@c@uk
Wed May 16 08:18:03 CEST 2007


I've not seen any response, so here are some thoughts.

On Fri, 11 May 2007, Mark W Kimpel wrote:

> I need to store and access increasingly large numbers of microarray
> datasets, which I analyze with R and BioC packages, and have decided to
> delve into the world of relational databases. For those of you
> unfamiliar with microarray datasets, they consist of unique indentifiers
> with associated raw and summary data. The unique identifiers map to
> established gene annotations that are updated regularly, a key reason I
> would like to use a relational database.
>
> In addition to just storing results, I would also like the database to
> perform SQL queries as well as use R within the database itself, so
> that, for example, an FDR calculation could be done on a geneset that
> was selected using various criteria from a web front-end without
> explicitly invoking R (I think postgreSQL can do that). Finally, I would
> like the database to be open-source and run on Linux.

I think you are ruling out SQLite: its internal scripting capabilities are 
much less.  Our experience is that PostgreSQL is a more powerful DBMS, but 
harder to manage than MySQL.  If SQLite were in play, it can be very slow 
compared to the other two.

> Here is what I have gathered from perusing reviews of databases and the
> R mailing lists and the cran and BioC repositories:
> 1. my top 3 choices would be MySQL, postgreSQL, and SQLite
> 2. postgreSQL is probably the most powerful and ideologically "pure"
> 3. MySQL has the largest user community and the most available books
> 4. SQLite is the easiest to set up and R from within R
> 5. there are several R packages for SQLite that assist with very routine
> things like storing dataframes
> 6. DBI and RMySQL seem to offer the most combined active development and
> power from cran and RdbiPgSQL and postgreSQL would be an analogous
> offering from BioC

Where do you get that from?  For example, RMySQL still does not work out 
of the box on standard 64-bit systems (despite detailed reports and 
promises to incorporate them), and it and DBI are two of a handful of CRAN 
packages still using the deprecated SaveImage (and were the last two to 
make use of install.R).  That's not 'active development' in my book.

RODBC has more access to advanced features (e.g. bulk updates, fast 
transfers to the database) and several people have reported it to be much 
faster than RMySQL (and that is also my experience).

> 7. RODBC would allow me to use just about any of the databases as well
> as Excel

Excel only on Windows.

> For all that "understanding", there is so much I can't figure out just
> from reading disparate sources. In particular, in am concerned about: 1.
> level of documentation so that I can learn, 2. likelihood of continued
> support and development, 3. ability to satisfy my present (as outlined
> in the first para) and unanticipated future needs, and 4. ease of use.
>
> Would someone who is familiar with these databases and how they "relate"
> (pun intended) to the R and BioC communities compare and contrast them
> for me?

-- 
Brian D. Ripley,                  ripley using stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595




More information about the R-sig-DB mailing list