[R-sig-DB] help on deciding which open-source database to use with R

Mark W Kimpel
Sat May 12 03:43:48 CEST 2007


I need to store and access increasingly large numbers of microarray 
datasets, which I analyze with R and BioC packages, and have decided to 
delve into the world of relational databases. For those of you 
unfamiliar with microarray datasets, they consist of unique identifiers 
with associated raw and summary data. The unique identifiers map to 
established gene annotations that are updated regularly, a key reason I 
would like to use a relational database.
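
To make the layout concrete, here is a minimal sketch of how identifiers, 
measurements, and annotations might be split across tables. It is written 
in Python with the standard-library sqlite3 module only so that it runs 
without a server. The table names, column names, and values are all 
made-up assumptions for illustration, not an established microarray schema.

```python
import sqlite3

# Hypothetical schema: every table, column, and value here is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    probe_id    TEXT PRIMARY KEY,
    gene_symbol TEXT
);
CREATE TABLE expression (
    probe_id      TEXT REFERENCES annotation(probe_id),
    sample_id     TEXT,
    raw_value     REAL,
    summary_value REAL,
    PRIMARY KEY (probe_id, sample_id)
);
""")

conn.executemany(
    "INSERT INTO annotation VALUES (?, ?)",
    [("probe_1", "GENE_A"), ("probe_2", "GENE_B")],
)
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?, ?)",
    [
        ("probe_1", "sample_1", 120.0, 7.1),
        ("probe_1", "sample_2", 135.0, 7.3),
        ("probe_2", "sample_1", 48.0, 5.6),
    ],
)


def annotated_values(connection):
    """Join each measurement to the current annotation of its probe."""
    return connection.execute("""
        SELECT e.probe_id, a.gene_symbol, e.summary_value
        FROM expression AS e
        JOIN annotation AS a ON e.probe_id = a.probe_id
        ORDER BY e.probe_id, e.sample_id
    """).fetchall()


before = annotated_values(conn)

# An annotation update changes one row; the stored measurements are untouched.
conn.execute(
    "UPDATE annotation SET gene_symbol = ? WHERE probe_id = ?",
    ("GENE_A2", "probe_1"),
)
after = annotated_values(conn)
```

Because the annotation lives in its own table, a revised gene symbol shows 
up in every later query without rewriting any expression data.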

In addition to just storing results, I would also like the database to 
support SQL queries and to run R within the database itself, so 
that, for example, an FDR calculation could be done on a geneset that 
was selected using various criteria from a web front-end without 
explicitly invoking R (I think PostgreSQL can do that). Finally, I would 
like the database to be open-source and run on Linux.
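
As a rough illustration of that workflow, the sketch below selects a 
geneset with an SQL query and then computes Benjamini-Hochberg adjusted 
p-values for it. It is written in Python with the standard-library sqlite3 
module, and the adjustment runs in the client rather than inside the 
database, so it only approximates the in-database idea. The table, the 
columns, the selection criterion, and the p-values are made-up assumptions; 
in R the adjustment step would be p.adjust(p, method = "BH").

```python
import sqlite3


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical results table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (probe_id TEXT PRIMARY KEY, chromosome TEXT, p_value REAL)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("p1", "chr1", 0.01),
        ("p2", "chr1", 0.04),
        ("p3", "chr2", 0.2),
        ("p4", "chr1", 0.03),
        ("p5", "chr1", 0.005),
    ],
)

# The selection criterion stands in for whatever a front-end would supply.
rows = conn.execute(
    "SELECT probe_id, p_value FROM results WHERE chromosome = ? ORDER BY probe_id",
    ("chr1",),
).fetchall()

probes = [probe for probe, _ in rows]
fdr = dict(zip(probes, bh_adjust([p for _, p in rows])))
```

The correction is applied only to the selected probes, so the number of 
tests used in the adjustment is the size of the geneset, not of the whole 
table.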

Here is what I have gathered from perusing reviews of databases and the 
R mailing lists and the CRAN and BioC repositories:
1. my top 3 choices would be MySQL, PostgreSQL, and SQLite
2. PostgreSQL is probably the most powerful and ideologically "pure"
3. MySQL has the largest user community and the most available books
4. SQLite is the easiest to set up and run from within R
5. there are several R packages for SQLite that assist with very routine 
things like storing dataframes
6. DBI and RMySQL seem to offer the most combined active development and 
power from CRAN, and RdbiPgSQL with PostgreSQL would be the analogous 
offering from BioC
7. RODBC would allow me to use just about any of the databases as well 
as Excel

For all that "understanding", there is so much I can't figure out just 
from reading disparate sources. In particular, I am concerned about: 1. 
level of documentation so that I can learn, 2. likelihood of continued 
support and development, 3. ability to satisfy my present (as outlined 
in the first para) and unanticipated future needs, and 4. ease of use.

Would someone who is familiar with these databases and how they "relate" 
(pun intended) to the R and BioC communities compare and contrast them 
for me?

Thanks,

Mark

---

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 663-0513 Home (no voice mail please)



