[R] RMySQL vs. Rdbi
J Chung
jeanhee at post.harvard.edu
Sun Jan 16 00:22:16 CET 2005
Hello,
I know that the topics of using large datasets in R vs. SAS, using
PostgreSQL vs. MySQL, and using databases with R have been discussed
extensively on this list and elsewhere. However, I hope that my question
is a slightly new combination of these.
I am doing my PhD research on a large dataset and trying to decide whether
to use PostgreSQL or MySQL with R, or simply use SAS, to which I also have
access. My databasing experience is currently limited to reading a few
chapters of a textbook, but I have some general programming experience in
C, C++, and Perl. I am leaning towards PostgreSQL at the moment because it
seems to have more built-in functions (i.e., less for me to write). On the
other hand, our sysadmin already has MySQL installed, and I hear that it's
faster.
Of PostgreSQL and MySQL, which has the more mature interface with R? Are
there any issues with RMySQL or Rdbi.PGSQL (or .MySQL) that I should be
aware of? Should they influence my choice among MySQL, PostgreSQL, and the
SAS integrated database?
My dataset is about 26 GB, currently split up into files of 260 MB. It has
about 540,000 records with 40 "explanatory" variables, many of which are
probably redundant, but I just don't know at the moment. It was way too slow
to work with in R on Red Hat Linux machines with 500 MB to 1 GB of RAM, especially
when producing plots. Preprocessing using Perl scripts every time I wanted
to look at a different subset of the data became too tedious. I hope to
create exploratory graphics such as sunflower plots, and also to try some
lattice plots to help me get a feel for the data. Then I'm interested in trying
some stepwise ANOVA, and finally searching for patterns using discriminant
analysis, and/or classification trees.
I would greatly appreciate any advice you might have on choosing a
databasing software environment.
Thank you,
Jean Chung