[R] R/S and large datasets - Database access (also Re: SAS and S/R)
Emmanuel Charpentier
charpent at bacbuc.dyndns.org
Tue Nov 27 19:28:06 CET 2001
David James wrote:
>The Rdbi (or perhaps simply DBI, for database interface, since it is
>meant to include both R and Splus) is a simple interface to any database
>management system or DBMS (so far only *relational* databases have been
>considered) very similar in spirit to Java's Database Connectivity (JDBC),
>Perl's Database Independent (DBI), Python's Database API. It deals
>primarily with a common set of functions to interface R and Splus to
>databases (PostgreSQL, Oracle, Access, MySQL, mSQL, etc.) But we should
>think of this DBI only as a first step, or the infrastructure on which
>we can build more sophisticated tools. The proxy table/variable is a
>good example of such a tool. But if it's good for PostgreSQL tables,
>why not for Microsoft SQL tables? Or MySQL tables? By having a common
>interface, we hope to be able to build these sorts of advanced tools
>independent of the underlying DBMS.
>
That should make ODBC your first target ... More than half the work is
already done by this interface.
>
>Other applications may include the ability to attach() any database
>to the search() path (together with the idea of proxy objects,
>it could be helpful in some cases); also, the possibility to do
>"database apply" where we apply R functions to chunks on remote
>tables. (Roger Koenker and his colleague have an LM example, see
>http://www.econ.uiuc.edu/~roger/research/rq/LM.html). There has also
>been some interest in approximating quantiles, applying GLMs, etc., to
>very large datasets, but techniques like these will most likely require
>new algorithms to work sequentially.
>
That seems to have been on the minds of the developers of a large part
of R already. As far as I can tell, at least ...
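
To make the "work sequentially" idea concrete, here is a minimal sketch of
a chunk-wise fit. It is only an illustration under stated assumptions:
Python's built-in sqlite3 module stands in for a remote DBMS, the table
`obs` and the function `chunked_linear_fit` are hypothetical names, and
this is neither the DBI interface nor the LM example cited above. It fits
y = a + b*x by accumulating a few sums one chunk at a time, so the full
table never has to sit in memory.

```python
import sqlite3


def chunked_linear_fit(conn, query, chunk_size=1000):
    """Fit y = a + b*x from a two-column query, reading rows in chunks.

    Only running sums are kept between chunks, so memory use does not
    grow with the size of the table.
    """
    n = 0
    sx = sy = sxx = sxy = 0.0
    cur = conn.execute(query)
    while True:
        rows = cur.fetchmany(chunk_size)  # one chunk of (x, y) rows
        if not rows:
            break
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if n < 2 or denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical demo data: an in-memory table with an exact linear relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(i), 3.0 + 0.5 * i) for i in range(10000)],
)
intercept, slope = chunked_linear_fit(conn, "SELECT x, y FROM obs", chunk_size=500)
```

Plain sums of squares like these can lose precision on badly scaled data;
a more careful version would use an updating formula for the means and
cross-products instead.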
>
>
>And of course, some also have pointed out (Brian Ripley, among others)
>that sampling has been used quite successfully before by statisticians:-)
>and thus could be quite useful in some of these cases.
>
That, IMHO, is aimed at a totally different set of problems. Sampling
part of a dataset to build a model, then validating it on the rest of
the dataset, is not specific to very large datasets.
> I'm not aware
>of any tools available yet to do this on remote DBMSes, but one would
>hope that if such a tool were to be developed, it would be done on top
>of the DBI so that it could be used with any DBMS.
>
The easiest way is to select the subset through SQL queries, maybe
creating a small auxiliary table recording the subsetting for
reference purposes. This does not require many tools, just a working
knowledge of SQL and of the database structure. On large sites with
DBAs, the latter isn't even necessary: just request from them a view
suiting your needs and the ability to create your subset index tables...
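
As a rough sketch of what I mean (not a definitive recipe): the example
below uses Python's built-in sqlite3 module as a stand-in for whatever DBMS
you have, and the table names `measurements` and `subset_index` are made up
for illustration. The auxiliary table records which rows were drawn, so the
same subset can be reselected later and the remaining rows kept for
validation.

```python
import sqlite3

# Hypothetical demo database: 1000 rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO measurements (id, x, y) VALUES (?, ?, ?)",
    [(i, float(i), 2.0 * i + 1.0) for i in range(1, 1001)],
)

# Auxiliary table recording which rows belong to the subset.
conn.execute("CREATE TABLE subset_index (id INTEGER PRIMARY KEY)")
conn.execute(
    "INSERT INTO subset_index (id) "
    "SELECT id FROM measurements ORDER BY RANDOM() LIMIT 100"
)
conn.commit()

# Rows in the recorded subset (e.g. for building a model).
train = conn.execute(
    "SELECT m.id, m.x, m.y FROM measurements AS m "
    "JOIN subset_index AS s ON s.id = m.id"
).fetchall()

# All remaining rows (e.g. for validating the model).
holdout = conn.execute(
    "SELECT id, x, y FROM measurements "
    "WHERE id NOT IN (SELECT id FROM subset_index)"
).fetchall()
```

`ORDER BY RANDOM()` is SQLite syntax and sorts the whole table, so on a
really large table some cheaper selection rule, or whatever sampling
facility your DBMS offers, would be preferable.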
--
Emmanuel Charpentier