[R] R/S and large datasets - Database access (also Re: SAS and S/R)
Timothy H. Keitt
tklistaddr at keittlab.bio.sunysb.edu
Wed Nov 28 19:27:36 CET 2001
Emmanuel Charpentier wrote:
> A consensus seems to be emerging: R excels at exploratory work on
> small to medium-sized datasets, while SAS can munch through much
> larger datasets.
>
> However, I see the "size" problem as a red herring. The objects that
> have to stay "in core" are usually much smaller than the dataset. For
> example, for problems involving fixed-effects linear models, you need
> only some matrices whose size is proportional to the square of the
> number of *variables* and the (admittedly large) vector of residuals
> (whose size is equal to the number of observations). Other cases
> (nonlinear mixed-effects models come to mind) are not as easily tamed,
> since any iterative process (such as ML estimation) has to go back to
> the original data. But at least the time penalty incurred by using
> such an interface pays off by letting you tackle problems that would
> otherwise be intractable.
>
> I am aware of at least one database access package that allows access
> to data without dragging a whole table into memory: the RPgSQL
> package offers what it calls a "proxy variable", an object
> that behaves, for all practical purposes, like a data frame but is an
> interface to database tables. I see this kind of interface as a way to
> avoid overloading core memory with rarely used data.
>
> Unfortunately, that package is now officially orphaned by its
> developer, who states that he now focuses on the next database
> access standard: the Rdbi interface, which is currently under
> development and which I know nothing about.
>
> So the question is: does the Rdbi interface offer such a proxy to data
> still residing in databases?
>
> Or am I barking up the wrong tree and trying to (re-)invent an
> oversophisticated virtual memory manager? Should a sufficiently
> large swap file be enough for these "large dataset" problems?
>
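The point in the quoted message about fixed-effects linear models can be
illustrated with a minimal sketch. It is written in Python with the
standard-library sqlite3 module purely as a stand-in; it is not R, RPgSQL,
or Rdbi code. The table name "obs", the column names, and all function names
are hypothetical. Rows are streamed from the database in chunks, and only
the (p+1) x (p+1) matrix X'X and the vector X'y are kept in memory. The
residual vector is not computed here.

```python
import sqlite3


def accumulate_normal_equations(cursor, n_predictors, chunk_size=1000):
    """Stream rows (x1..xp, y) and accumulate X'X and X'y, with an intercept."""
    p = n_predictors + 1  # one extra column for the intercept
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    while True:
        rows = cursor.fetchmany(chunk_size)  # only one chunk is in memory
        if not rows:
            break
        for row in rows:
            x = [1.0] + [float(v) for v in row[:n_predictors]]
            y = float(row[n_predictors])
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_model(conn, table, predictors, response):
    """Fit y ~ 1 + predictors without loading the whole table."""
    cols = ", ".join(predictors + [response])
    cursor = conn.execute(f"SELECT {cols} FROM {table}")
    xtx, xty = accumulate_normal_equations(cursor, len(predictors))
    return solve(xtx, xty)


# Hypothetical demo data: y = 2 + 3*x1 - 1.5*x2, with no noise.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x1 REAL, x2 REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [(i, i % 7, 2.0 + 3.0 * i - 1.5 * (i % 7)) for i in range(500)],
)
coef = fit_linear_model(conn, "obs", ["x1", "x2"], "y")
```

Solving the normal equations directly is the simplest way to show the
memory argument; a numerically careful implementation would more likely
update a QR decomposition chunk by chunk.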
The problem with proxy data frames is that you can't pass them to
functions like 'lm' (at least when I tried it long ago), because the
functions that make the proxy object look like a data frame only exist
at the R level. When you drop down to internal C code, you call a
different set of (non-overloadable) functions, so it just appears as a
scalar object. Duncan's news about the generic "attach" interface may
soon make this possible, however. Having learned some SQL, I now find
it indispensable. As you say, you generally work with only a small
subset of your data, and SQL queries are the best way I've found to do
the subsetting.
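As a rough illustration of subsetting with SQL, here is a minimal sketch in
Python using the standard-library sqlite3 module as a stand-in for the
database (it is not RPgSQL or Rdbi code). The "survey" table, its columns,
and the fetch_subset helper are all hypothetical. The WHERE clause and the
column list run inside the database, so only the rows and columns of
interest reach the analysis session.

```python
import sqlite3


def fetch_subset(conn, min_year, species):
    """Return only the needed rows and columns; the database does the filtering."""
    query = (
        "SELECT site, year, abundance FROM survey "
        "WHERE year >= ? AND species = ? "
        "ORDER BY site, year"
    )
    # Parameters are bound by the driver rather than pasted into the SQL text.
    return conn.execute(query, (min_year, species)).fetchall()


# Hypothetical demo table standing in for a much larger one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey (site TEXT, year INTEGER, species TEXT, abundance REAL)"
)
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?, ?)",
    [
        ("A", 1999, "oak", 4.0),
        ("A", 2000, "oak", 5.5),
        ("A", 2001, "pine", 2.0),
        ("B", 2001, "oak", 7.0),
        ("B", 1998, "pine", 1.0),
    ],
)
subset = fetch_subset(conn, 2000, "oak")
```

The resulting small table can then be handed to ordinary modelling
functions as a regular in-memory data structure, which sidesteps the
proxy-object problem described above.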
Also, there has been some recent discussion of a proposed generic DBI
interface for R/S. Rdbi was my attempt at one (it is actually what I
originally set out to do with RPgSQL, but some necessary internal
functions were not yet documented or, in some cases, not yet
implemented). We more-or-less
settled on David James' proposal, but I do not know if anyone is
actually implementing it. It would be nice to have a reference
implementation so we can try it out and see what we do or don't like. I
hope to see all of this resolved soon as I have less and less time to
put into it and my interests are moving elsewhere (e.g., more GIS
capabilities).
T.
--
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/