[R] large object disorientation

Roger Koenker roger at ysidro.econ.uiuc.edu
Tue Nov 21 15:11:52 CET 2000

This is an inquiry for all those who have been working on external
database applications.  I sent an inquiry (below) to S-news about
this sort of thing a couple of years ago and eventually decided that
I would wait to see what external database developments occurred and
then revisit the problem.  I hope that the foundations are now better.

Suppose for the sake of concreteness you have a large dataframe-like
object stored in some compressed format (e.g., I have a 48MB Stata
dataset with about 2.5 million observations on about 40 variables)
and you would like to do lm() fitting.  That is, you would like to
specify that the data frame is somehow external, and using the formula
specification in lm() generate a sequence of queries that would return
chunks of rows of the dataframe, accumulate X'X and X'y, do Major
Cholesky's solve, and return.  All with a modest memory requirement
and in the blink of the cpu's eye.  I realize that it sounds a bit
retrograde to be doing least squares computations like this, but if
there were a good way to do this, then there would be good ways to
do lots of other more interesting things too, I believe. 
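
To make the chunked computation concrete, here is a minimal sketch in
plain Python of the scheme described above: accumulate X'X and X'y one
chunk of rows at a time, then solve the normal equations with a Cholesky
factorization.  The names (row_chunks, accumulate_moments, cholesky_solve,
chunked_lm) are hypothetical, and row_chunks is only a stand-in for
whatever sequence of database queries would return the rows.  Note that
lm() itself uses a QR decomposition, which is numerically safer than the
normal equations, so this is an illustration of the idea and not a
replacement for lm().

```python
import math


def row_chunks(rows, size):
    """Hypothetical stand-in for a sequence of database queries:
    yields lists of at most `size` rows, each row being (x, y)."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def accumulate_moments(chunks, p):
    """Accumulate X'X (p x p) and X'y (length p) chunk by chunk.
    Only one chunk of rows is held in memory at a time."""
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for chunk in chunks:
        for x, y in chunk:
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(i + 1):  # lower triangle only
                    xtx[i][j] += x[i] * x[j]
    for i in range(p):  # mirror into the upper triangle
        for j in range(i + 1, p):
            xtx[i][j] = xtx[j][i]
    return xtx, xty


def cholesky_solve(a, b):
    """Solve a * beta = b for symmetric positive definite a,
    using a = L L' followed by two triangular solves."""
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("X'X is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    z = [0.0] * n  # forward substitution: L z = b
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    beta = [0.0] * n  # back substitution: L' beta = z
    for i in reversed(range(n)):
        tail = sum(low[k][i] * beta[k] for k in range(i + 1, n))
        beta[i] = (z[i] - tail) / low[i][i]
    return beta


def chunked_lm(chunks, p):
    """Least squares coefficients from an iterable of row chunks."""
    xtx, xty = accumulate_moments(chunks, p)
    return cholesky_solve(xtx, xty)


# Toy data: y = 2 + 3 t exactly, with an explicit intercept column.
rows = [((1.0, float(t)), 2.0 + 3.0 * t) for t in range(10)]
beta = chunked_lm(row_chunks(rows, size=4), p=2)
```

Memory use here depends on the chunk size and on p, not on the number
of rows, which is the point of the exercise.  The model matrix columns
are assumed to be already built for each row; generating them from a
formula, as lm() does, is not attempted.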

I expect that there are still grand plans for external-data-type sorts
of schemes that would, as John's message cited below suggests, allow
transparent, fully featured access in the language to large objects.
But for the moment I would be happy with something much more limited that
I could use to brew at home.  So any advice about which of the several
database interfaces might be suitable, and references to related applications
would be very welcome.

url:	http://www.econ.uiuc.edu		Roger Koenker	
email	roger at ysidro.econ.uiuc.edu		Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820

---------- Forwarded message ----------
Date: Thu, 2 Jul 1998 17:45:21 -0500 (CDT)
From: Roger Koenker <roger at ysidro.econ.uiuc.edu>
To: s-news at wubios.wustl.edu
Subject: [S] large object disorientation

Against my better judgement, I'm being inextricably drawn into problems
of computing fits for linear models which involve X matrices which are
too big to handle comfortably as single objects in Splus.  Of course,
it is clear that one can do old-fashioned things like accumulate moment
matrices by looping over submatrices stored as separate datasets, but
this seems a bit too much like reinventing SAS.

John Chambers about 5 years ago suggested here that there was some ongoing
research on a more modern approach to this sort of thing.  I'm wondering
whether there is progress on this front which will appear in V 5.0
for Unix, or whether others have any experience which they would be
willing to share on this topic.

John's comment is available from the S-news archive: 


The basic idea of using the method/class approach to recognize
disaggregated large objects, operate on them a piece at a time
and then assemble the results seems very attractive.  Obviously,
it is going to be easier to implement for some functions than
others, but it seems that it would be better to have a limited
functionality for large objects of this sort,  than nothing at
all.  In fact at the moment, I would be happy to have something
which managed to do basic indexing operations and some linear

