[R] Histogram over a Large Data Set (DB): How?

Fri Nov 18 23:17:48 CET 2005

"Sean" == Sean Davis <sdavis2 at mail.nih.gov> writes:

	Sean> Have you tried just grabbing the whole column using dbGetQuery?
	Sean> Try doing this:
	Sean> 
	Sean> spams <- dbGetQuery(con,"select unixtime from email limit
	Sean> 1000000")
	Sean> 
	Sean> Then increase from 1,000,000 to 1.5 million, to 2 million, etc.
	Sean> until you break something (run out of memory), if you do at all.

Yes, you are right.  For the example problem that I posed, R can indeed process
the entire query result in memory.  (The R process grows to 240MB, though!)

	Sean> However, the BETTER way to do this, if you already have the data
	Sean> in the database is to allow the database to do the histogram for
	Sean> you.  For example, to get a count of spams by day, in MySQL do
	Sean> something like: [...]

Yes, again you are right --- the particular problem that I posed is probably
better handled by formulating a more sophisticated SQL query.

But really, my goal isn't to solve the the example problem that I posed ---
rather, it is to understand how people use R to process very large data sets.
The research project that I'm working on will eventually need to deal with
query results that cannot fit in main memory, and for which the built-in
statistical facilities of most DBMSs will be insufficient.

Some of my colleagues have previously written their analyses "by hand," using
various scripting languages to read and process records from a DB in chunks.
Writing things in this way, however, can be tedious and error-prone.  Instead
of taking this approach, I would like to be able to use existing statistics
packages that have the ability to deal with large datasets in good ways.

So, I seek to understand the ways that people deal with these sorts of
situations in R.  Your advice is very helpful --- one should solve problems in
the simplest ways available! --- but I would still like to understand the
harder cases, and how one can use "general" R functions in combination with
DBI's `dbApply' and `fetch' interfaces, which divide results into chunks.

Thanks!

Eric.

-- 
-------------------------------------------------------------------------------
Eric Eide <eeide at cs.utah.edu>  .         University of Utah School of Computing
http://www.cs.utah.edu/~eeide/ . +1 (801) 585-5512 voice, +1 (801) 581-5843 FAX