[R] Histogram over a Large Data Set (DB): How?

Fri Nov 18 00:25:13 CET 2005

Hi!  I'm new to R, and I have a question about how R works with large data sets
--- in particular, data sets that come from databases.

I'm using R 2.2.0 with the DBI package (0.1-9) and the RMySQL package (0.5-5).

My get-my-feet-wet-with-R project is to make a histogram from a data set stored
in a MySQL database.  In particular, I have a table that describes some
observed spam emails.  The 'unixtime' column of the table contains the
timestamps of the messages.  My current goal is to plot the number of spams per
day during the recording period.

So far, this is my script (edited for brevity), and it works well:

-----
library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, group="spam")

# "where" clause to limit data set size, as described below.
spams <- dbGetQuery(con, "select unixtime from email where LENGTH(email_to)=4")

firstspam <- min(spams$unixtime)
lastspam <- max(spams$unixtime)
# spansize == one day
spansize <- 60 * 60 * 24

firstbreak = floor(firstspam / spansize)
lastbreak = celing(lastspam / spansize)
spambreaks = (firstbreak:lastbreak) * spansize

hist(spams$unixtime, br=spambreaks, plot=TRUE, col="red")
-----

The "where" clause serves to limit the number of records, while I figure out
the surrounding parts.  And now I'd like to do that... but I'm not sure how!

The actual "email" table has 2.9 million rows (a lot of spam!), so I presume
that I cannot (or at least, shouldn't) read it all at once.  Reading the
documentation of the RMySQL package, I understand that the syntax I want is
something like this:

-----
rs <- dbSendQuery(con, "select unixtime from email order by unixtime")
out <- dbApply(rs, INDEX = "unixtime",
         FUN = function(x, grp) hist(x$unixtime, ...))
-----

But I can't quite seem to make this work.  When I try the above directly, I get
errors like this:

  Error in mysqlDBApply(res, ...) : unimplemented type 'NULL' in 'length<-'

More generally, I'm pretty sure that I'm misusing `hist' in the code above.  I
don't want a series of histograms; I want just one histogram, made from the
concatenation of all the records fetched from the table.

I didn't find a recipe for doing this sort of thing in the R FAQ or in the
archives for this mailing list, so I'm hoping that someone can set me on the
right track.  How does one code in R with functions that expect "whole"
vectors, and somehow provide those vectors in a "piecemeal" fashion?  Is there
a general recipe for this situation in R (aren't large data sets common?), or
do I need to code the histogram generator myself?

Thanks for any help! ---

Eric.

-- 
-------------------------------------------------------------------------------
Eric Eide <eeide at cs.utah.edu>  .         University of Utah School of Computing
http://www.cs.utah.edu/~eeide/ . +1 (801) 585-5512 voice, +1 (801) 581-5843 FAX