[R] Histogram over a Large Data Set (DB): How?

Tim Churches tchur at optushome.com.au
Fri Nov 18 23:58:32 CET 2005


Eric Eide wrote:
> "Sean" == Sean Davis <sdavis2 at mail.nih.gov> writes:
> 
> 	Sean> Have you tried just grabbing the whole column using dbGetQuery?
> 	Sean> Try doing this:
> 	Sean> 
> 	Sean> spams <- dbGetQuery(con,"select unixtime from email limit
> 	Sean> 1000000")
> 	Sean> 
> 	Sean> Then increase from 1,000,000 to 1.5 million, to 2 million, etc.
> 	Sean> until you break something (run out of memory), if you do at all.
> 
> Yes, you are right.  For the example problem that I posed, R can indeed process
> the entire query result in memory.  (The R process grows to 240MB, though!)
> 
> 	Sean> However, the BETTER way to do this, if you already have the data
> 	Sean> in the database is to allow the database to do the histogram for
> 	Sean> you.  For example, to get a count of spams by day, in MySQL do
> 	Sean> something like: [...]
> 
> Yes, again you are right --- the particular problem that I posed is probably
> better handled by formulating a more sophisticated SQL query.
> 
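
For what it's worth, the sort of query Sean is describing might look
something like this (untested, and assuming the `email' table and
`unixtime' column of Unix timestamps from the example above):

    ## let MySQL bin the rows by day, then plot the daily counts in R
    ## (assumes `con' is an open connection to the MySQL database)
    daily <- dbGetQuery(con, "
        SELECT FROM_UNIXTIME(unixtime, '%Y-%m-%d') AS day,
               COUNT(*) AS n
        FROM email
        GROUP BY day
        ORDER BY day")
    barplot(daily$n, names.arg = daily$day, las = 2)
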
> But really, my goal isn't to solve the example problem that I posed ---
> rather, it is to understand how people use R to process very large data sets.
> The research project that I'm working on will eventually need to deal with
> query results that cannot fit in main memory, and for which the built-in
> statistical facilities of most DBMSs will be insufficient.
> 
> Some of my colleagues have previously written their analyses "by hand," using
> various scripting languages to read and process records from a DB in chunks.
> Writing things in this way, however, can be tedious and error-prone.  Instead
> of taking this approach, I would like to be able to use existing statistics
> packages that have the ability to deal with large datasets in good ways.
> 
> So, I seek to understand the ways that people deal with these sorts of
> situations in R.  Your advice is very helpful --- one should solve problems in
> the simplest ways available! --- but I would still like to understand the
> harder cases, and how one can use "general" R functions in combination with
> DBI's `dbApply' and `fetch' interfaces, which divide results into chunks.
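
For the harder case, one pattern that avoids holding the whole column in
memory is to fix the histogram breaks up front and then accumulate the
counts chunk by chunk as the rows are fetched. A rough, untested sketch
using plain DBI `dbSendQuery'/`fetch' (the breaks, chunk size and date
range are arbitrary, and the breaks need to span the full range of the
data):

    ## daily bins covering (say) calendar year 2005
    breaks <- seq(as.numeric(as.POSIXct("2005-01-01", tz = "GMT")),
                  as.numeric(as.POSIXct("2006-01-01", tz = "GMT")),
                  by = 24 * 60 * 60)
    counts <- integer(length(breaks) - 1)

    res <- dbSendQuery(con, "SELECT unixtime FROM email")
    while (!dbHasCompleted(res)) {
        chunk <- fetch(res, n = 100000)          # 100,000 rows at a time
        if (nrow(chunk) == 0) break
        h <- hist(chunk$unixtime, breaks = breaks, plot = FALSE)
        counts <- counts + h$counts
    }
    dbClearResult(res)

    plot(breaks[-1], counts, type = "h")         # the accumulated histogram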

You might be interested in our project, "NetEpi Analysis", which aims to
provide interactive exploratory data analysis and basic epidemiological
analysis via both a Web front end and a Python programmatic API (forgive
the redundancy in "programmatic API") for datasets of up to around 30
million rows (and as many columns as you like) on 32-bit platforms -
hundreds of millions of rows should be feasible on 64-bit platforms. It
stores data column-wise in memory-mapped on-disc arrays, and uses set
operations on ordinal indexes to permit rapid subsetting and
cross-tabulation of categorical (factored) data. It is written in
Python, but uses R for graphics and some (but not all) statistical
calculations (and for model fitting, when we get round to providing
facilities for same).

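To give a flavour of the approach in R terms (this is just a toy
illustration of the idea, not NetEpi code, which is Python): categorical
columns held as integer codes can be subset with set operations on
row-index vectors and cross-tabulated without ever materialising a full
data frame:

    n      <- 1e6
    sex    <- sample(1:2, n, replace = TRUE)       # codes for c("M", "F")
    agegrp <- sample(1:5, n, replace = TRUE)       # codes for 5 age groups

    rows.male  <- which(sex == 1)                  # ordinal index of a category
    rows.young <- which(agegrp <= 2)
    rows.both  <- intersect(rows.male, rows.young) # set operation on indexes

    table(factor(sex[rows.both], levels = 1:2, labels = c("M", "F")),
          factor(agegrp[rows.both], levels = 1:5))
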
See http://www.netepi.org - still in alpha, with an update coming out by
December. Although it is aimed at epidemiological analysis (of large
administrative health datasets), I dare say it might be useful for
exploring large databases of spam too.

Tim C



