[R-sig-DB] Working with Large sets of Data or even Big Data
CIURANA EUGENE (R users list)
Wed Mar 6 21:07:44 CET 2013
On 2013-03-06 11:24, Paul Bernal wrote:
> I managed to connect R to some database tables residing in Microsoft
> SQL Server, and I also got R to read the information in. My question
> is: since those tables can have 50 thousand records or more, or even
> hundreds of thousands or millions of records, what is the maximum
> number of records that R is able to read and process, or better put,
> what is the maximum number of rows and columns that R is able to
> manage (to perform data mining analysis)? What would I need to do in
> order for R to be able to manage large amounts of data?
Hi Paul!
The main constraint for R will be memory: the run-time will use as much
of it as it can get. A 50,000-record table is by no means "large". We
manipulate DBs with 2 million or more records on our machines, each
record with 20-50 fields, and many of the fields contain long text (my
company aggregates and summarizes breaking news stories). I write most
of the R scripts on a Mac with 16 GB RAM and run them on servers that
have either 8 or 16 GB RAM, against the same data set. R is happy.
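
A quick way to gauge how much memory a data set actually occupies once
loaded is object.size(). A minimal sketch -- the data frame below is
only a stand-in for your own table:

# Stand-in data frame, roughly the shape of a 50,000-record table
myData <- data.frame(id = 1:50000, value = rnorm(50000))
print(object.size(myData), units = "MB")   # footprint of this one object
gc()                                        # overall memory used by R so far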
I would suggest that, if possible, you don't read directly from the
database and instead create a simple ETL process for R. The R run-time
likes to "hang" while processing lots of data (sometimes it is just slow
processing), and the DB driver may interpret that as a hung connection
and drop it.
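
If you do need to pull straight from SQL Server, one way to keep the
connection from sitting idle is to fetch the result set in chunks and
release it promptly. A minimal sketch, assuming the DBI and odbc
packages and a DSN and table name that are only placeholders (RODBC's
sqlQuery() with its max= argument is an older alternative):

library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "mydsn")       # "mydsn" is a placeholder DSN
res <- dbSendQuery(con, "SELECT * FROM my_table")   # my_table is a placeholder
chunks <- list()
while (!dbHasCompleted(res)) {
  # pull 500,000 rows at a time instead of one giant fetch
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 500000)
}
dbClearResult(res)
dbDisconnect(con)
myData <- do.call(rbind, chunks)                    # stack the chunks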
My humble suggestion for a workflow:
1. Dump your DB/query to a TSV (tab-separated values) file.

2. Scrub the data with awk or equivalent to ensure that it doesn't have
   weird characters (non-UTF-8 and so on), or extraneous \n or \r that
   could mess up ingestion into R.

3. Split the data into multiple files, say of 500,000 records each.

4. Ingest the data into an R data frame with something like:

   myData <- read.table(file.choose(), sep="\t", header=TRUE, quote="",
                        comment.char="")

   This assumes that the first row has the column names.

5. Save your data frame NOW! If R hangs/crashes/etc. you won't have to
   reload from TSV -- that's slow -- so:

   save(myData, file="some-file-name.RData", compress="bzip2")

6. Repeat steps 4 and 5 until you have the whole data set in memory.
   You can "stack" two data sets like this (see the sketch after this
   list for a way to loop over all the split files):

   # first split file:
   myData <- read.table(file.choose(), sep="\t", header=TRUE, quote="",
                        comment.char="")
   myTotalData <- myData
   save(myTotalData, file="some-file-name.RData", compress="bzip2")

   # second and all remaining split files:
   myData <- read.table(file.choose(), sep="\t", header=TRUE, quote="",
                        comment.char="")
   myTotalData <- rbind(myTotalData, myData)
   save(myTotalData, file="some-file-name.RData", compress="bzip2")

   # repeat these last three steps until you get everything in memory

7. You're ready to manipulate this to your heart's content! Transform,
   extract, analyze, whatever you need to do.
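
If you'd rather not pick each split file by hand, steps 4 through 6 can
be rolled into a loop. A minimal sketch, assuming the split files live
in a directory named "splits" with a .tsv extension (both are
placeholders):

# Hypothetical layout: all split TSV files sit in ./splits
files <- list.files("splits", pattern = "\\.tsv$", full.names = TRUE)

pieces <- lapply(files, read.table, sep = "\t", header = TRUE,
                 quote = "", comment.char = "")
myTotalData <- do.call(rbind, pieces)       # stack all the chunks
save(myTotalData, file = "some-file-name.RData", compress = "bzip2")

This trades the incremental save() safety net above for brevity; for a
very large dump the step-by-step approach is the safer bet.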
By bringing the data into R's space you will have a more
memory-efficient representation for your operations. By saving it to an
.RData file you will have that representation stored in a tightly
compressed format (bzip2 rocks for that) that you can easily move
between systems.
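
Restoring that file on another machine is a one-liner: load() recreates
the objects under the names they were saved with (the file name below
is the same placeholder used above):

load("some-file-name.RData")   # recreates myTotalData in the workspace
str(myTotalData)               # quick sanity check of the restored data frame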
Here are some examples from our own databases:
* TSV dump to disk: 12.5 GB
* After scrubbing: 12.45 GB -- ~5,000 fewer records
* .RData representation: 1.3 GB
This fits comfortably in main memory on a MacBook Air. The servers
will have no problem dealing with it.
In general, plan on your R work not happening in real time against your
live database (think best practices). If you bang against a production
database, for example, you may slow down other users' access while your
queries run.
Good luck, and let us know how it goes!
pr3d4t0r
--
http://summly.com | http://eugeneciurana.com