[R-sig-DB] Working with Large sets of Data or even Big Data
CIURANA EUGENE (R users list)
Wed Mar 6 21:07:44 CET 2013
On 2013-03-06 11:24, Paul Bernal wrote:
> I managed to connect R to some database tables residing in Microsoft
> SQL Server, and I also got R to read the information in. My question
> is: since those tables can have 50 thousand records or more, or even
> hundreds of thousands or millions of records, what is the maximum
> number of records that R is able to read and process, or better put,
> what is the maximum number of rows and columns that R is able to
> manage (to perform data mining analysis)? What would I need to do in
> order for R to be able to manage large amounts of data?
Hi Paul!
The main constraint for R will be memory: the run-time will use as much
of it as it can get. A 50,000-record table is by no means "large". We
manipulate DBs with 2 million or more records on our machines, each
record with 20-50 fields, and many of the fields contain long text (my
company aggregates and summarizes breaking news stories). I write most
of the R scripts on a Mac with 16 GB RAM and run them on servers that
have either 8 or 16 GB RAM, against the same data set. R is happy.
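
A quick way to gauge how much memory a data set actually occupies once
loaded is object.size(). A minimal sketch -- the data frame below is
only a stand-in for your own table:

# Stand-in data frame, roughly the shape of a 50,000-record table
myData <- data.frame(id = 1:50000, value = rnorm(50000))
print(object.size(myData), units = "MB")   # footprint of this one object
gc()                                        # overall memory used by R so far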
I would suggest that, if possible, you don't read directly from the
database and instead create a simple ETL process for R. The R run-time
likes to "hang" while processing lots of data (sometimes it is just slow
processing), and the DB driver may interpret that as a hung connection
and drop it.
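
If you do need to pull straight from SQL Server, one way to keep the
connection from sitting idle is to fetch the result set in chunks and
release it promptly. A minimal sketch, assuming the DBI and odbc
packages and a DSN and table name that are only placeholders (RODBC's
sqlQuery() with its max= argument is an older alternative):

library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "mydsn")       # "mydsn" is a placeholder DSN
res <- dbSendQuery(con, "SELECT * FROM my_table")   # my_table is a placeholder
chunks <- list()
while (!dbHasCompleted(res)) {
  # pull 500,000 rows at a time instead of one giant fetch
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 500000)
}
dbClearResult(res)
dbDisconnect(con)
myData <- do.call(rbind, chunks)                    # stack the chunks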
My humble suggestion for a workflow:
1. Dump your DB/query to a TSV (tab-separated values) file.

2. Scrub the data with awk or equivalent to ensure that it doesn't have
   weird characters (non-UTF-8 and so on), or extraneous \n or \r that
   could mess up ingestion into R.

3. Split the data into multiple files, say of 500,000 records each.

4. Ingest the data into an R data frame with something like:

   myData <- read.table(file.choose(), sep="\t", header=TRUE, quote="",
                        comment.char="")

   This assumes that the first row has the column names.

5. Save your data frame NOW! If R hangs/crashes/etc. you won't have to
   reload from TSV -- that's slow -- so:

   save(myData, file="some-file-name.RData", compress="bzip2")

6. Repeat steps 4 and 5 until you have the whole data set in memory.
   You can "stack" two data sets like this (see the sketch after this
   list for a way to loop over all the split files):

   # first split file:
   myData <- read.table(file.choose(), sep="\t", header=TRUE, quote="",
                        comment.char="")
   myTotalData <- myData
   save(myTotalData, file="some-file-name.RData", compress="bzip2")

   # second and all remaining split files:
   myData <- read.table(file.choose(), sep="\t", header=TRUE, quote="",
                        comment.char="")
   myTotalData <- rbind(myTotalData, myData)
   save(myTotalData, file="some-file-name.RData", compress="bzip2")

   # repeat these last three steps until you get everything in memory

7. You're ready to manipulate this to your heart's content! Transform,
   extract, analyze, whatever you need to do.
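
If you'd rather not pick each split file by hand, steps 4 through 6 can
be rolled into a loop. A minimal sketch, assuming the split files live
in a directory named "splits" with a .tsv extension (both are
placeholders):

# Hypothetical layout: all split TSV files sit in ./splits
files <- list.files("splits", pattern = "\\.tsv$", full.names = TRUE)

pieces <- lapply(files, read.table, sep = "\t", header = TRUE,
                 quote = "", comment.char = "")
myTotalData <- do.call(rbind, pieces)       # stack all the chunks
save(myTotalData, file = "some-file-name.RData", compress = "bzip2")

This trades the incremental save() safety net above for brevity; for a
very large dump the step-by-step approach is the safer bet.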
By bringing the data into R's space you will have a more
memory-efficient representation for your operations. By saving it to an
.RData file you will have that representation stored in a tightly
compressed format (bzip2 rocks for that) that you can easily move
between systems.
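
Restoring that file on another machine is a one-liner: load() recreates
the objects under the names they were saved with (the file name below
is the same placeholder used above):

load("some-file-name.RData")   # recreates myTotalData in the workspace
str(myTotalData)               # quick sanity check of the restored data frame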
Here are some examples from our own databases:
* TSV dump to disk: 12.5 GB
* After scrubbing: 12.45 GB -- ~5,000 fewer records
* .RData representation: 1.3 GB
This fits comfortably in main memory on a MacBook Air. The servers
will have no problem dealing with it.
In general, plan on your R work not happening in real time against your
live database (think best practices). If you bang against a production
database, for example, you may slow down other users' access while your
queries run.
Good luck, and let us know how it goes!
pr3d4t0r
--
http://summly.com | http://eugeneciurana.com