[R] Manage huge database

Mon Sep 22 10:41:30 CEST 2008

2008/9/22 José E. Lozano <lozalojo at jcyl.es>:
>> I wouldn't call a 4GB csv text file a 'database'.

> It didn't help, sorry. I perfectly knew what a relational database is (and I
> humbly consider myself an advanced user on working with MSAccess+VBA, only
> that I've never faced this problem with variables), you should not suppose
> everyone's stupid, though...

 Maybe you've not lurked on R-help for long enough :) Apologies!

A bit more googling tells me both MySQL and PostgreSQL have limits of
a few thousand on the number of columns in a table, not a few hundred
thousand. An insightful comment on one mailing list is:

"Of course, the real bottom line is that if you think you need more than
order-of-a-hundred columns, your database design probably needs revision
anyway ;-)"

 So, how much "design" is in this data? If none, and what you've
basically got is a 2000x500000 grid of numbers, then maybe a more raw
binary-type format will help - HDF or netCDF? Although I'm not sure
how much R support for reading slices of these formats exists, you may
be able to use an external utility to write slices out on demand.
Random access to parts of these files is pretty fast.

http://cran.r-project.org/web/packages/RNetCDF/index.html
http://cran.r-project.org/web/packages/hdf5/index.html
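
A rough sketch of the "external utility" idea, in Python rather than R. It does not use HDF or netCDF at all: it swaps in a plain flat binary file as the simplest stand-in for a raw binary format. It assumes the grid is stored row by row with every entry as a 16-bit little-endian integer, which is a guess about the data and not something known from the thread. The names write_grid, read_slice and demo are made up for illustration. The point is only that with fixed-width entries the byte offset of any cell can be computed, so a slice can be read with one seek instead of scanning the whole file.

```python
import os
import struct
import tempfile

# Assumption: every entry is one little-endian signed 16-bit integer.
ITEM = struct.Struct("<h")


def write_grid(path, rows):
    """Write a list of equal-length rows as a flat, row-major binary file."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack("<%dh" % len(row), *row))


def read_slice(path, n_cols, row, col_start, col_count):
    """Return col_count entries from one row, reading only those bytes."""
    # Fixed-width entries make the offset of any cell directly computable.
    offset = (row * n_cols + col_start) * ITEM.size
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(col_count * ITEM.size)
    return list(struct.unpack("<%dh" % col_count, data))


def demo():
    # Tiny 4 x 6 stand-in for the 2000 x 500000 grid.
    rows = [[r * 10 + c for c in range(6)] for r in range(4)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_grid(path, rows)
        return read_slice(path, n_cols=6, row=2, col_start=1, col_count=3)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The slice could then be written out as a small csv for R to pick up. HDF and netCDF give the same kind of random access, plus metadata and named variables, through their own libraries.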

 Thinking back to your 4GB file with 1,000,000,000 entries, that's
only 3 bytes per entry (+1 for the comma). What is this data? There
may be more efficient ways to handle it.
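
The arithmetic behind that estimate, as a quick check. It assumes "4GB" means 4 x 10^9 bytes and that each entry is followed by exactly one delimiter byte. Both are assumptions, and line endings are ignored. The function name bytes_per_entry is made up for illustration.

```python
def bytes_per_entry(file_bytes, n_rows, n_cols, delimiter_bytes=1):
    """Return (entry count, bytes per entry, bytes per entry without delimiter)."""
    entries = n_rows * n_cols
    total = file_bytes / entries
    return entries, total, total - delimiter_bytes


# Assumption: 4GB taken as 4 * 10**9 bytes, grid of 2000 x 500000 values.
entries, total, payload = bytes_per_entry(4 * 10**9, 2000, 500000)
print(entries, total, payload)
```

If 4GB is read as 4 x 2^30 bytes, the figure comes out slightly higher, at roughly 4.3 bytes per entry including the comma.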

 Hope *that* helps...

Barry