[R] the large dataset problem
Roland Rau
roland.rproject at gmail.com
Mon Jul 30 21:34:30 CEST 2007
Eric Doviak wrote:
>
> I need to find some way to overcome these constraints and work with large datasets. Does anyone have any suggestions?
I might not be the most authoritative person on this subject, but I put
all my large datasets[1] into an SQLite database and extract/summarize
data from it with R using the RSQLite package. If your data come in
ASCII format, it is rather easy to read them into an SQLite DB.
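
A minimal sketch of this workflow, assuming the DBI and RSQLite packages
are installed. The file, the table name "survey", its columns and the
chunk size are all invented for illustration; a real ASCII file would be
read in much larger chunks.

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver for DBI

# Hypothetical stand-in for a large ASCII (CSV) file
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:6,
                     state  = c("NY", "NJ", "NY", "CT", "NJ", "NY"),
                     income = c(10, 20, 30, 40, 50, 60)),
          csv_file, row.names = FALSE)

# Open (and thereby create) an SQLite database file
con <- dbConnect(SQLite(), dbname = tempfile(fileext = ".sqlite"))

# Load the file chunk by chunk so it never has to fit in memory at once
header     <- names(read.csv(csv_file, nrows = 1))
chunk_size <- 2                        # tiny on purpose, for the toy data
input      <- file(csv_file, open = "r")
invisible(readLines(input, n = 1))     # skip the header line
repeat {
  lines <- readLines(input, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    stringsAsFactors = FALSE)
  # The first chunk creates the table, later chunks are appended
  dbWriteTable(con, "survey", chunk, append = dbExistsTable(con, "survey"))
}
close(input)

# Let the database do the summarizing; only the small result reaches R
result <- dbGetQuery(con, "
  SELECT state, AVG(income) AS mean_income
  FROM survey
  GROUP BY state
  ORDER BY state")

dbDisconnect(con)
```

Once the data are in the database, picking up a variable that was
forgotten earlier is just another SELECT rather than a rerun of the
whole import.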
>
> I've read that I should "carefully vectorize my code." What does that mean ??? !!!
The book "S Programming" by Venables & Ripley has a sub-chapter on this.
If you happen to have John Chambers' "Programming with Data" book, there
are a few pages on "The Whole-Object View".
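
As a small illustration of the idea (not taken from either book, and with
made-up numbers), vectorizing means applying an operation to a whole
vector at once instead of looping over its elements:

```r
income <- c(12000, 55000, 8000, 31000)   # hypothetical values

# Loop version: one element at a time
low_loop <- logical(length(income))
for (i in seq_along(income)) {
  low_loop[i] <- income[i] < 20000
}

# Vectorized version: the comparison works on the whole object
low_vec <- income < 20000

# Vectorized recoding and summarizing, again without an explicit loop
group      <- ifelse(low_vec, "low", "other")
mean_other <- mean(income[!low_vec])
```

Both versions give the same logical vector; the vectorized one is shorter
and, on long vectors, usually much faster because the looping happens in
compiled code.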
>
> I wrote a script which loads large datasets a few lines at a time, writes the dozen or so variables of interest to a CSV file, removes the loaded data and then (via a "for" loop) loads the next few lines .... I managed to get it to work with one of the SIPP core files, but it's SLOOOOW. Worse, if I discover later that I omitted a relevant variable, then I'll have to run the whole script all over again.
>
So you have huge datasets, but you never need the whole dataset? Just a
selected number of variables, and then the files are of manageable
size?
If this is the case, using RSQLite is a good option. So is any other DB
package: RODBC, for example, is also very easy to use if you have an
MS Access DB. Alternatively, are you familiar with some old-fashioned
Unix tools? Ports for MS Windows also exist, and the program 'cut' could
help you considerably.
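
A rough sketch of how 'cut' can be combined with R, assuming 'cut' is on
the search path and the file is plainly comma-delimited with no quoted
fields containing commas. The file and its columns are hypothetical. The
second call shows a pure-R alternative that skips unwanted columns while
reading.

```r
# Hypothetical wide file; only 'id' and 'income' are of interest
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id     = 1:3,
                     age    = c(31, 45, 27),
                     state  = c("NY", "NJ", "CT"),
                     income = c(10, 20, 30)),
          csv_file, row.names = FALSE, quote = FALSE)

# Let 'cut' keep fields 1 and 4 before R ever sees the data
cmd     <- paste("cut -d, -f1,4", shQuote(csv_file))
via_cut <- read.csv(pipe(cmd))

# Pure-R alternative: a colClasses entry of "NULL" drops that column
via_r <- read.csv(csv_file,
                  colClasses = c("integer", "NULL", "NULL", "numeric"))
```

Either way, only the selected variables are held in memory.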
Please note:
- I am only a casual user of the DB interfaces, so there might be better
solutions and people with more detailed knowledge about them.
- All the tools I mentioned here are licensed under the same or similar
free software licenses as R, so you should have no problems
obtaining/installing them.
- A good source of information is the R Data Import/Export Manual --
shipped with every R distribution and available online at
http://cran.at.r-project.org/doc/manuals/R-data.html
I hope this helps,
Roland
[1] The largest one is 1GB -- so probably not really large.