[R] How to separate huge dataset into chunks
tlumley at u.washington.edu
Thu Mar 26 08:34:47 CET 2009
On Wed, 25 Mar 2009, Guillaume Filteau wrote:
> Hello Thomas,
> Thanks for your help!
> Sadly your code does not work for the last chunk, because its length is shorter
> than nrows.
You just need to move the test to the bottom of the loop, so that the final,
shorter chunk is still processed before the loop exits.
## do something to the chunk
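
A minimal sketch of that pattern follows, for illustration only. The function name read_in_chunks, its arguments chunk_rows and handle_chunk, and the small temporary file are assumptions made for this example and are not from the original thread. Each chunk is handled first, and only then is its row count tested, so a short last chunk is not skipped. Note that read.table guesses column types separately for each chunk; passing colClasses would pin them if that matters.

```r
## Hypothetical helper (name and arguments are illustrative):
## read a whitespace-delimited file with a header line in chunks of
## `chunk_rows` rows and call `handle_chunk` on each chunk.
read_in_chunks <- function(path, chunk_rows = 10000, handle_chunk) {
  conn <- file(path, open = "r")   # open once; later reads continue from here
  on.exit(close(conn))             # always release the connection

  ## The first read consumes the header line and fixes the column names.
  chunk <- read.table(conn, header = TRUE, nrows = chunk_rows)
  nms <- names(chunk)

  rows_seen <- 0
  repeat {
    ## Guard against an empty chunk, which can occur when the number of
    ## rows is an exact multiple of chunk_rows.
    if (nrow(chunk) > 0) {
      handle_chunk(chunk)
      rows_seen <- rows_seen + nrow(chunk)
    }
    ## Test after processing: a short chunk means the input is exhausted.
    if (nrow(chunk) < chunk_rows) break
    chunk <- read.table(conn, nrows = chunk_rows, col.names = nms)
  }
  rows_seen
}

## Usage on a small temporary file: 25 data rows read 10 at a time.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:25, x = (1:25) * 2), tmp,
            row.names = FALSE, quote = FALSE)

sizes <- integer(0)
total <- read_in_chunks(tmp, chunk_rows = 10,
                        handle_chunk = function(chunk) {
                          sizes <<- c(sizes, nrow(chunk))  # record chunk sizes
                        })
```

To split the file rather than just scan it, handle_chunk could call write.table on each chunk with a different output file name.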
> Quoting Thomas Lumley <tlumley at u.washington.edu>:
>> On Tue, 24 Mar 2009, Guillaume Filteau wrote:
>>> Hello all,
>>> I'm trying to take a huge dataset (1.5 GB) and separate it into smaller
>>> chunks with R.
>>> So far I have had nothing but problems.
>>> I cannot load the whole dataset in R due to memory problems. So, I instead
>>> try to load a few (100000) lines at a time (with read.table).
>>> However, R kept crashing (with no error message) at about line 6800000.
>>> This is extremely frustrating.
>>> To try to fix this, I used connections with read.table. However, I now get
>>> a cryptic error telling me "no lines available in input".
>>> Is there any way to make this work?
>> There might be an error in line 42 of your script. Or somewhere else. The
>> error message is cryptically saying that there were no lines of text
>> available in the input connection, so presumably the connection wasn't
>> pointed at your file correctly.
>> It's hard to guess without seeing what you are doing, but
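>> conn <- file("mybigfile", open="r")  ## open once; reads continue from the current position
>> chunk <- read.table(conn, header=TRUE, nrows=10000)  ## first chunk, including the header line
>> nms <- names(chunk)  ## keep the column names for the later chunks
>> chunk <- read.table(conn, nrows=10000, col.names=nms)  ## next chunk (no header line)
>> ## do something to the chunk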
>> should work. This sort of thing certainly does work routinely.
>> It's probably not worth reading 100,000 lines at a time unless your computer
>> has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce
>> much extra overhead and may well increase the speed by reducing memory use.
>> Thomas Lumley Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu University of Washington, Seattle
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle