[R] How to separate huge dataset into chunks
Thomas Lumley
tlumley at u.washington.edu
Thu Mar 26 08:34:47 CET 2009
On Wed, 25 Mar 2009, Guillaume Filteau wrote:
> Hello Thomas,
>
> Thanks for your help!
>
> Sadly, your code does not work for the last chunk, because it has fewer
> rows than nrows.
>
You just need to move the test to the bottom of the loop:
repeat{
  chunk <- read.table(conn, nrows=10000, col.names=nms)
  ## do something to the chunk
  if(nrow(chunk) < 10000) break
}
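
Putting the pieces together, a minimal end-to-end version might look like the
sketch below. The file name, chunk size, and the write.csv() step are only
placeholders for your own file and processing; the tryCatch() is there because
a file whose row count is an exact multiple of the chunk size leaves the
connection empty, so the final read.table() stops with "no lines available in
input".

conn <- file("mybigfile", open="r")

## the first chunk carries the header, so read it separately to get the names
chunk <- read.table(conn, header=TRUE, nrows=10000)
nms <- names(chunk)
i <- 1

repeat{
  ## "do something" to the chunk -- here, write each piece to its own file
  write.csv(chunk, sprintf("chunk_%03d.csv", i), row.names=FALSE)
  i <- i + 1

  ## a short chunk means the tail of the file has just been handled
  if(nrow(chunk) < 10000) break

  ## an exhausted connection makes read.table() error, so treat that as EOF
  chunk <- tryCatch(read.table(conn, nrows=10000, col.names=nms),
                    error=function(e) data.frame())
  if(nrow(chunk) == 0) break
}
close(conn)

Each chunk then ends up in its own csv file with the original column names.
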
>
>
> Quoting Thomas Lumley <tlumley at u.washington.edu>:
>
>> On Tue, 24 Mar 2009, Guillaume Filteau wrote:
>>
>>> Hello all,
>>>
>>> I'm trying to take a huge dataset (1.5 GB) and separate it into smaller
>>> chunks with R.
>>>
>>> So far I have had nothing but problems.
>>>
>>> I cannot load the whole dataset into R due to memory problems, so I instead
>>> try to load a few (100,000) lines at a time (with read.table).
>>>
>>> However, R kept crashing (with no error message) at about line 6,800,000.
>>> This is extremely frustrating.
>>>
>>> To try to fix this, I used connections with read.table. However, I now get
>>> a cryptic error telling me "no lines available in input".
>>>
>>> Is there any way to make this work?
>>>
>>
>> There might be an error in line 42 of your script. Or somewhere else. The
>> error message is cryptically saying that there were no lines of text
>> available in the input connection, so presumably the connection wasn't
>> pointed at your file correctly.
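>>
>> A quick sanity check (with "mybigfile" standing in for the real path) is to
>> make sure the path resolves from R's working directory and to peek at the
>> first line before starting the loop:
>>
>> getwd()                       # where R is actually looking
>> file.exists("mybigfile")      # should be TRUE
>> conn <- file("mybigfile", open="r")
>> readLines(conn, n=1)          # should print the header line
>> close(conn)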
>>
>> It's hard to guess without seeing what you are doing, but
>> conn <- file("mybigfile", open="r")
>> chunk<- read.table(conn, header=TRUE, nrows=10000)
>> nms <- names(chunk)
>> while(nrow(chunk)==10000){
>>   ## do something to the chunk
>>   chunk <- read.table(conn, nrows=10000, col.names=nms)
>> }
>> close(conn)
>>
>> should work. This sort of thing certainly does work routinely.
>>
>> It's probably not worth reading 100,000 lines at a time unless your computer
>> has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce
>> much extra overhead and may well increase the speed by reducing memory use.
>>
>> -thomas
>>
>> Thomas Lumley Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu University of Washington, Seattle
>>
>>
>>
>
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle