[R] How to separate huge dataset into chunks

Guillaume Filteau filteau at unc.edu
Thu Mar 26 03:01:25 CET 2009


Hello Thomas,

Thanks for your help!

Sadly your code does not work for the last chunk, because its length is 
shorter than nrows.

I tried

try(chunk<-read.table(conn, nrows=10000,col.names=nms), silent=TRUE)

but it gives me an error (go figure!)

Best,
Guillaume



Quoting Thomas Lumley <tlumley at u.washington.edu>:

> On Tue, 24 Mar 2009, Guillaume Filteau wrote:
>
>> Hello all,
>>
>> I’m trying to take a huge dataset (1.5 GB) and separate it into 
>> smaller chunks with R.
>>
>> So far I had nothing but problems.
>>
>> I cannot load the whole dataset in R due to memory problems. So, I 
>> instead try to load a few (100000) lines at a time (with read.table).
>>
>> However, R kept crashing (with no error message) at about line 
>> 6800000. This is extremely frustrating.
>>
>> To try to fix this, I used connections with read.table. However, I 
>> now get a cryptic error telling me “no lines available in input”.
>>
>> Is there any way to make this work?
>>
>
> There might be an error in line 42 of your script. Or somewhere else. 
> The error message is cryptically saying that there were no lines of 
> text available in the input connection, so presumably the connection 
> wasn't pointed at your file correctly.
>
> It's hard to guess without seeing what you are doing, but
>    conn <- file("mybigfile", open="r")
>    chunk<- read.table(conn, header=TRUE, nrows=10000)
>    nms <- names(chunk)
>    while(nrow(chunk)==10000){   ## nrow(), since length() of a data frame counts columns
>       chunk<-read.table(conn, nrows=10000,col.names=nms)
>       ## do something to the chunk
>    }
>    close(conn)
>
> should work. This sort of thing certainly does work routinely.
>
> It's probably not worth reading 100,000 lines at a time unless your 
> computer has a lot of memory. Reducing the chunk size to 10,000 
> shouldn't introduce much extra overhead and may well increase the 
> speed by reducing memory use.
>
>     -thomas
>
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
>
>
>
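
For reference, the approach discussed above can be sketched as follows. This is only a minimal sketch under stated assumptions, not a tested recipe for the 1.5 GB file in question: the function name split_into_chunks, its arguments, and the output file naming are all hypothetical, and the input is assumed to be a whitespace-delimited text file with a header line. It stops after a short final chunk, and it treats an error from read.table() on an exhausted connection (which happens when the row count is an exact multiple of the chunk size) as end of input.

```r
## Hypothetical helper (not from the thread): split a whitespace-delimited
## text file with a header line into numbered chunk files.
split_into_chunks <- function(path, chunk_rows = 10000, out_prefix = "chunk") {
  conn <- file(path, open = "r")
  on.exit(close(conn))

  ## The first read consumes the header, so the column names are known
  ## for every later read from the same open connection.
  chunk <- read.table(conn, header = TRUE, nrows = chunk_rows)
  nms <- names(chunk)
  out_files <- character(0)

  while (!is.null(chunk) && nrow(chunk) > 0) {
    ## "Do something to the chunk": here it is simply written back out.
    out <- sprintf("%s_%04d.txt", out_prefix, length(out_files) + 1)
    write.table(chunk, out, row.names = FALSE, quote = FALSE)
    out_files <- c(out_files, out)

    ## A short chunk means the input is exhausted.
    if (nrow(chunk) < chunk_rows) break

    ## read.table() signals an error when no lines are left. Assumption:
    ## any error at this point is treated as end of input, which would
    ## also hide a genuine parse error in the middle of the file.
    chunk <- tryCatch(
      read.table(conn, nrows = chunk_rows, col.names = nms),
      error = function(e) NULL
    )
  }
  out_files
}
```

A call such as split_into_chunks("mybigfile", chunk_rows = 10000) would return the names of the files it wrote. Catching every error keeps the sketch short; a stricter version might inspect conditionMessage(e) before deciding that the input has really ended.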



