[R] How to separate huge dataset into chunks
Guillaume Filteau
filteau at unc.edu
Thu Mar 26 03:01:25 CET 2009
Hello Thomas,
Thanks for your help!
Sadly, your code does not work for the last chunk, because it is shorter
than nrows rows.
I tried
try(chunk <- read.table(conn, nrows = 10000, col.names = nms), silent = TRUE)
but it gives me an error (go figure!).
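Would something along these lines be a saner way to stop at the end of the
file? It is only a rough, untested sketch built on your example below: it
reads another chunk only after the current one came back full-length, and
treats a failed read as the end of the input.

conn <- file("mybigfile", open = "r")
chunk <- read.table(conn, header = TRUE, nrows = 10000)
nms <- names(chunk)
repeat {
   ## do something with the current chunk here
   if (nrow(chunk) < 10000) break   # a short chunk means the file is done
   nxt <- try(read.table(conn, nrows = 10000, col.names = nms), silent = TRUE)
   if (inherits(nxt, "try-error")) break   # no lines left to read
   chunk <- nxt
}
close(conn)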
Best,
Guillaume
Quoting Thomas Lumley <tlumley at u.washington.edu>:
> On Tue, 24 Mar 2009, Guillaume Filteau wrote:
>
>> Hello all,
>>
>> I'm trying to take a huge dataset (1.5 GB) and separate it into
>> smaller chunks with R.
>>
>> So far I had nothing but problems.
>>
>> I cannot load the whole dataset in R due to memory problems. So, I
>> instead try to load a few (100000) lines at a time (with read.table).
>>
>> However, R kept crashing (with no error message) at about line
>> 6,800,000. This is extremely frustrating.
>>
>> To try to fix this, I used connections with read.table. However, I
>> now get a cryptic error telling me no lines available in input.
>>
>> Is there any way to make this work?
>>
>
> There might be an error in line 42 of your script. Or somewhere else.
> The error message is cryptically saying that there were no lines of
> text available in the input connection, so presumably the connection
> wasn't pointed at your file correctly.
>
> It's hard to guess without seeing what you are doing, but
> conn <- file("mybigfile", open = "r")
> chunk <- read.table(conn, header = TRUE, nrows = 10000)
> nms <- names(chunk)
> while (nrow(chunk) == 10000) {
>    chunk <- read.table(conn, nrows = 10000, col.names = nms)
>    ## do something to the chunk
> }
> close(conn)
>
> should work. This sort of thing certainly does work routinely.
>
> It's probably not worth reading 100,000 lines at a time unless your
> computer has a lot of memory. Reducing the chunk size to 10,000
> shouldn't introduce much extra overhead and may well increase the
> speed by reducing memory use.
>
> -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley at u.washington.edu University of Washington, Seattle