[R] reading very large files

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Feb 2 19:34:58 CET 2007


I suspect that reading from a connection in chunks of, say, 10,000 rows and
discarding those you do not want would be simpler and at least as quick, not
least because seek() on Windows is so unreliable.
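
A minimal sketch of that chunked approach, using readLines() on the
connection (the chunk size and the 900,000 / 3,000 figures come from the
question quoted below; the file name is illustrative):

con <- file("myfile", open = "r")
sel <- sort(sample(900000, 3000))   # line numbers to keep, sorted for a single pass
keep <- character(0)
nread <- 0                          # lines consumed so far
repeat {
    chunk <- readLines(con, n = 10000)
    if (length(chunk) == 0) break
    hit <- sel[sel > nread & sel <= nread + length(chunk)]
    keep <- c(keep, chunk[hit - nread])
    nread <- nread + length(chunk)
}
close(con)

This touches every line exactly once but never holds more than one chunk in
memory.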

On Fri, 2 Feb 2007, Henrik Bengtsson wrote:

> Hi.
>
> General idea:
>
> 1. Open your file as a connection, i.e. con <- file(pathname, open="r")
>
> 2. Generate a "row to (file offset, row length)" map of your text file,
> i.e. numeric vectors 'fileOffsets' and 'rowLengths'.  Use readBin()
> for this.  You build this up as you go by reading the file in chunks,
> which means you can handle files of any size.  You can save this lookup
> map to a file for future R sessions.
>
> 3. Sample a set of rows r = (r1, r2, ..., rR), i.e. rows =
> sample(length(fileOffsets), R).
>
> 4. Look up the file offsets and row lengths for these rows, i.e.
> offsets = fileOffsets[rows].  lengths = rowLengths[rows].
>
> 5. In case your subset of rows is not ordered, it is wise to order
> them first to speed things up.  If the original order is important,
> keep track of the ordering and re-order the results at the end.
>
> 6. For each row r, use seek(con=con, where=offsets[r]) to jump to the
> start of the row.  Use readBin(..., n=lengths[r]) to read the data.
>
> 7. Repeat from (3).
>
> /Henrik
>
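For reference, a rough sketch of steps 1-6 above (the 1 MB read size, the
3000-row sample, and the assumption of '\n'-terminated rows are illustrative;
the file is opened in binary mode here so that seek() gets byte-exact
offsets, and Prof Ripley's caveat about seek() on Windows applies):

con <- file(pathname, open = "rb")
fileOffsets <- numeric(0)           # byte offset at which each row starts
rowLengths  <- numeric(0)           # row length in bytes, excluding the newline
pos <- 0; lineStart <- 0
repeat {
    bytes <- readBin(con, what = "raw", n = 1e6)
    if (length(bytes) == 0) break
    nl <- which(bytes == as.raw(10L))       # '\n' positions within this chunk
    if (length(nl) > 0) {
        ends   <- pos + nl - 1              # absolute offsets of the newlines
        starts <- c(lineStart, ends[-length(ends)] + 1)
        fileOffsets <- c(fileOffsets, starts)
        rowLengths  <- c(rowLengths, ends - starts)
        lineStart <- ends[length(ends)] + 1
    }
    pos <- pos + length(bytes)
}

rows <- sort(sample(length(fileOffsets), 3000))   # steps 3 and 5: sample, then order
lines <- vapply(rows, function(r) {               # step 6: seek to each row and read it
    seek(con, where = fileOffsets[r])
    rawToChar(readBin(con, what = "raw", n = rowLengths[r]))
}, character(1))
close(con)
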
> On 2/2/07, juli g. pausas <pausas at gmail.com> wrote:
>> Hi all,
>> I have a large file (1.8 GB) with 900,000 lines that I would like to read.
>> Each line is a character string.  Specifically, I would like to randomly
>> select 3000 lines.  For smaller files, what I'm doing is:
>>
>> trs <- scan("myfile", what = character(), sep = "\n")
>> trs <- trs[sample(length(trs), 3000)]
>>
>> And this works OK; however, my computer does not seem able to handle the
>> 1.8 GB file.
>> I thought of an alternative way that does not require reading the whole file:
>>
>> sel <- sample(1:900000, 3000)
>> for (i in 1:3000) {
>>   # skip sel[i] - 1 lines so that line sel[i] itself is the one read
>>   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i] - 1, nlines = 1)
>>   write(un, "myfile_short", append = TRUE)
>> }
>>
>> This works on my computer; however, it is extremely slow: it reads only one
>> line on each pass through the file.  It has been running for 25 hours and I
>> think it has done less than half of the file (yes, probably I do not have a
>> very good computer, and I'm working under Windows ...).
>> So my question is: do you know any other, faster way to do this?
>> Thanks in advance
>>
>> Juli
>>
>> --
>> http://www.ceam.es/pausas
>>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


