[R] Suggestion for big files [was: Re: A comment about R:]

Fri Jan 6 09:08:59 CET 2006

[Just one point extracted: Hadley Wickham has answered the random sample 
one]

On Thu, 5 Jan 2006, François Pinard wrote:

> [Brian Ripley]
>> One problem with Francois Pinard's suggestion (the credit has got lost)
>> is that R's I/O is not line-oriented but stream-oriented.  So selecting
>> lines is not particularly easy in R.
>
> I understand that you mean random access to lines, instead of random
> selection of lines.  Once again, this chat comes out of reading someone
> else's problem; this is not a problem I actually have.  SPSS was not
> randomly accessing lines, as data files could well be held on magnetic
> tapes, where random access is not possible in ordinary practice.  SPSS
> reads (or was reading) lines sequentially from beginning to end, and the
> _random_ sample is built while the reading goes.

That was not my point.  R's standard I/O is through connections, which 
allow for pushbacks, changing line endings and re-encoding character sets. 
That does add overhead compared to C/Fortran line-buffered reading of a 
file.  Skipping lines you do not need will take longer than you might 
guess (based on some limited experience).
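
To make the point concrete, here is a minimal sketch of what "skipping
lines" means when reading through a connection.  The helper name
read_line_range and its chunk argument are made up for illustration and
are not part of R.  It only uses file(), readLines() and close(), and it
shows that every skipped line is still read and processed by the
connection before it is thrown away.

```r
# Hypothetical helper (not part of R): return lines `from`..`to` of a
# text file by reading sequentially through a connection.
read_line_range <- function(path, from, to, chunk = 10000L) {
  stopifnot(from >= 1, to >= from)
  con <- file(path, open = "r")
  on.exit(close(con))

  # Skip the first from - 1 lines, a chunk at a time.  Each skipped line
  # still passes through the connection before being discarded.
  to_skip <- from - 1
  while (to_skip > 0) {
    got <- length(readLines(con, n = min(chunk, to_skip)))
    if (got == 0L) return(character(0))  # file is shorter than `from`
    to_skip <- to_skip - got
  }

  # Read the lines that are actually wanted (fewer if the file ends early).
  readLines(con, n = to - from + 1)
}

# Usage on a small temporary file.
tmp <- tempfile()
writeLines(sprintf("line %d", 1:100), tmp)
read_line_range(tmp, 41, 43)
```

Wrapping the call in system.time() on a large file, with a large value
of `from`, is one way to see how much of the cost goes into lines that
are never used.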

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595