[R] R tools for large files
    Duncan Murdoch 
    dmurdoch at pair.com
       
    Wed Aug 27 13:21:14 CEST 2003
    
    
  
On Wed, 27 Aug 2003 13:03:39 +1200 (NZST), you wrote:
>For real efficiency here, what's wanted is a variant of readLines
>where n is an index vector (a vector of non-negative integers,
>a vector of non-positive integers, or a vector of logicals) saying
>which lines should be kept.
I think that's too esoteric to be worth doing.  Most often in cases
where you aren't reading every line, you don't know which lines to
read until you've read earlier ones.
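For comparison, something like the effect of an index vector can be had today at the R level, though only by looping line by line, which is exactly the slow path under discussion.  A minimal sketch, with the file name and index vector as placeholders:

    con <- file("big.txt", open = "r")     # placeholder file name
    wanted <- c(1, 5, 100)                 # hypothetical index vector of lines to keep
    kept <- character(0)
    i <- 0
    repeat {
        line <- readLines(con, n = 1)
        if (length(line) == 0) break       # end of file
        i <- i + 1
        if (i %in% wanted) kept <- c(kept, line)
        if (i >= max(wanted)) break        # stop once all wanted lines are seen
    }
    close(con)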
>There are two fairly clear sources of overhead in the R code:
>(1) the overhead of reading characters one at a time through Rconn_fgetc()
>    instead of a block or line at a time.  mawk doesn't use fgets() for
>    reading, and _does_ have the overhead of repeatedly checking a
>    regular expression to determine where the end of the line is,
>    which it is sensible enough to fast-path.
One complication with reading a block at a time is what to do when you
read too far.  Not all connections can use seek() to reposition back, so
you'd need to read them one character at a time (or attach a buffer
somehow, but then what about read/write connections?).
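At the R level, pushBack() is one way to hand back lines that were read too far on a text-mode connection; whether something similar is workable inside the connection code in C is another matter.  A rough sketch, with the file name and block sizes as placeholders:

    con <- file("big.txt", open = "r")     # placeholder file name
    chunk <- readLines(con, n = 1000)      # read a block of lines at once
    keep <- chunk[1:500]                   # suppose only the first 500 were wanted
    pushBack(chunk[-(1:500)], con)         # return the over-read lines to the connection
    more <- readLines(con, n = 1)          # sees the first pushed-back line again
    close(con)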
>The simplest thing that could possibly work would be to add a function
>skipLines(con, n) which simply read and discarded n lines.
>
>	 result <- scan(textConnection(lines), list( .... ))
That's probably worth doing.
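A rough R-level version of the proposal, just to pin down the semantics (a C-level skipLines() would avoid building the vector of discarded lines); the file name, counts, and the what= specification in the scan() call are only placeholders:

    skipLines <- function(con, n) {
        invisible(readLines(con, n = n))   # read and discard n lines
    }

    con <- file("big.txt", open = "r")     # placeholder file name
    skipLines(con, 10)                     # skip a 10-line header, say
    lines <- readLines(con, n = 500)       # read the block of interest
    tc <- textConnection(lines)
    result <- scan(tc, list(x = 0, y = ""))
    close(tc)
    close(con)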
Duncan Murdoch