[R] R tools for large files
Duncan Murdoch
dmurdoch at pair.com
Wed Aug 27 13:21:14 CEST 2003
On Wed, 27 Aug 2003 13:03:39 +1200 (NZST), you wrote:
>For real efficiency here, what's wanted is a variant of readLines
>where n is an index vector (a vector of non-negative integers,
>a vector of non-positive integers, or a vector of logicals) saying
>which lines should be kept.
I think that's too esoteric to be worth doing. Most often in cases
where you aren't reading every line, you don't know which lines to
read until you've read earlier ones.
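(For what it's worth, you can get most of the way there at the R level
by reading in chunks and subsetting; the sketch below is hypothetical,
not an existing function, and assumes con is an already-open
connection:)

  readSelectedLines <- function(con, keep, chunk = 10000) {
    # keep: 1-based line numbers to retain; chunk size is arbitrary
    kept <- character(0)
    seen <- 0
    repeat {
      block <- readLines(con, n = chunk)
      if (length(block) == 0) break
      idx <- keep - seen
      kept <- c(kept, block[idx[idx >= 1 & idx <= length(block)]])
      seen <- seen + length(block)
    }
    kept
  }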
>There are two fairly clear sources of overhead in the R code:
>(1) the overhead of reading characters one at a time through Rconn_fgetc()
> instead of a block or line at a time. mawk doesn't use fgets() for
> reading, and _does_ have the overhead of repeatedly checking a
> regular expression to determine where the end of the line is,
> which it is sensible enough to fast-path.
One complication with reading a block at a time is what to do when you
read too far. Not all connections support seek() to reposition back to
where the block started, so you'd have to read those one character at a
time (or attach a buffer somehow, but then what about read/write
connections?).
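(For text-mode connections there is already pushBack(), which lets R
code return over-read lines to the connection; it doesn't help at the C
level or for binary reads, but it illustrates the buffering idea:)

  con <- textConnection(c("a", "b", "c", "d"))
  block <- readLines(con, n = 3)   # suppose only 2 lines were wanted
  pushBack(block[3], con)          # put the extra line back
  readLines(con, n = 1)            # reads "c" again
  close(con)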
>The simplest thing that could possibly work would be to add a function
>skipLines(con, n) which simply read and discarded n lines.
>
> result <- scan(textConnection(lines), list( .... ))
That's probably worth doing.
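(An R-level version of the proposed skipLines() is just a thin wrapper
around readLines(); a C-level version could avoid building and
returning the discarded lines. A minimal sketch, assuming con is an
open connection:)

  skipLines <- function(con, n) {
    readLines(con, n = n)   # read and throw away n lines
    invisible(NULL)
  }

  # e.g. skip a 5-line header, then hand the rest to scan() or readLines()
  # skipLines(con, 5)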
Duncan Murdoch