[R] how to get how many lines there are in a file.
Liaw, Andy
andy_liaw at merck.com
Mon Dec 6 20:00:43 CET 2004
> From: Marc Schwartz
>
> On Mon, 2004-12-06 at 12:26 -0500, Liaw, Andy wrote:
>
> > Marc alerted me off-list that count.fields() might spend time
> > delimiting fields, which is not needed for the purpose of counting
> > lines, and suggested using sep="\n" as a possible way to make it
> > more efficient. (Thanks, Marc!)
> >
> > Here are some tests on a file with 14337 lines and 8900 fields
> > (space delimited).
> >
> > > system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
> > [1] 48.86 0.24 49.30 0.00 0.00
> > > system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
> > [1] 42.19 0.26 42.60 0.00 0.00
>
> Andy,
>
> I suspect that the relatively modest gain to be had here is the result
> of count.fields() still scanning the input buffer for the delimiting
> character, even though it would occur only once per line using the
> newline character. Thus, the overhead is not reduced substantially.
>
> A scan of the source code for the .Internal function would validate
> that.
>
> Thanks for testing this.
>
> As both you and Thomas mention, 'wc' is clearly the fastest way to go
> based upon your additional figures.
>
> Best regards,
>
> Marc
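Marc's quoted point that 'wc' wins deserves a concrete picture: `wc -l` essentially counts newline bytes while scanning large raw buffers, with no per-line or per-field parsing. A rough Python sketch of that strategy (the function name and buffer size are illustrative, not wc's actual implementation):

```python
def count_newlines(path, bufsize=1 << 20):
    """Count newline bytes by reading fixed-size binary chunks,
    roughly the strategy `wc -l` uses (a sketch, not wc's source)."""
    n = 0
    with open(path, "rb") as f:
        # Read 1 MiB at a time; bytes.count scans the buffer in C speed.
        while (buf := f.read(bufsize)):
            n += buf.count(b"\n")
    return n
```

Because it never splits the input into lines or fields, this avoids exactly the overhead that count.fields() incurs.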
Marc,
I wrote the following function to read the file in chunks:
countLines <- function(file, chunk = 1e3) {
    f <- file(file, "r")
    on.exit(close(f))
    nLines <- 0
    while ((n <- length(readLines(f, chunk))) > 0) nLines <- nLines + n
    nLines
}
To my surprise:
> system.time(n4 <- countLines("hcv.ap"), gcFirst=TRUE)
[1] 35.24 0.26 35.53 0.00 0.00
> system.time(n4 <- countLines("hcv.ap", 1), gcFirst=TRUE)
[1] 36.10 0.32 36.43 0.00 0.00
There's almost no penalty (in time) in reading one line at a time. One does
save quite a bit of memory with a small chunk size, though.
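The same chunked-reading idea translates directly to other languages. A hypothetical Python port of the countLines() function above (the name and chunk default mirror the R version; this is a sketch, not part of the thread):

```python
from itertools import islice

def count_lines(path, chunk=1000):
    """Count lines by reading at most `chunk` lines per iteration,
    mirroring the R countLines() function (hypothetical port)."""
    n = 0
    with open(path) as f:
        while True:
            # islice pulls up to `chunk` lines without loading the whole file.
            batch = list(islice(f, chunk))
            if not batch:
                return n
            n += len(batch)
```

As in the R version, only `chunk` lines are held in memory at once, so the memory footprint stays bounded regardless of file size.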
Cheers,
Andy