[R] how to get how many lines there are in a file.

Liaw, Andy andy_liaw at merck.com
Mon Dec 6 20:00:43 CET 2004


> From: Marc Schwartz
> 
> On Mon, 2004-12-06 at 12:26 -0500, Liaw, Andy wrote:
> 
> > Marc alerted me off-list that count.fields() might spend time
> > delimiting fields, which is not needed for the purpose of counting
> > lines, and suggested using sep="\n" as a possible way to make it
> > more efficient.  (Thanks, Marc!)
> > 
> > Here are some tests on a file with 14337 lines and 8900 fields
> > (space delimited).
> > 
> > > system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
> > [1] 48.86  0.24 49.30  0.00  0.00
> > > system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
> > [1] 42.19  0.26 42.60  0.00  0.00
> 
> Andy,
> 
> I suspect that the relatively modest gain to be had here is the result
> of count.fields() still scanning the input buffer for the delimiting
> character, even though it would occur only once per line using the
> newline character. Thus, the overhead is not reduced substantially.
> 
> A scan of the source code for the .Internal function would validate
> that.
> 
> Thanks for testing this.
> 
> As both you and Thomas mention, 'wc' is clearly the fastest way to go
> based upon your additional figures.
> 
> Best regards,
> 
> Marc
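The 'wc' approach mentioned above can also be driven from within R by
shelling out to the utility.  A minimal sketch, assuming a Unix-like
system with 'wc' on the PATH (the helper name countLinesWc is
hypothetical, not from the thread):

```r
## Hedged sketch: delegate line counting to the external 'wc' utility.
## Assumes a Unix-like system; countLinesWc is a hypothetical name.
countLinesWc <- function(file) {
    ## 'wc -l < file' prints only the count, with no filename attached.
    out <- system(paste("wc -l <", shQuote(file)), intern = TRUE)
    as.integer(out)
}
```

This trades portability for speed: it avoids R-level allocation
entirely, but fails on systems without 'wc'.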

Marc,

I wrote the following function to read the file in chunks:

countLines <- function(file, chunk=1e3) {
    ## Open a read-only text connection; ensure it is closed on exit.
    f <- file(file, "r")
    on.exit(close(f))
    nLines <- 0
    ## Read up to 'chunk' lines per pass until EOF, accumulating the count.
    while ((n <- length(readLines(f, chunk))) > 0) nLines <- nLines + n
    nLines
}

To my surprise:

> system.time(n4 <- countLines("hcv.ap"), gcFirst=TRUE)
[1] 35.24  0.26 35.53  0.00  0.00
> system.time(n4 <- countLines("hcv.ap", 1), gcFirst=TRUE)
[1] 36.10  0.32 36.43  0.00  0.00

There's almost no penalty (in time) in reading one line at a time.  One
does save quite a bit of memory that way, though.
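The same chunked idea can be pushed down to the byte level: rather than
materializing character vectors with readLines(), one can count newline
bytes directly from a binary connection.  A hedged sketch (countNewlines
is a hypothetical name, not from the thread; it undercounts by one if
the file does not end with a newline):

```r
## Count lines by scanning raw bytes for '\n' (byte 0x0A).
countNewlines <- function(file, chunk = 1e6) {
    con <- file(file, "rb")   # binary-mode connection
    on.exit(close(con))
    n <- 0
    ## Read up to 'chunk' raw bytes per pass until EOF.
    while (length(bytes <- readBin(con, "raw", n = chunk)) > 0)
        n <- n + sum(bytes == as.raw(10L))
    n
}
```

This avoids the per-line string allocation of readLines(), at the cost
of treating the file as bytes rather than text.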

Cheers,
Andy



