[R] how to get how many lines there are in a file.

Marc Schwartz MSchwartz at MedAnalytics.com
Mon Dec 6 22:10:18 CET 2004


On Mon, 2004-12-06 at 14:00 -0500, Liaw, Andy wrote:

> Marc,
> 
> I wrote the following function to read the file in chunks:
> 
> countLines <- function(file, chunk=1e3) {
>     f <- file(file, "r")
>     on.exit(close(f))
>     nLines <- 0
>     while((n <- length(readLines(f, chunk))) > 0) nLines <- nLines + n
>     nLines
> }
> 
> To my surprise:
> 
> > system.time(n4 <- countLines3("hcv.ap"), gcFirst=TRUE)
> [1] 35.24  0.26 35.53  0.00  0.00
> > system.time(n4 <- countLines3("hcv.ap", 1), gcFirst=TRUE)
> [1] 36.10  0.32 36.43  0.00  0.00
> 
> There's almost no penalty (in time) in reading one line at a time.
> One does save quite a bit of memory, though.

Andy, 

I suspect that the lack of a time penalty for reading one line at a time,
versus the larger chunks, is due to the disk caching and "read ahead"
functionality in the disk sub-system and the OS.

Thus, even though you are requesting one line to be read at a time in
your function, each physical read of the file by the disk sub-system is
in reality reading larger chunks of the file and storing that in cache
memory until needed or flushed by new data. 

So your function is taking advantage of higher speed memory to memory
transfers, versus disk to memory transfers, given the serial read nature
of the process.

As you point out however, system memory is conserved.
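
If you want to check this on your own system, something along the
following lines should do. It is only a rough sketch: the function name
count_lines_chunked, the temporary file and its size are my own
assumptions for illustration, not your original code or data, and the
timings will vary with the OS, the disk and whatever is already cached.

```r
# Hypothetical helper: count lines by reading `chunk` lines per call.
count_lines_chunked <- function(path, chunk = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    n <- length(readLines(con, n = chunk))
    if (n == 0L) break          # no lines left on the connection
    total <- total + n
  }
  total
}

# Build a throwaway test file (size chosen arbitrarily for illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", seq_len(20000L)), tmp)

# Time the same count with several chunk sizes.
for (chunk in c(1L, 100L, 10000L)) {
  elapsed <- system.time(n <- count_lines_chunked(tmp, chunk))[["elapsed"]]
  cat(sprintf("chunk = %5d  lines = %d  elapsed = %.2fs\n", chunk, n, elapsed))
}

unlink(tmp)
```

The line count should be the same for every chunk size. Only the
elapsed time and the peak memory use would be expected to differ, and
with a file this small the time differences may well be lost in the
noise.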

Best,

Marc



