[R] how to get how many lines there are in a file.
andy_liaw at merck.com
Mon Dec 6 18:26:42 CET 2004
> From: Liaw, Andy
> > From: Marc Schwartz
> > On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > > hi all
> > > If I wanna get the total number of lines in a big file without
> > > reading the file's content into R as matrix or data frame, any
> > > methods or functions?
> > > thanks in advance.
> > > Regards
> > See ?readLines
> > You can use:
> > length(readLines("FileName"))
> > to get the number of lines read.
> > HTH,
> > Marc Schwartz
> On a system equipped with `wc' (*nix or Windows with such utilities
> installed and on PATH) I would use that. Otherwise
> length(count.fields()) might be a good choice.
Marc alerted me off-list that count.fields() might spend time delimiting
fields, which is not needed for the purpose of counting lines, and suggested
using sep="\n" as a possible way to make it more efficient. (Thanks, Marc!)
Here are some tests on a file with 14337 lines and 8900 fields
(space-delimited):
> system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
[1] 48.86 0.24 49.30 0.00 0.00
> system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
[1] 42.19 0.26 42.60 0.00 0.00
> system.time(n2 <- length(readLines("hcv.ap")), gcFirst=TRUE)
[1] 37.77 0.56 38.35 0.00 0.00
> system.time(n3 <- scan(pipe("wc -l hcv.ap"), what=list(0, NULL))[[1]],
+             gcFirst=TRUE)
Read 1 records
[1] 0.00 0.00 0.33 0.08 0.25
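For anyone who wants the "use wc if it is there, otherwise count.fields()" rule in one place, here is a rough sketch. The function name count_lines() is made up for illustration, and it relies on Sys.which() and system2(), which exist in current versions of R but were not part of the tests above. Two caveats: `wc -l` counts newline characters, so a last line with no trailing newline is not counted, and count.fields() by default skips blank lines and treats "#" as a comment character, so those defaults are switched off here.

```r
# Hypothetical helper (not from the original thread): use `wc -l` when it is
# on PATH, otherwise fall back to count.fields() with sep = "\n".
count_lines <- function(path) {
  if (nzchar(Sys.which("wc"))) {
    # Output looks like "  14337 hcv.ap"; keep only the leading count.
    out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
    fields <- strsplit(trimws(out[1]), "[[:space:]]+")[[1]]
    as.integer(fields[1])
  } else {
    # Turn off quote/comment handling and blank-line skipping so that every
    # physical line is counted.
    length(count.fields(path, sep = "\n", quote = "",
                        blank.lines.skip = FALSE, comment.char = ""))
  }
}
```

Calling count_lines("hcv.ap") should then give the same number as n3 above on a machine that has wc.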
My only concern with the readLines() approach is that it still needs to read
the entire file into memory (if I'm not mistaken), which may not be
desirable:
> system.time(obj <- readLines("hcv.ap"), gcFirst=TRUE)
[1] 36.72 0.48 37.24 0.00 0.00
So it took 244+ MB just to store the text that was read in. I would use a
loop and read the file in small chunks if I really wanted to do it in R.
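A minimal sketch of such a loop, in case it is useful. The function name count_lines_chunked() and the default chunk_size are my own choices for illustration, not something that was timed above. The idea is to keep one connection open and call readLines() with n set, so that only one chunk of text is held in memory at a time.

```r
# Hypothetical helper: count lines by reading `chunk_size` lines at a time
# from an open connection instead of reading the whole file at once.
count_lines_chunked <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))          # always release the connection
  n <- 0L
  repeat {
    got <- length(readLines(con, n = chunk_size, warn = FALSE))
    n <- n + got
    if (got < chunk_size) break  # short (or empty) read means end of file
  }
  n
}
```

With very wide lines, such as the 8900-field rows in hcv.ap, a smaller chunk_size keeps the peak memory use down at the cost of more calls to readLines().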