[R] how to get how many lines there are in a file.
Liaw, Andy
andy_liaw at merck.com
Mon Dec 6 18:26:42 CET 2004
> From: Liaw, Andy
>
> > From: Marc Schwartz
> >
> > On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > > hi all
> > > If I wanna get the total number of lines in a big file
> > without reading
> > > the file's content into R as matrix or data frame, any methods or
> > > functions?
> > > thanks in advance.
> > > Regards
> >
> > See ?readLines
> >
> > You can use:
> >
> > length(readLines("FileName"))
> >
> > to get the number of lines read.
> >
> > HTH,
> >
> > Marc Schwartz
>
>
> On a system equipped with `wc' (*nix or Windows with such utilities
> installed and on PATH) I would use that. Otherwise
> length(count.fields())
> might be a good choice.
>
> Cheers,
> Andy
Marc alerted me off-list that count.fields() might spend time delimiting
fields, which is not needed for the purpose of counting lines, and suggested
using sep="\n" as a possible way to make it more efficient. (Thanks, Marc!)
Here are some tests on a file with 14337 lines and 8900 fields (space
delimited).
> system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
[1] 48.86 0.24 49.30 0.00 0.00
> system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
[1] 42.19 0.26 42.60 0.00 0.00
> n
[1] 14337
> system.time(n2 <- length(readLines("hcv.ap")), gcFirst=TRUE)
[1] 37.77 0.56 38.35 0.00 0.00
> n2
[1] 14337
> system.time(n3 <- scan(pipe("wc -l hcv.ap"), what=list(0, NULL))[[1]], gcFirst=TRUE)
Read 1 records
[1] 0.00 0.00 0.33 0.08 0.25
> n3
[1] 14337
My only concern with the readLines() approach is that it still needs to read
the entire file into memory (if I'm not mistaken), which may not be
desirable:
> system.time(obj <- readLines("hcv.ap"), gcFirst=TRUE)
[1] 36.72 0.48 37.24 0.00 0.00
> object.size(obj)/1024^2
[1] 244.6308
So it took 244+ MB just to store the text read in. I would use a loop and
read the file in small chunks, if I really wanted to do it in R.
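A minimal sketch of such a chunked loop follows, assuming the file can be
opened as an ordinary text connection. The names count_lines() and
chunk_size are made up for illustration and are not part of base R, and no
timings on hcv.ap are implied. The idea is simply to call readLines() with
n= repeatedly on an open connection, so only one chunk is in memory at a time.

```r
# Count lines by reading a fixed number of lines per iteration.
# count_lines() and chunk_size are hypothetical names used for illustration.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")   # open once so successive reads continue
  on.exit(close(con))             # always release the connection
  n <- 0                          # double, so very large counts do not overflow
  repeat {
    # warn = FALSE silences the warning about a missing final newline
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break   # character(0) means end of file
    n <- n + length(chunk)
  }
  n
}
```

It would be called as n4 <- count_lines("hcv.ap"). A larger chunk_size
means fewer calls to readLines() but more memory per chunk. Like
length(readLines()), this counts a final line that lacks a trailing
newline, which wc -l does not.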
Cheers,
Andy