[Rd] Slow 'read.table' in R 1.4.0 (PR#1232)

ripley@stats.ox.ac.uk ripley@stats.ox.ac.uk
Sat, 29 Dec 2001 22:25:39 +0100 (MET)


As I told you privately several days ago,

*This has already been fixed in R-patched*.


On Sat, 29 Dec 2001 james.holtman@convergys.com wrote:

> The 'read.table' function appears to be up to 10X slower in R 1.4.0 than R
> 1.3.1 for some of the data sets I read in.  I was comparing the source code
> for the 2 versions and see that it was rewritten in R 1.4.0.
>
> I think I found out what part of the problem might be.  I was comparing
> R1.3.1 and R1.4.0 code and it appears that a statement is missing in some
> of the code for R 1.4.  This is the section of code at the beginning of
> read.table.  The loop starting with 'while (nlines < 5)' will read in the
> entire file, because there is no increment of 'nlines' in the loop.  I
> traced through the code and this is what was happening.  It then does a
> 'pushBack' of the entire file.  In tracing through the code, this is where
> it appears to be taking the time.  With the change noted below, the speed
> was similar to R 1.3.1 and the results were the same.
>
> Here is the current code with what I think is the additional statement
> needed:
>
> =================part of read.table========
>
>     nlines <- 0
>     lines <- NULL
>     while (nlines < 5) {
>         line <- readLines(file, 1, ok = TRUE)
>         if (length(line) == 0)
>             break
>         if (blank.lines.skip && length(grep("^[ \\t]*$", line)))
>             next
>         if (length(comment.char) && nchar(comment.char)) {
>             pattern <- paste("^[ \\t]*", substring(comment.char,
>                 1, 1), sep = "")
>             if (length(grep(pattern, line)))
>                 next
>         }
>         lines <- c(lines, line)
>         #
>         #  additional line required
>         #
>         nlines <- nlines + 1
>     }
>     nlines <- length(lines)
>     if (!nlines) {
>         if (missing(col.names))
>             stop("no lines available in input")
>         else {
>             tmp <- vector("list", length(col.names))
>             names(tmp) <- col.names
>             class(tmp) <- "data.frame"
>             return(tmp)
>         }
>     }
>     if (all(nchar(lines) == 0))
>         stop("empty beginning of file")
>     pushBack(c(lines, lines), file)
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>
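
The look-ahead loop quoted above can be illustrated with a small standalone sketch. This is not the code from R-patched: `peek_lines` is a hypothetical helper, its arguments are assumptions made for illustration, and it bounds the loop by `length(lines)` rather than by a separate counter, so there is no increment to forget. It uses only `readLines`, `pushBack`, `grep` and `textConnection` from base R.

```r
# Hypothetical helper (not part of R): peek at the first `n` data lines of
# an open text-mode connection, skipping blank and comment lines, then push
# the accepted lines back so that later reads see them again.
peek_lines <- function(con, n = 5, comment.char = "#") {
    # Assumes comment.char is not a regular-expression metacharacter.
    pattern <- paste("^[[:blank:]]*", substring(comment.char, 1, 1), sep = "")
    lines <- character(0)
    while (length(lines) < n) {            # bound depends on accepted lines
        line <- readLines(con, 1, ok = TRUE)
        if (length(line) == 0)
            break                          # end of input
        if (length(grep("^[[:blank:]]*$", line)))
            next                           # skip blank line
        if (length(grep(pattern, line)))
            next                           # skip comment line
        lines <- c(lines, line)
    }
    if (length(lines))
        pushBack(lines, con)               # only the peeked lines go back
    lines
}

# Usage: 20 data rows preceded by a comment line and a blank line.
con <- textConnection(c("# header comment", "", paste("row", 1:20)))
first <- peek_lines(con)                   # at most 5 lines are consumed
rest <- readLines(con)                     # pushed-back lines, then the rest
close(con)
```

With a terminating loop, only the handful of peeked lines is pushed back, instead of the whole file as described in the report. As in the quoted code, skipped blank and comment lines are not pushed back.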

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

