[Rd] write.table with row.names=FALSE unnecessarily slow?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Mar 11 11:28:26 CET 2008
This is a pretty extreme case: why not use write() to write a single
column? (It's a bit faster than your patched timing.)
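
A minimal sketch of that comparison, for anyone who wants to try it. The row count is an assumption (far smaller than the 1e7 in the test case, so it finishes quickly), and the temporary file names and the variable names are illustrative only. Timings will differ by machine and R version.

```r
# Assumed size: much smaller than the 1e7 rows used in the thread.
n <- 5e4
df <- data.frame(x = seq_len(n))

f_table <- tempfile(fileext = ".txt")
f_write <- tempfile(fileext = ".txt")

# write.table() on a one-column data frame, with no row or column names.
t_table <- system.time(
    write.table(df, f_table, row.names = FALSE, col.names = FALSE)
)

# write() passes the vector to cat(); ncolumns = 1 puts one value per line.
t_write <- system.time(
    write(df$x, f_write, ncolumns = 1)
)

print(rbind(write.table = t_table, write = t_write)[, 1:3])

# Check whether the two files hold the same lines.
same <- identical(readLines(f_table, warn = FALSE),
                  readLines(f_write, warn = FALSE))
print(same)
```

write() only makes sense here because there is a single atomic column; for a mixed-type data frame write.table() is still the tool to use.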
In a more realistic test of 10 columns of 1 million rows I see a speedup
from 12.2 to 9.7 seconds.
So I'll add the patch, but think that significant speedups will be quite
rare.
BTW, this seems to be one of the places where we are paying the price of
the CHARSXP cache: system.time(as.character(1:1e7)) has got a lot slower.
Maybe some further tuning is called for.
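
A rough sketch of how the two costs could be looked at side by side: the integer-to-character conversion on its own, and the same conversion as triggered by dimnames() on a data frame with automatic row names. The size is an assumption (1e6 rather than 1e7) and the variable names are illustrative; no particular timings are implied.

```r
n <- 1e6  # assumed size, smaller than the 1e7 quoted above
df <- data.frame(x = seq_len(n))

# Automatic row names are stored compactly; .row_names_info() returns a
# negative count in that case.
compact <- .row_names_info(df) < 0

# Cost of the conversion alone.
t_char <- system.time(ch <- as.character(seq_len(n)))

# dimnames() on a data frame returns the row names as a character vector,
# so it performs the same conversion.
t_dimn <- system.time(d <- dimnames(df))

is_char <- is.character(d[[1]])

print(rbind(as.character = t_char, dimnames = t_dimn)[, 1:3])
```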
On Mon, 10 Mar 2008, Martin Morgan wrote:
> I neglected to include my test case,
>
>> df <- data.frame(x=1:(10^7))
>
> Martin
>
> Martin Morgan <mtmorgan at fhcrc.org> writes:
>
>> write.table with large data frames takes quite a long time
>>
>>> system.time({
>> + write.table(df, '/tmp/dftest.txt', row.names=FALSE)
>> + }, gcFirst=TRUE)
>>    user  system elapsed
>>  97.302   1.532  98.837
>>
>> One reason is that dimnames is always called, causing 'anonymous' row
>> names to be created as character vectors. Avoiding this in
>> src/library/utils, along the lines of
>>
>> Index: write.table.R
>> ===================================================================
>> --- write.table.R (revision 44717)
>> +++ write.table.R (working copy)
>> @@ -27,13 +27,18 @@
>>
>> if(!is.data.frame(x) && !is.matrix(x)) x <- data.frame(x)
>>
>> + makeRownames <- is.logical(row.names) && !is.na(row.names) &&
>> + row.names==TRUE
>> + makeColnames <- is.logical(col.names) && !is.na(col.names) &&
>> + col.names==TRUE
>> if(is.matrix(x)) {
>> ## fix up dimnames as as.data.frame would
>> p <- ncol(x)
>> d <- dimnames(x)
>> if(is.null(d)) d <- list(NULL, NULL)
>> - if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
>> - if(is.null(d[[2]]) && p > 0) d[[2]] <- paste("V", 1:p, sep="")
>> + if (is.null(d[[1]]) && makeRownames) d[[1]] <- seq_len(nrow(x))
>> + if(is.null(d[[2]]) && p > 0 && makeColnames)
>> + d[[2]] <- paste("V", 1:p, sep="")
>> if(is.logical(quote) && quote)
>> quote <- if(is.character(x)) seq_len(p) else numeric(0)
>> } else {
>> @@ -53,8 +58,8 @@
>> quote <- ord[quote]; quote <- quote[quote > 0]
>> }
>> }
>> - d <- dimnames(x)
>> - if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
>> + d <- list(if (makeRownames==TRUE) row.names(x) else NULL,
>> + if (makeColnames==TRUE) names(x) else NULL)
>> p <- ncol(x)
>> }
>> nocols <- p==0
>>
>> improves performance at least in proportion to nrow(x):
>>
>>> system.time({
>> + write.table(df, '/tmp/dftest1.txt', row.names=FALSE)
>> + }, gcFirst=TRUE)
>>    user  system elapsed
>>   8.132   0.608   8.899
>>
>> Martin
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M2 B169
>> Phone: (206) 667-2793
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595