[Rd] serialize() via a temporary file is heaps faster than doing it directly (on Windows)
Henrik Bengtsson
hb at stat.berkeley.edu
Fri Aug 29 21:43:37 CEST 2008
I just want to re-post this thread in case it slipped through the
"summer sieve" of someone who might be interested and/or has a real
solution beyond my serialize2() patch.
Cheers
Henrik
On Thu, Jul 24, 2008 at 8:10 PM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> FYI, I just noticed that on Windows (but not Linux) it is orders of
> magnitude (below it's 50x) faster to serialize() an object to a
> temporary file and then read it back, than to serialize it to a raw
> vector directly. This has an impact on, for instance, how fast
> digest::digest() can provide a checksum.
>
> Example:
> x <- 1:1e7;
> t1 <- system.time(raw1 <- serialize(x, connection=NULL));
> print(t1);
> # user system elapsed
> # 174.23 129.35 304.70 ## 5 minutes
> t2 <- system.time(raw2 <- serialize2(x, connection=NULL));
> print(t2);
> # user system elapsed
> # 2.19 0.18 5.72 ## 5 seconds
> print(t1/t2);
> # user system elapsed
> # 79.55708 718.61111 53.26923
> stopifnot(identical(raw1, raw2));
>
> where serialize2() serializes to a file and reads the result back:
>
> serialize2 <- function(object, connection, ...) {
>   if (is.null(connection)) {
>     # It is faster to serialize to a temporary file and read it back
>     pathname <- tempfile();
>     con <- file(pathname, open="wb");
>     on.exit({
>       if (!is.null(con))
>         close(con);
>       if (file.exists(pathname))
>         file.remove(pathname);
>     });
>     base::serialize(object, connection=con, ...);
>     close(con);
>     con <- NULL;
>     fileSize <- file.info(pathname)$size;
>     readBin(pathname, what="raw", n=fileSize);
>   } else {
>     base::serialize(object, connection=connection, ...);
>   }
> } # serialize2()
>
> The above benchmarking was done in a fresh R v2.7.1 session on WinXP Pro:
>
>> sessionInfo()
> R version 2.7.1 Patched (2008-06-27 r46012)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
>
> When I do the same on a Linux machine there is no difference:
>
>> sessionInfo()
> R version 2.7.1 (2008-06-23)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> Is there an obvious reason (and an obvious fix) for this?
>
> Cheers
>
> Henrik
>
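
For anyone who wants to try this on their own machine, here is a smaller, self-contained sketch of the comparison. It is only a sketch under a few assumptions: the helper names (serialize_direct, serialize_via_file, serialize_via_rawcon) are made up for illustration, the vector is smaller than in the example above so it finishes quickly, and the rawConnection() variant is just one more base-R route to time. I am not claiming it is faster or slower on any platform.

```r
# Hypothetical helper names; only base R functions are used.

# Baseline: serialize straight to a raw vector.
serialize_direct <- function(object) {
  serialize(object, connection = NULL)
}

# Workaround from this thread: write to a temporary file, read it back.
serialize_via_file <- function(object) {
  pathname <- tempfile()
  con <- file(pathname, open = "wb")
  on.exit({
    if (!is.null(con)) close(con)  # only if serialize() failed midway
    unlink(pathname)               # always remove the temporary file
  })
  serialize(object, connection = con)
  close(con)
  con <- NULL
  readBin(pathname, what = "raw", n = file.info(pathname)$size)
}

# Another variant to time: an in-memory raw connection.
serialize_via_rawcon <- function(object) {
  con <- rawConnection(raw(0), open = "wb")
  on.exit(close(con))
  serialize(object, connection = con)
  rawConnectionValue(con)
}

# Smaller than the 1:1e7 example above, to keep the run short.
x <- rnorm(1e6)

t_direct <- system.time(raw_direct <- serialize_direct(x))
t_file   <- system.time(raw_file   <- serialize_via_file(x))
t_rawcon <- system.time(raw_rawcon <- serialize_via_rawcon(x))

# Elapsed times side by side; the numbers depend on platform and R version.
print(rbind(direct = t_direct, file = t_file, rawcon = t_rawcon)[, 1:3])

# All three routes should yield the same bytes.
stopifnot(identical(raw_direct, raw_file))
stopifnot(identical(raw_direct, raw_rawcon))
```

If the three timings are close on your system, the slowdown reported above is not reproduced there. A large gap between "direct" and "file" would match what I saw on Windows.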