[Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?

Sat Sep 15 21:44:13 CEST 2012

Why not write the RDS file more atomically - write it to a
temporary file and rename that file to its final name when
it is completely written?  E.g.,

saveRDS.atomically <- function (object, file, ...) 
{
    # Write to a temporary file in the same directory as 'file', so the
    # rename below stays on one file system.
    tfile <- tempfile(basename(file), dirname(file))
    on.exit(if (file.exists(tfile)) unlink(tfile))
    retval <- saveRDS(object, tfile, ...)
    # perhaps want an if(file.exists(file))unlink(file) first
    if (!file.rename(tfile, file)) {
        stop("Cannot rename temporary file ", tfile, " to ", file)
    }
    invisible(retval)
}

(The file.rename may be tripped up by an overeager virus checker looking
at the newly created tfile.  I don't know the best way to deal with that.)
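
One possible way to soften the rename problem is to retry the rename a
few times before giving up.  The following is only a sketch under that
assumption: rename_with_retry() and save_rds_with_retry() are
hypothetical names, and the retry count and wait time are arbitrary
guesses, not values known to suit any particular virus checker.

```r
# Hypothetical helper: retry file.rename() a few times before giving up.
rename_with_retry <- function(from, to, tries = 5L, wait = 0.1) {
    for (i in seq_len(tries)) {
        # file.rename() returns FALSE (with a warning) when it fails.
        if (suppressWarnings(file.rename(from, to))) return(TRUE)
        Sys.sleep(wait)
    }
    FALSE
}

# Hypothetical variant of the write-then-rename idea that uses the
# retrying rename instead of a single file.rename() call.
save_rds_with_retry <- function(object, file, ..., tries = 5L, wait = 0.1) {
    tfile <- tempfile(basename(file), dirname(file))
    on.exit(if (file.exists(tfile)) unlink(tfile))
    saveRDS(object, tfile, ...)
    if (!rename_with_retry(tfile, file, tries, wait)) {
        stop("Cannot rename temporary file ", tfile, " to ", file)
    }
    invisible(NULL)
}

# Usage: write the same target twice, then read it back.
target <- file.path(tempdir(), "foo.rds")
save_rds_with_retry(list(a = 1:3), target, compress = FALSE)
save_rds_with_retry(list(a = 4:6), target, compress = FALSE)
result <- readRDS(target)
```

Retrying only helps if whatever blocks the rename lets go of the file
within the waiting period; otherwise the function still stops with an
error and removes the temporary file on exit.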

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
> Of Henrik Bengtsson
> Sent: Saturday, September 15, 2012 10:22 AM
> To: R-devel
> Subject: [Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
> 
> I hardly know anything about the format used in (non-compressed)
> serialization/RDS, but I am hoping someone with more knowledge can
> give me some feedback.
> 
> Consider two R processes running in parallel on the same unknown file
> system.  Both of them write to and read from the same RDS file foo.rds
> (without compression) at random times using saveRDS(object,
> file="foo.rds", compress=FALSE) and object2 <-
> readRDS(file="foo.rds").  This happens frequently enough that
> there is a risk of the two processes writing to the same "foo.rds"
> file at the same time (here one needs to acknowledge that file updates
> are neither atomic nor instantaneous).
> 
> To simulate the event that two processes write to the same file at
> the same time (and non-atomically), resulting in an
> interleaved/appended "foo.rds" file, I manually corrupted "foo.rds" by
> inserting/dropping/replacing a single random byte.  It appears that
> readRDS() will detect this simple event by throwing an error on
> "unknown input format", which is what I want.  My question is now: is
> it reasonable to assume that if two or more processes happen to write
> to the same RDS file at the same time, it is extremely unlikely (*)
> that they would generate a file that would pass as valid by readRDS()?
>  (*) extremely unlikely = if all of us were to run this toy example, we
> would not end up with an undetected but still corrupt "foo.rds" file
> in, say, 10000 years.
> 
> Background: The R.cache package allows memoization (caching of
> results) to file such that the cache is persistent across R sessions.
> The persistent part is achieved by writing cache files to the same
> file directory.  This is safe when you run a single process, and even
> if readRDS() would fail to read a cache file it is no big deal; the
> memoization will just fail and the results will be recalculated and be
> resaved.  The question is what happens if you run this in parallel
> and push it to the extreme: is there a risk that the memoization will
> return normally but with invalid results?  I prefer not having to
> synchronize this with a mutex/semaphore/common server, but instead
> rely on this try-and-see approach (cf. the Ethernet protocol on shared
> medium).  My guess (and hope) is that such an event is extremely
> unlikely (*), but I'd like to hear if someone else thinks otherwise.
> 
> Thanks,
> 
> Henrik
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
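
The byte-replacement experiment and the recompute-on-failure behaviour
described in the quoted message can be sketched roughly as follows.
This is an illustration under assumptions, not the R.cache
implementation: read_cache_or_recompute() is a hypothetical helper, and
the example damages only the first byte of the file, so it exercises
the header check alone and says nothing about bytes changed elsewhere
in the file.

```r
# Hypothetical helper: return the cached value, or recompute and resave
# it when the cache file is missing or readRDS() signals an error.
read_cache_or_recompute <- function(file, compute) {
    # Wrap the value in a list so that a cached NULL is not mistaken
    # for a failed read.
    cached <- tryCatch(
        list(value = suppressWarnings(readRDS(file))),
        error = function(e) NULL
    )
    if (!is.null(cached)) return(cached$value)
    value <- compute()
    saveRDS(value, file, compress = FALSE)
    value
}

cache_file <- file.path(tempdir(), "foo.rds")
saveRDS(c(1.5, 2.5), cache_file, compress = FALSE)

# Replace the first byte of the file to mimic a damaged header.
bytes <- readBin(cache_file, what = "raw", n = file.size(cache_file))
bytes[1] <- as.raw(0x00)
writeBin(bytes, cache_file)

# Reading the damaged file directly is expected to signal an error ...
read_error <- tryCatch({ readRDS(cache_file); NULL }, error = function(e) e)

# ... so the helper falls back to recomputing and resaving the value.
recovered <- read_cache_or_recompute(cache_file, function() c(1.5, 2.5))
```

The fallback only protects against files that readRDS() rejects; a file
that is damaged but still parses would be returned as if it were valid,
which is exactly the case the question is about.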