[Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
William Dunlap
wdunlap at tibco.com
Sat Sep 15 21:44:13 CEST 2012
Why not write the RDS file more atomically - write it to a
temporary file and rename that file to its final name when
it is completely written? E.g.,
    saveRDS.atomically <- function(object, file, ...)
    {
        # Create the temporary file in the same directory as the target so
        # that the final rename stays on one file system.
        tfile <- tempfile(basename(file), dirname(file))
        on.exit(if (file.exists(tfile)) unlink(tfile))
        retval <- saveRDS(object, tfile, ...)
        if (!file.rename(tfile, file)) { # perhaps want an if(file.exists(file))unlink(file) first
            stop("Cannot rename temporary file ", tfile, " to ",
                 file)
        }
        invisible(retval)
    }
(The file.rename may be tripped up by an overeager virus checker looking
at the newly created tfile. I don't know the best way to deal with that.)
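
One possible workaround for a transient rename failure, sketched here only as an assumption and not as a tested fix, is to retry the rename a few times with a short pause. The function name saveRDS_atomic_retry and the arguments 'retries' and 'wait' are hypothetical and not part of any R API; whether retrying actually helps with a virus checker is untested.

```r
# Hypothetical variant of saveRDS.atomically() that retries the rename.
# 'retries' and 'wait' are illustrative arguments chosen for this sketch.
saveRDS_atomic_retry <- function(object, file, ..., retries = 5L, wait = 0.1) {
  # Temporary file lives next to the target so the rename stays on one
  # file system.
  tfile <- tempfile(basename(file), dirname(file))
  # Clean up the temporary file if it is still around when we leave.
  on.exit(if (file.exists(tfile)) unlink(tfile))
  saveRDS(object, tfile, ...)
  for (i in seq_len(retries)) {
    # file.rename() returns FALSE (with a warning) when it fails.
    if (suppressWarnings(file.rename(tfile, file))) {
      return(invisible(NULL))
    }
    Sys.sleep(wait)  # assumption: the failure is transient
  }
  stop("Cannot rename temporary file ", tfile, " to ", file)
}

# Usage: write into a scratch directory and read the object back.
target <- file.path(tempdir(), "foo.rds")
saveRDS_atomic_retry(list(a = 1, b = "x"), target, compress = FALSE)
result <- readRDS(target)
```

A reader that opens foo.rds while this runs should see either the old complete file or the new complete file, provided the rename itself is atomic on the file system in use.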
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
> Of Henrik Bengtsson
> Sent: Saturday, September 15, 2012 10:22 AM
> To: R-devel
> Subject: [Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
>
> I hardly know anything about the format used in (non-compressed)
> serialization/RDS, but I am hoping someone with more knowledge could
> give me some feedback.
>
> Consider two R processes running in parallel on the same unknown file
> system. Both of them write to and read from the same RDS file foo.rds
> (without compression) at random times using saveRDS(object,
> file="foo.rds", compress=FALSE) and object2 <-
> readRDS(file="foo.rds"). This happens frequently enough that there is
> a risk of the two processes writing to the same "foo.rds" file at the
> same time (here one needs to acknowledge that file updates are neither
> atomic nor instant).
>
> To simulate the event that two processes write to the same file at
> the same time (and non-atomically), resulting in an interleaved/appended
> "foo.rds" file, I manually corrupted "foo.rds" by
> inserting/dropping/replacing a single random byte. It appears that
> readRDS() will detect this simple event by throwing an error on
> "unknown input format", which is what I want. My question is now: is
> it reasonable to assume that if two or more processes happen to write
> to the same RDS file at the same time, it is extremely unlikely (*)
> that they would generate a file that would pass as valid by readRDS()?
> (*) extremely unlikely = if all of us ran this toy example, we
> would not end up with an undetected but still corrupt "foo.rds" file
> in, say, 10000 years.
>
> Background: The R.cache package allows memoization (caching of
> results) to file such that the cache is persistent across R sessions.
> The persistent part is achieved by writing cache files to the same
> file directory. This is safe when you run a single process, and even
> if readRDS() fails to read a cache file it is no big deal; the
> memoization will just fail and the results will be recalculated and
> resaved. The question is what happens if you run this in parallel
> and push it to the extreme: is there a risk that the memoization will
> return properly but with invalid results? I prefer not having to
> synchronize this with a mutex/semaphore/common server, but instead
> rely on this try-and-see approach (cf. the Ethernet protocol on a
> shared medium). My guess (and hope) is that such a failure is extremely
> unlikely (*), but I'd like to hear if someone else thinks otherwise.
>
> Thanks,
>
> Henrik
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel