[Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?

Sat Sep 15 19:21:51 CEST 2012

I know hardly anything about the format used in (non-compressed)
serialization/RDS, but I am hoping someone with more knowledge can
give me some feedback.

Consider two R processes running in parallel on the same unknown file
system.  Both of them write to and read from the same RDS file foo.rds
(without compression) at random times using saveRDS(object,
file="foo.rds", compress=FALSE) and object2 <-
readRDS(file="foo.rds").  This happens frequently enough that
there is a risk of the two processes writing to the same "foo.rds"
file at the same time (here one needs to acknowledge that file updates
are neither atomic nor instantaneous).

To simulate the event where two processes write to the same file at
the same time (and non-atomically), resulting in an
interleaved/appended "foo.rds" file, I manually corrupted "foo.rds" by
inserting/dropping/replacing a single random byte.  It appears that
readRDS() will detect this simple event by throwing an error on
"unknown input format", which is what I want.  My question is now: is
it reasonable to assume that if two or more processes happen to write
to the same RDS file at the same time, it is extremely unlikely (*)
that they would generate a file that would pass as valid by readRDS()?
(*) extremely unlikely = if all of us ran this toy example, we
would not end up with an undetected but still corrupt "foo.rds" file
in, say, 10000 years.
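
For anyone who wants to repeat the corruption experiment, here is a
minimal sketch.  corrupt_and_read() is a hypothetical helper made up
for this illustration, not part of R or R.cache.  It covers only the
"replace one byte" case (not insert/drop), flips all bits of the
chosen byte so that it is guaranteed to change, and sweeps every byte
position instead of sampling one at random.  Whether a given
corruption is detected presumably depends on where the byte lands, so
the sketch only tabulates the outcomes and makes no claim about them.

```r
# Hypothetical helper: flip all bits of one byte in a copy of 'path'
# and report whether readRDS() raises an error on the damaged copy.
corrupt_and_read <- function(path, pos = sample.int(file.size(path), 1L)) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # xor() on raw vectors is bitwise; 0xff inverts every bit of the byte
  bytes[pos] <- xor(bytes[pos], as.raw(0xff))
  bad <- tempfile(fileext = ".rds")
  on.exit(unlink(bad))
  writeBin(bytes, bad)
  tryCatch({
    suppressWarnings(readRDS(bad))
    "read without error"
  }, error = function(e) "error")
}

# Save a small object without compression, as in the scenario above
obj <- list(a = 1:10, b = letters)
f <- tempfile(fileext = ".rds")
saveRDS(obj, file = f, compress = FALSE)

# Corrupt each byte position in turn and tabulate what readRDS() does
outcomes <- vapply(seq_len(file.size(f)),
                   function(p) corrupt_and_read(f, pos = p),
                   character(1))
print(table(outcomes))
```

Any "read without error" entries in the table are the cases of
interest: the damaged file was accepted by readRDS().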

Background: The R.cache package allows memoization (caching of
results) to file such that the cache is persistent across R sessions.
The persistent part is achieved by writing cache files to the same
file directory.  This is safe when you run a single process, and even
if readRDS() fails to read a cache file it is no big deal; the
memoization will just fail and the results will be recalculated and
resaved.  The question is what happens if you run this in parallel
and push it to the extreme: is there a risk that the memoization will
return normally but with invalid results?  I prefer not having to
synchronize this with a mutex/semaphore/common server, but instead
rely on this try-and-see approach (cf. the Ethernet protocol on a
shared medium).  My guess (and hope) is that such a failure is
extremely unlikely (*), but I'd like to hear if someone else thinks
otherwise.
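
To make the try-and-see approach concrete, here is a rough sketch of
the pattern.  load_or_compute() is a hypothetical name used only for
this illustration; it is not R.cache's actual API or implementation.
It catches only the failures that readRDS() itself signals, so a
corrupt file that happens to read cleanly would be returned as is,
which is exactly the case I am asking about.

```r
# Hypothetical helper: return the cached value if the file can be read,
# otherwise recompute the value and resave it (no locking of any kind).
load_or_compute <- function(path, compute) {
  cached <- NULL
  if (file.exists(path)) {
    # Wrap the value in a list so that a cached NULL is still a hit
    cached <- tryCatch(list(value = readRDS(path)),
                       error = function(e) NULL)
  }
  if (!is.null(cached)) return(cached$value)
  value <- compute()
  saveRDS(value, file = path, compress = FALSE)
  value
}

path <- tempfile(fileext = ".rds")
x <- load_or_compute(path, function() sum(1:100))           # computed, saved
y <- load_or_compute(path, function() stop("not reached"))  # read from cache

# Overwrite the cache file with something that is not RDS data; the
# read fails, so the value is recomputed and the file is resaved.
writeLines("not an RDS file", path)
z <- load_or_compute(path, function() 42)
```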

Thanks,

Henrik