Sat Aug 6 05:23:03 CEST 2016

Dear R Devel,

In a thread this morning Luke Tierney mentioned that R's garbage
collection is going to change soon, in 3.4.0. I couldn't find any
information about this on Google, but I wanted to share what I had
been discussing in another forum, in case it is not too late to raise
considerations which could affect the design of the planned changes to
R's garbage collection facilities.

I ran into a problem when trying to get R to quickly load some vectors
from disk. R should be able to do this efficiently using memory
mapping. There is a package 'ff' which implements efficient loading of
disk-based vectors using memory mapping. It works pretty well, but the
problem is that it creates a separate data type - the vectors are not
"native" R vectors. There are some wrapper functions in a package
'ffbase' which allow people to use common functions like 'sum' on
these 'ff' vectors. However, a new wrapper has to be written for every
such function, and I guess the 'ffbase' authors do not have time to
write wrappers that are as efficient as the native R functions - in my
testing, there was a 10x slow-down for 'sum'.

The situation is a bit unfortunate because an 'ff' vector and a native
R vector are basically the same data type: they both store elements
contiguously in memory. Apparently, what prevents 'ffbase' and 'ff'
from creating native R vectors is the fact that it is impossible to
assign a "finalizer" to a native R vector. We need a finalizer so that
R can tell us when a vector is being freed, so we can unmap the
associated memory/file. The 'ffbase' maintainer, Edwin de Jonge, was
even skeptical that CRAN would accept a package implementing the hack
I had proposed to simulate native R vectors from mmap'ed 'ff' vectors.
The issue is discussed here:


Of course, weak references and external pointers allow finalizers to
be assigned to objects, but as I understand it, such objects are
separate types from vectors - there is no way in R to synthesize a
native vector endowed with a finalizer - something which could be
passed directly to built-in functions like 'sum'.

I think a finalizer facility for vectors would be useful because it
would allow us to take advantage of the memory mapping architecture
present in all modern processors, to do fast copy-free operations on
large disk-based data structures, without having to re-implement
internal functions like 'sum' which are essentially the same algorithm
no matter where the data is stored.
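
To illustrate the point, here is a minimal C sketch. It is not taken
from 'ff' or from R's sources: the function names and the test-file
helper are made up for this example, and it assumes a POSIX system
with mmap and a file holding raw doubles in native byte order. The
same summation loop runs unchanged on a heap array and on a file
mapped read-only; the munmap call at the end is the cleanup that a
vector finalizer would have to trigger.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* One summation loop, used for heap and memory-mapped data alike. */
double sum_doubles(const double *x, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
}

/* Hypothetical helper: write the doubles 1..n to 'path' as raw bytes.
 * Returns 0 on success, nonzero on failure. */
int write_test_file(const char *path, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        double v = (double)(i + 1);
        if (fwrite(&v, sizeof v, 1, f) != 1) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f);
}

/* Map 'path' read-only, sum its contents as doubles, then unmap.
 * Returns 0 on success and stores the sum in *result. */
int sum_mapped_file(const char *path, double *result)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    /* No copy: the loop reads directly from the mapped pages. */
    *result = sum_doubles((const double *)p, len / sizeof(double));

    /* This is the step a finalizer on a native vector would perform. */
    return munmap(p, len);
}
```

Here the caller knows exactly when the data is no longer needed, so it
can call munmap itself. For a native R vector backed by such a
mapping, only the garbage collector knows that, which is why a
finalizer hook would be required.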

Thank you,


