[Rd] must .Call C functions return SEXP?

Thu Oct 28 16:59:26 CEST 2010

On Oct 28, 2010, at 9:48 AM, Andrew Piskorski wrote:

> On Thu, Oct 28, 2010 at 12:15:56AM -0400, Simon Urbanek wrote:
> 
>>> Reason I ask, is I've written some R code which allocates two long
>>> lists, and then calls a C function with .Call.  My C code writes to
>>> those two pre-allocated lists,
> 
>> That's bad! All arguments are essentially read-only so you should
>> never write into them! 
> 
> I don't see how.  (So, what am I missing?)  The R docs themselves
> state that the main point of using .Call rather than .C is that .Call
> does not do any extra copying and gives one direct access to the R
> objects.  (This is indeed very useful, e.g. to reorder a large matrix
> in seconds rather than hours.)
> 

Exactly - direct access without copying, which means that you are responsible for not modifying anything you don't own. Again, remember that R has copy-by-value semantics, so functions can never modify their arguments (at least from the user's point of view).

> I could allocate the two lists in my C code, but so far it was more convenient to do so in R.  

You don't just allocate them, you also assign them to an environment which is where the trouble starts. Let's look at a very simple example:

#include <Rinternals.h>

/* do NOT do that kids!! */
SEXP foo(SEXP x) {
  REAL(x)[0] = 1;  /* writes straight into the caller's object */
  return x;
}

In theory, if R were not performing any tricks behind the scenes, the expected behavior would be:

> a = 0
> .Call("foo", a)
[1] 1
> a
[1] 0

The reason is that in the S language all arguments are passed by value, so .Call("foo", a) really means .Call("foo", 0): you only change the "0", not a. However, R tries to avoid copying, so the environment holding "a" *and* the argument passed to .Call share the same memory.
Now, why is it a bad idea to modify arguments? This is why (this is actual output from R):

> a = 0
> b = a
> .Call("foo", a)
[1] 1
> a
[1] 1
> b
[1] 1

Because R assumes that you don't mess with the arguments, it also optimizes b to point to the same object as a, which you then modify. Therefore, the moment you start modifying arguments all bets are off: you cannot know which objects have been optimized to share the same memory, so you don't know what else you'll modify. (More on how you can detect it further down.)
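To make the aliasing concrete without involving R at all, here is a minimal plain-C sketch. It is only an analogy: the two pointers stand in for the variables a and b sharing one object, and the function names set_first_in_place and set_first_copy are hypothetical, not part of any R API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of the unsafe foo(): writes into the caller's buffer,
   so every other pointer to that buffer sees the change. */
double *set_first_in_place(double *x) {
    x[0] = 1.0;
    return x;
}

/* Hypothetical safe variant: copy the n values, then modify only the copy.
   Returns NULL if n is 0 or allocation fails; the caller frees the result. */
double *set_first_copy(const double *x, size_t n) {
    double *out;
    if (n == 0) return NULL;
    out = malloc(n * sizeof *out);
    if (out == NULL) return NULL;
    memcpy(out, x, n * sizeof *out);
    out[0] = 1.0;
    return out;
}
```

With double *a pointing at a buffer holding 0 and double *b = a, calling set_first_copy(a, 1) leaves both a[0] and b[0] at 0, whereas set_first_in_place(a) changes what b sees as well. That is the same surprise as in the R session above.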

There are also conceptual problems with that:
> .Call("foo", 0)
[1] 1
How can you change a "0" constant to 1 ?!?

> What possible difference in behavior can there be between the two approaches?
> 

The only way to allocate vectors in R is with things like numeric(10), but you may *not* assign the result anywhere. That's why .C uses constructs like .C(numeric(10), ...) to create result space for DUP=FALSE, but the only reason to do so is that it has no choice. You could call .Call(numeric(10), ...) but that sort of defeats the purpose and is somewhat dangerous from the user's point of view, since your C code would assume that nothing else (like a variable or a constant) is passed, but a "malicious" user could pass anything...

>> R has pass-by-value(!) semantics, so semantically your code has
>> nothing to do with the result.1 and result.2 variables since only
>> their *values* are guaranteed to be passed (possibly a copy).
> 
> Clearly C code called from .Call must be allowed to construct R
> objects, as that's how much of R itself is implemented, and further
> down, it's what you recommend I should do instead.
> 
> But why does it follow that C code must never modify an object
> initially allocated by R code?  Are you saying there is some special
> magic difference in the state of an object allocated by R's C code
> vs. one allocated by R code?  If so, what is it?
> 

It's the magic of all objects - regardless of where they are allocated - and it is essentially the NAMED bits that decide whether an object is to be copied or not. The object you passed from R was not "yours" in that it was shared between the environment you assigned it to (using result.1 <- ..) and your function. If you allocate it in C you know that it's not owned by anyone else, so you can safely modify it.

Now, we can go more into the internals and you can actually use NAMED to detect the cases. I'm still not recommending it for the use you mentioned (mostly because it may change without notice), but it should give you the full picture. Let's modify the example above by adding Rprintf("NAMED=%d\n", NAMED(x)); to foo.

Here are the different cases:

> .Call("foo", numeric(1))
NAMED=0
[1] 1
# numeric(1) is a direct allocation so it has no reference

> a = numeric(1)
> .Call("foo", a)
NAMED=1
[1] 1
# numeric(1) was a direct allocation, then assigned to a - so it has one reference

> b = a
> .Call("foo", a)
NAMED=2
[1] 1
# the numeric(1) value in both a and b now has two references
# note that it is not a real reference count - it has only the three states above, so removing b doesn't help

> .Call("foo", 1)
NAMED=2
[1] 1
# constants are always flagged to duplicate because they all could share memory (the real story is a bit different but that's one explanation ;))

So if you wanted to optimize you could treat the above cases differently and, yes, using a=numeric(1); .Call("foo",a) *should* have NAMED=1 and thus be safe to modify - but I would worry about any code that doesn't check that since it can have unwanted effects without anyone noticing.
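As a rough illustration of that decision, here is a toy model in plain C. It is a sketch under loud assumptions: ToyVec and the toy_* functions are made up for this mail and are NOT R's real SEXP layout or API; the named field merely imitates the three states shown above. The sketch also takes the conservative route of modifying in place only when named is 0 and copying in every other case (in real .Call code the copy would come from something like duplicate()).

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R numeric vector; NOT R's real SEXP layout. */
typedef struct {
    int named;      /* 0 = no binding, 1 = one binding, 2 = possibly shared */
    size_t length;
    double *data;
} ToyVec;

/* Allocate a zero-filled vector that nobody else refers to (named = 0). */
ToyVec *toy_alloc(size_t length) {
    ToyVec *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->data = calloc(length ? length : 1, sizeof *v->data);
    if (v->data == NULL) { free(v); return NULL; }
    v->named = 0;
    v->length = length;
    return v;
}

void toy_free(ToyVec *v) {
    if (v != NULL) { free(v->data); free(v); }
}

/* A fresh copy is owned only by the caller, so it starts with named = 0. */
ToyVec *toy_duplicate(const ToyVec *v) {
    ToyVec *copy = toy_alloc(v->length);
    if (copy == NULL) return NULL;
    memcpy(copy->data, v->data, v->length * sizeof *v->data);
    return copy;
}

/* Set element 0 to 1. Reuse x only when nothing else can see it;
   otherwise leave x untouched and return a modified copy. */
ToyVec *toy_foo(ToyVec *x) {
    ToyVec *target = (x->named == 0) ? x : toy_duplicate(x);
    if (target == NULL) return NULL;
    if (target->length > 0) target->data[0] = 1.0;
    return target;
}
```

The point of the design is that the caller can never tell whether a copy happened: the result always has the new value, and anything that might be visible elsewhere keeps its old one.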

> What is the potential problem here, that the garbage collector will suddenly run while my C code is in the middle of writing to an R list? Yes, if the gc is going to move the object elsewhere, that would be very bad.

GC doesn't move anything - it only releases unreferenced objects.

>  But it looks to me like that cannot happen, because lots of the R implementation itself would fail badly if it did.
> 
> E.g.:  The PROTECT call is used to increment reference counts,

There are no reference counts in R. PROTECT just adds the object to the protection stack (which is the same as adding it to any list or vector that is protected).

> but I see no guarantees that it is atomic with the operations that allocate objects.  I see no mutexes or other barriers in C code to prevent the gc from running, thus implying that it *can't* run until the C function completes. And R is single threaded, of course.  But what about signal handlers, could they ever invoke R's gc?

C code cannot be interrupted, exactly for this reason. However, gc can occur in any call to the R API, which is why PROTECT is needed in those cases.

> Also, I was initially surprised not to find any matrix C APIs, but grepping for examples (sorry, I don't remember exactly which functions) showed me that the apparently accepted way to do matrix operations from C is to simply assume R's column-first dense matrix order, and access the 2D matrix as a flat 1D vector.  (Which is easy.)
> 

Yes, that's what most sane programs handling matrices do ;).
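For completeness, a small sketch of that flat, column-major access in plain C. The helper names cm_index and row_sum are hypothetical; in actual .Call code the data pointer and dimensions would typically come from REAL(x) and nrows(x)/ncols(x), which this standalone example does not use.

```c
#include <stddef.h>

/* Offset of element (i, j), both 0-based, in a column-major matrix
   with nrow rows: columns are stored one after another. */
size_t cm_index(size_t i, size_t j, size_t nrow) {
    return i + j * nrow;
}

/* Sum of row i of an nrow-by-ncol column-major matrix stored in m. */
double row_sum(const double *m, size_t nrow, size_t ncol, size_t i) {
    double total = 0.0;
    size_t j;
    for (j = 0; j < ncol; j++) {
        total += m[cm_index(i, j, nrow)];
    }
    return total;
}
```

For the values 1..6 laid out as R's matrix(1:6, nrow=2) would store them, the first row is 1, 3, 5 and the second is 2, 4, 6, so the row sums come out as 9 and 12.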

>> The fact that internally R attempts to avoid copying for performance
>> reasons is the only reason why your code may have appeared to work,
>> but it's invalid!
> 
> I will probably change my code to allocate a new list from the C code
> and return that, as you recommend.  My main reason for doing the
> allocation in R was just that it was simpler, especially given the
> very limited documentation of R's C API.
> 
> But, I didn't see anything in the "Writing R Extensions" doc saying
> that what my code is doing is "invalid", and more importantly, I don't
> see why it would or should be invalid...
> 
> I'd still like to better understand why you think doing the initial
> allocation of an object in R rather than C code is such a problem.  So
> far, I don't see any way that the R interpreter could ever tell the
> difference.
> 
> Wait, or is the only objection here that I'm using C in a way that
> makes pass-by-reference semantics visible to my R code?  Which will
> work completely correctly, but is not The Proper R Way?
> 

See above - it breaks the assumptions that R makes, so you can end up changing things you don't intend to. Also, the internal optimizations may change in the future, so I would not count on it.

> I don't actually need pass-by-reference behavior here at all, but I
> can imagine cases where I might want it, so I'd like to understand
> your objections better.  Is using C to implement pass-by-reference
> actually Broken, or merely Ugly?  From my reasons above, I think it
> will always work correctly and thus is not Broken.  But of course
> given R's devotion to pass-by-value, it could be considered
> unacceptably Ugly.
> 

I hope it sheds some light on it.

Cheers,
Simon