[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler
maechler at stat.math.ethz.ch
Sat May 30 19:32:52 CEST 2009
>>>>> "vQ" == Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>> on Sat, 30 May 2009 11:16:43 +0200 writes:
vQ> Martin Maechler wrote:
>> Hi Waclav (and other interested parties),
>>
>> I have committed my working version of src/main/coerce.c
>> so you can prepare your patch against that.
>>
vQ> some further investigation and reflections on the code in StringFromReal
vQ> (henceforth SFR), src/main/coerce.c:315 (as in the
vQ> patched version, now in r-devel).
vQ> petr's elim_trailing (renamed to dropTrailing,
vQ> henceforth referred to as DT) takes as input a const
vQ> char*, and returns a const char*.
vQ> const-ness of the return is not a problem; it is fed
vQ> into mkChar, which (via mkCharLenCE) makes a local
vQ> memcpy of the string, and there's no violation of the
vQ> contract here.
vQ> const-ness of the input is a consequence of the return
vQ> type of EncodeReal (henceforth EC). however, it is
vQ> hardly ever, "in principle", a good idea to
vQ> destructively modify const input (as DT does) if it
vQ> comes from a function that explicitly provides it as
vQ> const (as ER does).
yes, I think we've alway agreed on that "in principle".
vQ> the first question is, why does ER return the string as const? it
vQ> appears that the returned pointer provides the address of a buffer used
vQ> internally in ER, which is allocated *statically*. that is, each call
vQ> to ER operates on the same memory location, and each call to ER returns
vQ> the address of that same location. i suspect this is intended to be a
vQ> smart optimization, to avoid heap- or stack-allocating a new buffer in
vQ> each call to ER, and deallocating it after use. however, this appraoch
vQ> is problematic, in that any two calls to ER return the address of the
vQ> same piece of memory, and this may easily lead to data corruption.
Well, that would be ok if R could be used "threaded" / parallel / ...
and we all know that there are many other pieces of code {not
just R's "own", but also in Fortran/C algorithms ..} that are
not thread-safe.
"Yes, of course", R looks like a horrible piece of software to
some, because of that
vQ> under the assumption that the content of this piece of memory is copied
vQ> before any destructive use, and that after the string is copied the
vQ> address is not further distributed, the hack is relatively harmless.
vQ> this is what mkChar (via mkCharLenCE) does; in SFR it copies the
vQ> content of s with memcpy, and wraps it into a SEXP that becomes the
vQ> return value from SFR.
exactly.
vQ> the original author of this hack seems to have had some concern about
vQ> exporting (from ER) the address of a static buffer, hence the returned
vQ> buffer is const. in principle, this should prevent corruption of the
vQ> buffer's content in situations such as
vQ> // hypothetical
vQ> char *p1 = ER(...);
vQ> // p1 is some string returned from ER
vQ> char p2 = ER(...);
vQ> // p2 is some other string returned from ER
vQ> // some modifications performed on the string referred to by p1
vQ> p1[0] = 'x';
vQ> // p2[0] is 'x' -- possible data corruption
vQ> still worse in a scenario with concurrent calls to ER.
(which will not happen in the near future)
vQ> however, since the output from ER is const, this is no longer possible
vQ> -- at least, not without a deconstifying cast the petr style. the
vQ> problem with petr's solution is not only that it modifies shared memory
vQ> purposefully qualified as const (by virtue of ER's return type), but
vQ> also that it effectively distributes the address for further use.
vQ> unfortunately, like most of the r source code, ER is not appropriately
vQ> commented at the declaration and the definition, and without looking at
vQ> the code, one can hardly have any clue that ER always return the same
vQ> address of a static location. while the original developer might be
vQ> careful enough not to misuse ER, in a large multideveloper project it's
vQ> hard expect that from others. petr's function is precisely an example
vQ> of such misuse, and as it adds (again, without an appropriate comment) a
vQ> step of indirection; any use of petr's function other than what you have
vQ> in SFR (and can you guarantee no one will ever use DT for other
vQ> purposes?) is even more likely to end up in data corruption.
you have a point here, and as a consequence, I'm proposing to
put the following version of DT into the source :
------------------------------------------------------------------------
/* Note that we modify a 'const char*' which is unsafe in general,
* but ok in the context of filtering an Encode*() value into mkChar(): */
static const char* dropTrailing0(char *s, char cdec)
{
char *p = s;
for (p = s; *p; p++) {
if(*p == cdec) {
char *replace = p++;
while ('0' <= *p && *p <= '9')
if(*(p++) != '0')
replace = p;
while((*(replace++) = *(p++)))
;
break;
}
}
return s;
}
------------------------------------------------------------------------
so it has a comment along the lines you suggest, *and* as static
is not callable from outside coerce.c
vQ> one simple way to improve the code is as follows; instead of (simplified)
vQ> const char* dropTrailing(const char* s, ...) {
vQ> const char *p = s;
vQ> char *replace;
vQ> ...
vQ> replace = (char*) p;
vQ> ...
vQ> return s; }
vQ> ...mkChar(dropTrailing(EncodeReal(...), ...) ...
vQ> you can have something like
vQ> const char* dropTrailing(char* s, ...) {
vQ> char *p = s, *replace;
vQ> ...
vQ> replace = p;
vQ> ...
vQ> return s; }
vQ> ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...
vQ> where it is clear, from DT's signature, that it may (as it purposefully
vQ> does, in fact) modify the content of s. that is, you drop the
vQ> promise-not-to-modify contract in DT, and move the need for
vQ> deconstifying ER's return out of DT, making it more explicit.
vQ> however, this is still an ad hoc hack; it still breaks the original
vQ> developer's assumption (if i'm correct) that the return from ER
vQ> (pointing to its internal buffer) should not be destructively modified
vQ> outside of ER.
I think that's a misinterpretation of the history of ER.
IIRC, the main issue when changing from 'char*' to 'const char*'
in *many* places "simultaneously"
was the introduction of hashes / cashes of strings (in
mkChar(.)) in order to make R-level character handling (and
storage of STRSXP/CHARSXP) much more efficient.
vQ> another issue is that even making the return from ER const does not
vQ> protect against data corruption. for example,
vQ> const char *p1 = ER(...)
vQ> // p1 is some string returned from ER
vQ> const char *p2 = ER(...)
vQ> // p2 is some other string returned from ER
vQ> // but p1 == p2
vQ> if p1 is used after the second call to ER, it's likely to lead to data
vQ> corruption problems. frankly, i'd consider the design of ER essentially
vQ> flawed. removing const from ER's return is obviously not an option; it
vQ> would certainly be safer to have it return a pointer to a heap-allocated
vQ> buffer, but then the returned buffer would have to be freed at some
vQ> point in the client code, and this would require a whole cascade of
vQ> modifications to the r source code. since it usually is more funny to
vQ> introduce new bugs than repair old ones, i'd not expect r core to be
vQ> ever willing to invest time in this.
vQ> the following seem acceptable and relatively reasonable ways to address
vQ> the issue:
vQ> (1) noop: leave as is.
vQ> (2) minimal: adopt the (inessential) changes suggested in my previous post:
vQ> const char* dropTrailing(const char* s, ...) {
vQ> char *p = (char*) s;
vQ> // no further need for (char*) casting from p to replace
vQ> ... }
vQ> (3) modify petr's solution along the lines above, i.e., have the input
vQ> in the signature non-const and deconst-cast the output from ER outside
vQ> of the call to DT.
that's what I have adopted, as I'm sure you've noticed when you
saw the code above.
So, let's get to something more interesting,
"Happy Pentecoste!"
Martin
vQ> (4) modify petr's solution to operate on a local, non-const copy of s:
vQ> const char* dropTrailing(const char* s, ...) {
vQ> int length = strlen(s);
vQ> char *ss = malloc((length+1)*sizeof(char));
vQ> memcpy(ss, s, length+1);
vQ> // work on ss rather than on s
vQ> ...
vQ> return ss; }
vQ> and don't forget to deallocate the return from DT after mkChar (or
vQ> whoever calls DT) has no more need for it. (an alternative is to use
vQ> the allocCharsxp approach of mkCharLenCE, but that would be overdoing
vQ> the job.)
vQ> to sum up: for a clean solution, i find (4) preferable. for
vQ> efficiency, (3) seems better, provided that you add a clear comment to
vQ> the effect that the assumption of non-modifiable output from ER is
vQ> purposefully violated. (2) is acceptable provided that DT is
vQ> appropriately documented (i.e., that it does not really treat s as const
vQ> char*, but effectively as non-const char*), and with a note as above.
vQ> a final, bitter remark: i'm surprised that it's so easy to have an r
vQ> core developer submit to r-devel a patch that violates an explicit
vQ> demand on the immutability of a shared, statically allocated piece of
vQ> memory. (even if it's the already existing code that actually is
vQ> problematic.)
vQ> no patch attached, you need to consider your options. hope this helps
vQ> anyway.
vQ> best,
vQ> vQ
More information about the R-devel
mailing list