[Rd] bug in strsplit?
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Fri May 29 09:49:43 CEST 2009
src/main/character.c:435-438 (do_strsplit) contains the following code:

    for (i = 0; i < tlen; i++)
        if (getCharCE(STRING_ELT(tok, 0)) == CE_UTF8) use_UTF8 = TRUE;
    for (i = 0; i < len; i++)
        if (getCharCE(STRING_ELT(x, 0)) == CE_UTF8) use_UTF8 = TRUE;

since both loops iterate over loop-invariant expressions and statements,
either the loops are redundant, or the fixed index '0' was meant to
actually be the variable i. i guess it's the latter, hence 'bug?' in
the subject.
it also appears that if *any* element of tok (or x) positively passes
the test, use_UTF8 is set to TRUE; in such a case, further checks make
no sense. the following rewrite cuts the inessential computation:

    for (i = 0; i < tlen; i++)
        if (getCharCE(STRING_ELT(tok, i)) == CE_UTF8) {
            use_UTF8 = TRUE;
            break;
        }
    for (i = 0; i < len; i++)
        if (getCharCE(STRING_ELT(x, i)) == CE_UTF8) {
            use_UTF8 = TRUE;
            break;
        }

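to see the difference in isolation, here is a minimal standalone sketch. it does not use R's API: enc_t, any_utf8_index0 and any_utf8_indexed are hypothetical names, and a plain array of encoding tags stands in for getCharCE(STRING_ELT(...)). it is only meant to show the two loop shapes side by side, not the actual code in character.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for R's encoding tag; not R's cetype_t. */
typedef enum { ENC_NATIVE = 0, ENC_UTF8 = 1 } enc_t;

/* Same shape as the quoted loops: the index is fixed at 0, so only
 * the first element is ever tested, n times over. */
static bool any_utf8_index0(const enc_t *enc, size_t n)
{
    bool use_utf8 = false;
    for (size_t i = 0; i < n; i++)
        if (enc[0] == ENC_UTF8) use_utf8 = true;
    return use_utf8;
}

/* Same shape as the proposed rewrite: index with i and stop at the
 * first UTF-8 element, since later elements cannot change the result. */
static bool any_utf8_indexed(const enc_t *enc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (enc[i] == ENC_UTF8) return true;
    return false;
}
```

for the tags { ENC_NATIVE, ENC_NATIVE, ENC_UTF8 }, any_utf8_index0 returns false because it never looks past the first element, while any_utf8_indexed returns true.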
since the pattern is repetitive, the following generic approach would
help (and the macro could possibly be reused in other places):

    #define CHECK_CE(CHARACTER, LENGTH, USEUTF8) \
        for (i = 0; i < (LENGTH); i++) \
            if (getCharCE(STRING_ELT((CHARACTER), i)) == CE_UTF8) { \
                (USEUTF8) = TRUE; \
                break; \
            }

    CHECK_CE(tok, tlen, use_UTF8)
    CHECK_CE(x, len, use_UTF8)

if you like it, i can provide a patch.
vQ