[Rd] Possible `substr` bug in UTF-8 Corner Case
brodie gaslam
brodie.gaslam at yahoo.com
Thu Mar 29 03:53:03 CEST 2018
I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string. With a valgrind level 2 instrumented build of R-devel I get:
> string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==15375== Invalid read of size 1
==15375== at 0x45B3F0: substr (character.c:286)
==15375== by 0x45B3F0: do_substr (character.c:342)
==15375== by 0x4CFCB9: bcEval (eval.c:6775)
==15375== by 0x4D95AF: Rf_eval (eval.c:624)
==15375== by 0x4DAD12: R_execClosure (eval.c:1764)
==15375== by 0x4D9561: Rf_eval (eval.c:747)
==15375== by 0x507008: Rf_ReplIteration (main.c:258)
==15375== by 0x5073E7: R_ReplConsole (main.c:308)
==15375== by 0x507494: run_Rmainloop (main.c:1082)
==15375== by 0x41A8E6: main (Rmain.c:29)
==15375== Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
==15375== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15375== by 0x51033E: GetNewPage (memory.c:888)
==15375== by 0x511FC0: Rf_allocVector3 (memory.c:2691)
==15375== by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
==15375== by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
==15375== by 0x4657AC: coerceToVectorList (coerce.c:892)
==15375== by 0x4657AC: Rf_coerceVector (coerce.c:1293)
==15375== by 0x4660EB: ascommon (coerce.c:1369)
==15375== by 0x4667C0: do_asvector (coerce.c:1544)
==15375== by 0x4CFCB9: bcEval (eval.c:6775)
==15375== by 0x4D95AF: Rf_eval (eval.c:624)
==15375== by 0x4DAD12: R_execClosure (eval.c:1764)
==15375== by 0x515EF7: dispatchMethod (objects.c:408)
==15375== by 0x516379: Rf_usemethod (objects.c:458)
==15375== by 0x516694: do_usemethod (objects.c:543)
==15375==
[1] "abc<ee>"
Here is a patch for the C implementation of `substr` that highlights the problem and shows a possible fix. Basically, `substr` computes the byte width of a UTF-8 character from its leading byte ("\xEE" here, which implies 3 bytes) and then reads and writes that many bytes, even when the string actually ends before the implied end of the UTF-8 "character".
Index: src/main/character.c
===================================================================
--- src/main/character.c (revision 74482)
+++ src/main/character.c (working copy)
@@ -283,7 +283,7 @@
for (i = 0; i < so && str < end; i++) {
int used = utf8clen(*str);
if (i < sa - 1) { str += used; continue; }
- for (j = 0; j < used; j++) *buf++ = *str++;
+ for (j = 0; j < used && str < end; j++) *buf++ = *str++;
}
} else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;
The change above removed the valgrind error for me. I rebuilt R with the change and ran "make check", which seemed to work fine. I also ran some simple checks on UTF-8 strings, and things seem to work okay.
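To make the idea easier to try outside of R, here is a minimal standalone sketch of the bounded copy loop. It is not the code in character.c: `lead_byte_len` is a hypothetical stand-in for `utf8clen`, `bounded_substr` is an illustrative name, and the clamp in the skip branch is an extra precaution that is not part of the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for R's utf8clen(): the sequence length
   implied by the leading byte alone. */
static int lead_byte_len(unsigned char c)
{
    if (c < 0x80) return 1;              /* ASCII */
    if ((c & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;    /* 1110xxxx, e.g. 0xEE */
    if ((c & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    return 1;                            /* continuation or invalid byte */
}

/* Copy characters sa..so (1-based, inclusive) of str[0..len) into buf.
   buf must have room for len + 1 bytes. Returns the bytes written,
   not counting the terminating NUL. */
static size_t bounded_substr(const char *str, size_t len,
                             int sa, int so, char *buf)
{
    assert(str != NULL && buf != NULL && sa >= 1);
    const char *end = str + len;
    char *out = buf;

    for (int i = 0; i < so && str < end; i++) {
        int used = lead_byte_len((unsigned char) *str);
        if (i < sa - 1) {
            /* Skip this character, but never step past the end
               (assumption: this clamp is not in the posted patch). */
            size_t left = (size_t) (end - str);
            str += ((size_t) used < left) ? (size_t) used : left;
            continue;
        }
        /* The extra "str < end" test is the proposed fix: stop copying
           when the string ends inside an incomplete sequence. */
        for (int j = 0; j < used && str < end; j++) *out++ = *str++;
    }
    *out = '\0';
    return (size_t) (out - buf);
}
```

With the 4-byte input "abc\xEE" and a request for characters 1 to 10, this sketch copies exactly 4 bytes and never reads past the end of the input, which mirrors what the one-line change is meant to achieve.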
I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism.
Apologies in advance if this turns out to be a false alarm caused by an error on my part.
Best,
Brodie.
PS: apologies also if the formatting of this e-mail is bad. I have not figured out how to get plaintext working properly with yahoo.