[Rd] Possible `substr` bug in UTF-8 Corner Case

Thu Mar 29 03:53:03 CEST 2018

I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string.  With a valgrind level 2 instrumented build of R-devel I get:

> string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==15375== Invalid read of size 1
==15375==    at 0x45B3F0: substr (character.c:286)
==15375==    by 0x45B3F0: do_substr (character.c:342)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x4D9561: Rf_eval (eval.c:747)
==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
==15375==    by 0x507494: run_Rmainloop (main.c:1082)
==15375==    by 0x41A8E6: main (Rmain.c:29)
==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
==15375==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15375==    by 0x51033E: GetNewPage (memory.c:888)
==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
==15375==    by 0x4660EB: ascommon (coerce.c:1369)
==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
==15375==    by 0x516379: Rf_usemethod (objects.c:458)
==15375==    by 0x516694: do_usemethod (objects.c:543)
==15375== 
[1] "abc<ee>"

Here is a patch for the native version of `substr` that highlights the problem and a possible fix.  Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character".

Index: src/main/character.c
===================================================================

--- src/main/character.c    (revision 74482)
+++ src/main/character.c    (working copy)
@@ -283,7 +283,7 @@
    for (i = 0; i < so && str < end; i++) {
        int used = utf8clen(*str);
        if (i < sa - 1) { str += used; continue; }
-        for (j = 0; j < used; j++) *buf++ = *str++;
+        for (j = 0; j < used && str < end; j++) *buf++ = *str++;
    }
     } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
    for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;

The change above removed the valgrind error for me.  I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay.

I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism.

Apologies in advance if this turns out to be a false alarm caused by an error on my part.

Best,

Brodie.

PS: apologies also if the formatting of this e-mail is bad.  I have not figured out how to get plaintext working properly with yahoo.