[Rd] Possible `substr` bug in UTF-8 Corner Case

Tomas Kalibera tomas.kalibera at gmail.com
Thu Mar 29 15:11:12 CEST 2018


Thanks, fixed in R-devel (by checking validity of UTF-8 strings for 
substr/substring).
Tomas

On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote:
> I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string.  With a valgrind level 2 instrumented build of R-devel I get:
>
>> string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
>> Encoding(string) <- "UTF-8"
>> substr(string, 1, 10)
> ==15375== Invalid read of size 1
> ==15375==    at 0x45B3F0: substr (character.c:286)
> ==15375==    by 0x45B3F0: do_substr (character.c:342)
> ==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
> ==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
> ==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
> ==15375==    by 0x4D9561: Rf_eval (eval.c:747)
> ==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
> ==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
> ==15375==    by 0x507494: run_Rmainloop (main.c:1082)
> ==15375==    by 0x41A8E6: main (Rmain.c:29)
> ==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
> ==15375==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==15375==    by 0x51033E: GetNewPage (memory.c:888)
> ==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
> ==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
> ==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
> ==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
> ==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
> ==15375==    by 0x4660EB: ascommon (coerce.c:1369)
> ==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
> ==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
> ==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
> ==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
> ==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
> ==15375==    by 0x516379: Rf_usemethod (objects.c:458)
> ==15375==    by 0x516694: do_usemethod (objects.c:543)
> ==15375==
> [1] "abc<ee>"
>
> Here is a patch for the native version of `substr` that highlights the problem and a possible fix.  Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character".
>
> Index: src/main/character.c
> ===================================================================
> --- src/main/character.c    (revision 74482)
> +++ src/main/character.c    (working copy)
> @@ -283,7 +283,7 @@
>      for (i = 0; i < so && str < end; i++) {
>          int used = utf8clen(*str);
>          if (i < sa - 1) { str += used; continue; }
> -        for (j = 0; j < used; j++) *buf++ = *str++;
> +        for (j = 0; j < used && str < end; j++) *buf++ = *str++;
>      }
>       } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
>      for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;
>
> The change above removed the valgrind error for me.  I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay.
>
> I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism.
>
> Apologies in advance if this turns out to be a false alarm caused by an error on my part.
>
> Best,
>
> Brodie.
>
> PS: apologies also if the formatting of this e-mail is bad.  I have not figured out how to get plaintext working properly with yahoo.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list