[Rd] Segfault when parsing UTF-8 text with srcrefs
Tomas Kalibera
Fri May 31 16:36:11 CEST 2024
On 5/28/24 20:41, Tomas Kalibera wrote:
>
> On 5/28/24 19:35, Hadley Wickham wrote:
>> Hi all,
>>
>> When I run the following code, R segfaults:
>>
>> text <- "×"
>> srcfile <- srcfilecopy("test.r", text)
>> parse(textConnection(text), srcfile = srcfile)
>>
>> It doesn't segfault if text is ASCII, or it's not wrapped in
>> textConnection, or srcfile isn't set.
>
> Thanks, this is because the R parser doesn't support non-ASCII UTF-8
> characters outside string literals and comments, and a bounds check is
> missing. The "correct" result should be an R error, which I get in a
> debug build.
To be more precise, the current implementation of the parser allows a
bit more than that, but there are recommendations in WRE 1.1.5 "Package
subdirectories" on (not) using non-ASCII characters in packages.
"×" (\ud7) is not an allowed symbol name and the current implementation
should throw an error.
> The tokenizer ends up with a negative token, and then, when the parse
> data are being finalized and a table of token names is created, there
> is an out-of-bounds access (yytname array). The check should probably
> go directly into the tokenizer.
Fixed in R-devel.
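
For illustration only, the kind of bounds check described above could
look roughly like the sketch below. This is not the actual change made
in R-devel: the table token_names and the helper token_name_checked()
are hypothetical stand-ins for the generated yytname array and for
whatever code looks up token names when the parse data are finalized.

```c
#include <stddef.h>

/* Hypothetical stand-in for the parser-generated yytname table;
   the real table in R's parser has different contents. */
const char *const token_names[] = {
    "$end", "error", "$undefined", "SYMBOL", "NUM_CONST"
};

#define N_TOKEN_NAMES (sizeof token_names / sizeof token_names[0])

/* Look up a token name, refusing token numbers outside the table.
   Returns NULL for a negative or too-large token, so the caller can
   raise a parse error instead of reading out of bounds. */
const char *token_name_checked(int token)
{
    if (token < 0 || (size_t) token >= N_TOKEN_NAMES)
        return NULL;
    return token_names[token];
}
```

With this sketch, token_name_checked(-1) returns NULL rather than
indexing before the start of the array, and the caller can turn that
into an ordinary error. Doing the equivalent check in the tokenizer,
as suggested above, would reject the invalid token before it reaches
the parse data at all.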
Tomas
>
> Tomas
>
>>
>> Hadley
>>