[Rd] Segfault when parsing UTF-8 text with srcrefs

Tue May 28 20:41:31 CEST 2024

On 5/28/24 19:35, Hadley Wickham wrote:
> Hi all,
>
> When I run the following code, R segfaults:
>
> text <- "×"
> srcfile <- srcfilecopy("test.r", text)
> parse(textConnection(text), srcfile = srcfile)
>
> It doesn't segfault if text is ASCII, or it's not wrapped in
> textConnection, or srcfile isn't set.

Thanks, this is because the R parser doesn't support non-ASCII UTF-8
outside string literals and comments, plus a missing bounds check. The
"correct" result should be an R error, which I get in a debug build.

The tokenizer ends up with a negative token. Later, when the parse data
are being finalized and a table of token names is created, that token
causes an out-of-bounds access to the yytname array. The check should
probably go directly into the tokenizer.
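
To illustrate the kind of bounds check meant here, a minimal sketch
follows. It is not the actual R source or the eventual fix: token_names
is a small hypothetical stand-in for the generated yytname table,
token_name_checked is a made-up helper, and the sketch assumes the
table is indexed directly by the token value, which may not match what
gram.y really does.

```c
#include <stddef.h>

/* Hypothetical stand-in for the generated yytname table; the real
   table in R's parser is much larger and has different contents. */
const char *const token_names[] = {
    "$end", "error", "$undefined", "SYMBOL", "STR_CONST"
};

#define N_TOKEN_NAMES (sizeof token_names / sizeof token_names[0])

/* Hypothetical helper: return the name for a token index, or NULL when
   the index falls outside the table. The negative case is rejected
   first, so a negative token never reaches the array access. */
const char *token_name_checked(int tok)
{
    if (tok < 0 || (size_t) tok >= N_TOKEN_NAMES)
        return NULL;
    return token_names[tok];
}
```

A caller would treat a NULL result as a parse error rather than
dereferencing it, which matches the expected behaviour of getting an R
error instead of a crash.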

Tomas

>
> Hadley
>