[Rd] [External] Re: Segfault when parsing UTF-8 text with srcrefs
Tomas Kalibera
tomas.kalibera using gmail.com
Thu May 30 10:00:52 CEST 2024
On 5/30/24 09:29, Barry Rowlingson wrote:
> I get an R error and no segfault:
>
> > parse(textConnection(text), srcfile = srcfile)
> Error in parse(textConnection(text), srcfile = srcfile) :
> test.r:1:1: unexpected $end
> 1: ×
> ^
>
> This is R 4.3.0, so maybe the bug has been introduced since then...
Thanks, I am looking into it and have found the cause; I am now testing
a patch. The bug has been in the code for a long time, but whether it
causes a crash is non-deterministic: the faulty code reads out of
bounds, so the outcome depends on memory layout and content.
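
For readers unfamiliar with this class of bug, here is a minimal C
sketch of the kind of bounds check that is missing. It is an
illustration only, not the R parser source and not the patch: the table
`token_names`, its entries, and the function `token_name_checked` are
hypothetical stand-ins for a Bison-style token-name array such as
yytname. Indexing such an array with a negative token reads whatever
memory happens to precede it, which is why the result varies between a
clean error and a crash.

```c
#include <stddef.h>

/* Hypothetical token-name table, standing in for a Bison-generated
   yytname array. The entries are made up for illustration. */
static const char *const token_names[] = {
    "$end", "error", "$undefined", "STR_CONST", "NUM_CONST", "SYMBOL"
};

#define N_TOKEN_NAMES (sizeof token_names / sizeof token_names[0])

/* Checked lookup: returns NULL for any token that does not index a
   valid entry (negative or past the end), instead of reading out of
   bounds as a plain token_names[token] would. */
static const char *token_name_checked(int token)
{
    if (token < 0 || (size_t)token >= N_TOKEN_NAMES)
        return NULL;
    return token_names[token];
}
```

A caller would treat a NULL result as a parse error rather than using
the name.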
Tomas
>
> Version and system info:
>
> > version
>                _
> platform       x86_64-pc-linux-gnu
> arch           x86_64
> os             linux-gnu
> system         x86_64, linux-gnu
> status
> major          4
> minor          3.0
> year           2023
> month          04
> day            21
> svn rev        84292
> language       R
> version.string R version 4.3.0 (2023-04-21)
> nickname       Already Tomorrow
>
> > sessionInfo()
> R version 4.3.0 (2023-04-21)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 22.04.4 LTS
>
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> time zone: Europe/London
> tzcode source: system (glibc)
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.3.0
>
> On Tue, May 28, 2024 at 7:42 PM Tomas Kalibera
> <tomas.kalibera using gmail.com> wrote:
>
> On 5/28/24 19:35, Hadley Wickham wrote:
> > Hi all,
> >
> > When I run the following code, R segfaults:
> >
> > text <- "×"
> > srcfile <- srcfilecopy("test.r", text)
> > parse(textConnection(text), srcfile = srcfile)
> >
> > It doesn't segfault if text is ASCII, or it's not wrapped in
> > textConnection, or srcfile isn't set.
>
> Thanks, this is because the R parser doesn't support non-ASCII UTF-8
> outside string literals and comments, plus a missing bounds check. The
> "correct" result should be an R error, which I get in a debug build.
>
> The tokenizer ends up with a negative token, and then, when the parse
> data are being finalized and a table of token names is created, there
> is an out-of-bounds access (yytname array). Probably the check should
> go right away into the tokenizer.
>
> Tomas
>
> >
> > Hadley
> >
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>