[Rd] [bug report] Cyrillic letter "я" interrupts script execution via R source function
Tomas Kalibera
tomas.kalibera at gmail.com
Mon Apr 9 10:42:25 CEST 2018
Hi Patrick,
thanks for your comments on the bug. Just to clarify: one could
reproduce the bug simply using file() and readLines(). The parser saw an
end of file that was (incorrectly) reported to it by the lower-level
connections code. There is no related design issue in the parser (nor
elsewhere); it was a bug in the connections code and is now fixed.
You can specify source encoding in "file()" or "source()" to tell R that
the source file is in that given encoding. R will convert the file
contents to the current native encoding of the R session. If in doubt,
please check the documentation ?file, ?source, ?readLines, ?Encoding for
the details.
The observation that "я" is represented as 0xff (-1 as signed char) and
R_EOF/EOF is -1 (but integer) was related to the bug, well spotted.
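
To illustrate the mechanism only, here is a minimal C sketch of how a data
byte 0xff can be mistaken for end of file. It is not the actual R
connections code: SKETCH_EOF, next_byte_buggy, next_byte_fixed and
count_until_eof are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for R_EOF / EOF: an int, not a char. */
#define SKETCH_EOF (-1)

/* Signature shared by both readers: return the next byte or SKETCH_EOF. */
typedef int (*next_byte_fn)(const unsigned char *buf, size_t len, size_t *pos);

/* Buggy pattern: the byte passes through a signed char before being widened
   to int, so 0xFF sign-extends to -1 and looks exactly like SKETCH_EOF. */
static int next_byte_buggy(const unsigned char *buf, size_t len, size_t *pos)
{
    if (*pos >= len)
        return SKETCH_EOF;
    signed char c = (signed char) buf[(*pos)++];
    return c;
}

/* Fixed pattern: widen directly from unsigned char, so data bytes map to
   0..255 and can never collide with SKETCH_EOF. */
static int next_byte_fixed(const unsigned char *buf, size_t len, size_t *pos)
{
    if (*pos >= len)
        return SKETCH_EOF;
    return buf[(*pos)++];
}

/* Count how many bytes a reader delivers before it reports end of file. */
static size_t count_until_eof(next_byte_fn next, const unsigned char *buf,
                              size_t len)
{
    size_t pos = 0, n = 0;
    assert(next != NULL && buf != NULL);
    while (next(buf, len, &pos) != SKETCH_EOF)
        n++;
    return n;
}
```

For the Windows-1251 bytes 0xF2 0xF0 0xFF 0x22 0x29 (the text тря") from
the script), count_until_eof returns 2 with the buggy reader and 5 with
the fixed one, which is consistent with the parser stopping after
print("тр in the report.
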
Best
Tomas
On 08/28/2017 02:24 PM, Patrick Perry wrote:
> My understanding (which could be wrong) is that when you source a file,
> it first gets translated to your native locale and then parsed. When you
> parse a character vector, it does not get translated.
>
> In your locale, every "я" character (U+044F) gets replaced by the byte
> "\xFF":
>
> > iconv("\u044f", "UTF-8", "Windows-1251")
> [1] "\xff"
>
> I suspect that particular value causes trouble for the R parser, which
> uses a stack of previously-seen characters (include/Defn.h):
>
> LibExtern char R_ParseContext[PARSE_CONTEXT_SIZE] INI_as("");
>
> And at various places checks whether the context character is EOF. That
> character is defined as
>
> #define R_EOF -1
>
> Which, when cast to a char, is 0xFF.
>
> I suspect that your example is revealing two bugs:
>
> 1) The R parser seems to have trouble with native characters encoded as
> 0xFF. It's possible that, since R strings can't contain 0x00, this can
> be fixed by changing the definition of R_EOF to
>
> #define R_EOF 0
>
>
> 2) The other bug is that, as I understand the situation, "source" will
> fail if the file contains a character that cannot be represented in your
> native locale. This is a harder bug to tackle because of the way file()
> and the other connection methods are designed, where they translate the
> input to your native locale. I don't know if it's possible to override
> this behavior, and have them translate input to UTF-8 instead.
>
>
>
> Patrick
>
>
> ---
>
> On Mon Aug 28 11:27:07 CEST 2017 Владимир Панфилов
> <vladimirpanfilov at gmail.com> wrote:
>
> Hello,
>
> I do not have an account on R Bugzilla, so I will post my bug report here.
> I want to report a very old bug in the base R *source()* function. It
> relates to sourcing R scripts in UTF-8 encoding on Windows machines. For
> some reason, if the UTF-8 script contains the Cyrillic letter *"я"*,
> script execution is interrupted directly at this letter (by the way, the
> same scripts source fine when they are encoded in the system's CP1251
> encoding).
>
> Let's consider the following script that prints random Russian words:
>
>     print("Осень")
>     print("Ёжик")
>     print("трясина")
>     print("тест")
>
> When this script is sourced, we get an INCOMPLETE_STRING error:
>
>     source('D:/R code/test_cyr_letter.R', encoding = 'UTF-8', echo=TRUE)
>     Error in source("D:/R code/test_cyr_letter.R", encoding = "UTF-8", echo = TRUE) :
>       D:/R code/test_cyr_letter.R:3:7: unexpected INCOMPLETE_STRING
>     2: print("Ёжик")
>     3: print("тр
>              ^
>
> Note that this bug is not triggered when the same file is executed using
> *eval(parse(...))*:
>
>     > eval(parse('D:/R code/test_cyr_letter.R', encoding="UTF-8"))
>     [1] "Осень"
>     [1] "Ёжик"
>     [1] "трясина"
>     [1] "тест"
>
> I did some research and noticed that the *source* and *parse* functions
> have similar code for reading files. After analyzing the code of the
> *source()* function, I found that commenting out one line fixes this bug
> and the overridden function works fine. See this part of the *source()*
> function code:
>
>     ...
>     filename <- file
>     file <- file(filename, "r")
>     # on.exit(close(file)) #### COMMENT THIS LINE ####
>     if (isTRUE(keep.source)) {
>         lines <- scan(file, what="character", encoding = encoding, sep = "\n")
>         on.exit()
>         close(file)
>         srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1],
>                                isFile = TRUE)
>     }
>     ...
>
> I do not fully understand this strange behaviour, so I ask for the help
> of the R Core developers to fix this annoying bug, which prevents using
> Unicode scripts with Cyrillic on Windows.
> Maybe that part of the *source()* function should read files the way the
> *parse()* function does?
>
> *Session and encoding info:*
>
>     > sessionInfo()
>     R version 3.4.1 (2017-06-30)
>     Platform: x86_64-w64-mingw32/x64 (64-bit)
>     Running under: Windows 7 x64 (build 7601) Service Pack 1
>     Matrix products: default
>     locale:
>     [1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
>         LC_MONETARY=Russian_Russia.1251
>     [4] LC_NUMERIC=C LC_TIME=Russian_Russia.1251
>     attached base packages:
>     [1] stats graphics grDevices utils datasets methods base
>     loaded via a namespace (and not attached):
>     [1] compiler_3.4.1 tools_3.4.1
>
>
>     > l10n_info()
>     $MBCS
>     [1] FALSE
>     $`UTF-8`
>     [1] FALSE
>     $`Latin-1`
>     [1] FALSE
>     $codepage
>     [1] 1251
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel