[Rd] [bug report] Cyrillic letter "я" interrupts script execution via R source function

Mon Apr 9 10:42:25 CEST 2018

Hi Patrick,

thanks for your comments on the bug. Just to clarify: one could 
reproduce the bug simply using file() and readLines(). The parser saw 
what looked like a real end of file, as (incorrectly) communicated to it 
by the lower-level connections code. There is no related design issue in 
the parser (nor elsewhere); it was a bug in the connections code and is 
now fixed.

You can specify source encoding in "file()" or "source()" to tell R that 
the source file is in that given encoding. R will convert the file 
contents to the current native encoding of the R session. If in doubt, 
please check the documentation ?file, ?source, ?readLines, ?Encoding for 
the details.
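
To illustrate what that conversion does to the letter in question, here is
a minimal C sketch, assuming a CP1251 native encoding. R itself relies on
iconv for re-encoding; the helpers cp1251_from_codepoint() and
utf8_to_cp1251() below are hypothetical names made up for this example, and
they handle only ASCII and the Russian alphabet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of R): map one Unicode code point to its
   Windows-1251 byte. Returns -1 for anything outside ASCII and the
   Russian alphabet. The result is an int so that the valid byte 0xFF
   stays distinct from the -1 error value. */
int cp1251_from_codepoint(unsigned int cp)
{
    if (cp < 0x80) return (int)cp;             /* ASCII is unchanged */
    if (cp >= 0x0410 && cp <= 0x044F)          /* U+0410..U+044F -> 0xC0..0xFF */
        return (int)(cp - 0x0410 + 0xC0);
    if (cp == 0x0401) return 0xA8;             /* capital IO */
    if (cp == 0x0451) return 0xB8;             /* small io */
    return -1;
}

/* Hypothetical helper: convert a NUL-terminated UTF-8 string made of
   1- and 2-byte sequences to CP1251. Returns the number of bytes written
   to out, or (size_t)-1 on unsupported input or insufficient space. */
size_t utf8_to_cp1251(const char *in, unsigned char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s != 0) {
        unsigned int cp;
        if (s[0] < 0x80) {
            cp = s[0];
            s += 1;
        } else if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            /* 2-byte sequence: 110xxxxx 10xxxxxx */
            cp = ((unsigned int)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            s += 2;
        } else {
            return (size_t)-1;   /* longer sequences are out of scope here */
        }
        int b = cp1251_from_codepoint(cp);
        if (b < 0 || n >= outsize) return (size_t)-1;
        out[n++] = (unsigned char)b;
    }
    return n;
}
```

Passing the UTF-8 bytes of "я" (0xD1 0x8F) to utf8_to_cp1251() yields the
single byte 0xFF, which is the value the rest of this thread is about.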

The observation that "я" is represented as 0xff (-1 as signed char) and 
R_EOF/EOF is -1 (but integer) was related to the bug, well spotted.
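
The 0xff versus -1 confusion can be sketched in C as follows. This is not
the actual connections code, and the fix applied in R is not shown in this
thread; SKETCH_EOF, next_char_buggy(), next_char_fixed() and
count_until_eof() are hypothetical names. The sketch assumes a
two's-complement platform, where converting 0xFF to signed char yields -1.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the R_EOF value quoted below; the name is an assumption. */
#define SKETCH_EOF (-1)

typedef int (*reader_fn)(const unsigned char *buf, size_t len, size_t *pos);

/* Buggy pattern: the byte passes through a signed char, so 0xFF is
   sign-extended to -1 and becomes indistinguishable from SKETCH_EOF. */
int next_char_buggy(const unsigned char *buf, size_t len, size_t *pos)
{
    assert(buf != NULL && pos != NULL);
    if (*pos >= len) return SKETCH_EOF;
    signed char c = (signed char)buf[(*pos)++];
    return c;                       /* 0xFF comes back as -1 */
}

/* Safe pattern: widen the byte as unsigned, so 0xFF comes back as 255
   and only a genuine end of input returns SKETCH_EOF. */
int next_char_fixed(const unsigned char *buf, size_t len, size_t *pos)
{
    assert(buf != NULL && pos != NULL);
    if (*pos >= len) return SKETCH_EOF;
    return (int)buf[(*pos)++];
}

/* Count how many bytes a reader delivers before it reports end of input. */
size_t count_until_eof(reader_fn next, const unsigned char *buf, size_t len)
{
    size_t pos = 0, n = 0;
    while (next(buf, len, &pos) != SKETCH_EOF)
        n++;
    return n;
}
```

With the CP1251 bytes for "тряс" (0xF2 0xF0 0xFF 0xF1), count_until_eof()
stops after 2 bytes with the buggy reader, which matches the truncated
print("тр in the report, while the fixed reader delivers all 4.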

Best
Tomas

On 08/28/2017 02:24 PM, Patrick Perry wrote:
> My understanding (which could be wrong) is that when you source a file,
> it first gets translated to your native locale and then parsed. When you
> parse a character vector, it does not get translated.
>
> In your locale, every "я" character  (U+044F) gets replaced by the byte
> "\xFF":
>
>>   iconv("\u044f", "UTF-8", "Windows-1251")
> [1] "\xff"
>
> I suspect that particular value causes trouble for the R parser, which
> uses a stack of previously-seen characters (include/Defn.h):
>
> LibExtern char    R_ParseContext[PARSE_CONTEXT_SIZE] INI_as("");
>
> And at various places checks whether the context character is EOF. That
> character is defined as
>
> #define R_EOF    -1
>
> Which, when cast to a char, is 0xFF.
>
> I suspect that your example is revealing two bugs:
>
> 1) The R parser seems to have trouble with native characters encoded as
> 0xFF. It's possible that, since R strings can't contain 0x00, this can
> be fixed by changing the definition of R_EOF to
>
> #define R_EOF     0
>
>
> 2) The other bug is that, as I understand the situation, "source" will
> fail if the file contains a character that cannot be represented in your
> native locale. This is a harder bug to tackle because of the way file()
> and the other connection methods are designed, where they translate the
> input to your native locale. I don't know if it's possible to override
> this behavior, and have them translate input to UTF-8 instead.
>
>
>
> Patrick
>
>
> ---
>
> On Mon Aug 28 11:27:07 CEST 2017 Владимир Панфилов
> <vladimirpanfilov at gmail.com> wrote:
>
> Hello,
>
> I do not have an account on R Bugzilla, so I will post my bug report here.
> I want to report a very old bug in the base R *source()* function. It relates
> to sourcing some R scripts in UTF-8 encoding on Windows machines. For some
> reason, if the UTF-8 script contains the Cyrillic letter *"я"*, script
> execution is interrupted directly at this letter (by the way, the same scripts
> source fine when they are encoded in the system's CP1251 encoding).
>
> Let's consider the following script that prints random Russian words:
>
>
>> print("Осень")
>> print("Ёжик")
>> print("трясина")
>> print("тест")
>
> When this script is sourced we get an INCOMPLETE_STRING error:
>
>
>> source('D:/R code/test_cyr_letter.R', encoding = 'UTF-8', echo=TRUE)
>> Error in source("D:/R code/test_cyr_letter.R", encoding = "UTF-8", echo = TRUE) :
>>   D:/R code/test_cyr_letter.R:3:7: unexpected INCOMPLETE_STRING
>> 2: print("Ёжик")
>> 3: print("тр
>>          ^
>
> Note that this bug is not triggered when the same file is executed using
> *eval(parse(...))*:
>
>
>> > eval(parse('D:/R code/test_cyr_letter.R', encoding="UTF-8"))
>> [1] "Осень"
>> [1] "Ёжик"
>> [1] "трясина"
>> [1] "тест"
>
> I did some research and noticed that the *source* and *parse* functions have
> similar code for reading files. After analyzing the code of the *source()*
> function, I found that commenting out one line fixes this bug and
> the overridden function works fine. See this part of the *source()* function
> code:
>
>> ...
>>         filename <- file
>>         file <- file(filename, "r")
>>         # on.exit(close(file))  #### COMMENT THIS LINE ####
>>         if (isTRUE(keep.source)) {
>>           lines <- scan(file, what="character", encoding = encoding, sep = "\n")
>>           on.exit()
>>           close(file)
>>           srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1],
>>                                  isFile = TRUE)
>>         }
>> ...
>
> I do not fully understand this weird behaviour, so I ask the R Core
> developers for help in fixing this annoying bug, which prevents using
> Unicode scripts with Cyrillic on Windows.
> Maybe that part of the *source()* function should read files the way the
> *parse()* function does?
>
> *Session and encoding info:*
>
>> > sessionInfo()
>> R version 3.4.1 (2017-06-30)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> Running under: Windows 7 x64 (build 7601) Service Pack 1
>>
>> Matrix products: default
>>
>> locale:
>> [1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251  LC_MONETARY=Russian_Russia.1251
>> [4] LC_NUMERIC=C                    LC_TIME=Russian_Russia.1251
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_3.4.1 tools_3.4.1
>
>> > l10n_info()
>> $MBCS
>> [1] FALSE
>>
>> $`UTF-8`
>> [1] FALSE
>>
>> $`Latin-1`
>> [1] FALSE
>>
>> $codepage
>> [1] 1251
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel