[Rd] [bug report] Cyrillic letter "я" interrupts script execution via R source function

Tomas Kalibera tomas.kalibera at gmail.com
Mon Apr 9 10:00:12 CEST 2018


Hi Vladimir,

thanks for your report - this was really a bug, now fixed in R-devel and 
to appear in 3.5.0.

Apart from the bug, having source files in UTF-8 and reading them into R 
on Windows is perfectly fine; you just need to specify that they are in 
UTF-8. You also need to make sure R is running in a Russian locale 
(CP1251) if that is not the default. On my system, this works fine:

Sys.setlocale(locale="Russian")
source("russian_utf8.R", encoding="UTF-8")
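
To try this without a prepared file, here is a minimal self-contained
sketch. The temporary file and the variable names are made up for
illustration (they stand in for russian_utf8.R), and the check on the
native encoding is only my assumption about a convenient way to test
whether the current locale can represent Cyrillic at all.

```r
# Hypothetical stand-in for russian_utf8.R: one line of R code with a
# Cyrillic string, written with \u escapes so this snippet stays ASCII.
expected <- "\u0442\u0440\u044f\u0441\u0438\u043d\u0430"
code <- "word <- \"\u0442\u0440\u044f\u0441\u0438\u043d\u0430\""

path <- tempfile(fileext = ".R")
# useBytes = TRUE writes the UTF-8 bytes as they are, without
# re-encoding them to the native encoding of the session.
writeLines(code, path, useBytes = TRUE)

# source(encoding = "UTF-8") re-encodes the file to the native encoding
# while reading, so the native encoding has to be able to represent
# Cyrillic (CP1251 or UTF-8 can, a plain C locale cannot).
representable <- !is.na(iconv(expected, from = "UTF-8", to = ""))

ok <- NA
if (representable) {
  source(path, encoding = "UTF-8")
  # Compare after converting the sourced value to UTF-8.
  ok <- identical(enc2utf8(word), expected)
} else {
  message("Native encoding cannot represent Cyrillic; ",
          "switch to a Russian or UTF-8 locale first.")
}
```

If the message appears, call Sys.setlocale() as above (on Windows) and
run the snippet again.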

Best
Tomas


On 08/28/2017 11:27 AM, Владимир Панфилов wrote:
> Hello,
>
> I do not have an account on R Bugzilla, so I will post my bug report here.
> I want to report a very old bug in the base R *source()* function. It relates
> to sourcing R scripts in UTF-8 encoding on Windows machines. For some
> reason, if a UTF-8 script contains the Cyrillic letter *"я"*, script
> execution is interrupted directly at this letter (btw, the same scripts are
> sourced fine when they are encoded in the system's CP1251 encoding).
>
> Let's consider the following script that prints random Russian words:
>
>
>> print("Осень")
>> print("Ёжик")
>> print("трясина")
>> print("тест")
>
> When this script is sourced we get an INCOMPLETE_STRING error:
>
>
>> source('D:/R code/test_cyr_letter.R', encoding = 'UTF-8', echo=TRUE)
>> Error in source("D:/R code/test_cyr_letter.R", encoding = "UTF-8", echo = TRUE) :
>>   D:/R code/test_cyr_letter.R:3:7: unexpected INCOMPLETE_STRING
>> 2: print("Ёжик")
>> 3: print("тр
>>          ^
>
> Note that this bug is not triggered when the same file is executed using
> *eval(parse(...))*:
>
>
>> > eval(parse('D:/R code/test_cyr_letter.R', encoding="UTF-8"))
>> [1] "Осень"
>> [1] "Ёжик"
>> [1] "трясина"
>> [1] "тест"
>
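
A note on why the two routes can differ: according to ?parse, the
encoding argument of parse() only marks strings as UTF-8 and does not
re-encode the input, whereas source() re-encodes the file while reading
it. The sketch below shows the parse() route on a throwaway file; the
file, the environment and the variable names are made up for
illustration.

```r
# Two lines of R code with Cyrillic strings, written with \u escapes so
# that this snippet itself stays ASCII.
code <- c("x <- \"\u041e\u0441\u0435\u043d\u044c\"",
          "y <- \"\u0442\u0440\u044f\u0441\u0438\u043d\u0430\"")

path <- tempfile(fileext = ".R")
# Write the UTF-8 bytes unchanged.
writeLines(code, path, useBytes = TRUE)

# parse() reads the bytes as they are; encoding = "UTF-8" only marks the
# resulting strings as UTF-8.
exprs <- parse(file = path, encoding = "UTF-8")

# Evaluate in a separate environment to keep the workspace clean.
env <- new.env()
eval(exprs, envir = env)

same <- identical(env$y, "\u0442\u0440\u044f\u0441\u0438\u043d\u0430")
```

Because nothing is converted to the native encoding on this route, it
does not depend on the locale in the way source() with an encoding does.
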
> I did some research and noticed that the *source* and *parse* functions have
> similar code for reading files. After analyzing the code of the *source()*
> function I found that commenting out one line fixes this bug and
> the overridden function works fine. See this part of the *source()* function
> code:
>
>> ...
>>         filename <- file
>>         file <- file(filename, "r")
>>         # on.exit(close(file))  #### COMMENT THIS LINE ####
>>         if (isTRUE(keep.source)) {
>>           lines <- scan(file, what="character", encoding = encoding, sep = "\n")
>>           on.exit()
>>           close(file)
>>           srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1],
>>                                  isFile = TRUE)
>>         }
>> ...
>>
> I do not fully understand this weird behaviour, so I ask the R Core
> developers for help in fixing this annoying bug, which prevents using
> Unicode scripts with Cyrillic on Windows.
> Maybe that part of the *source()* function should read files the way the
> *parse()* function does?
>
> *Session and encoding info:*
>
>>> sessionInfo()
>> R version 3.4.1 (2017-06-30)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> Running under: Windows 7 x64 (build 7601) Service Pack 1
>> Matrix products: default
>> locale:
>> [1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251    LC_MONETARY=Russian_Russia.1251
>> [4] LC_NUMERIC=C                    LC_TIME=Russian_Russia.1251
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> loaded via a namespace (and not attached):
>> [1] compiler_3.4.1 tools_3.4.1
>
>
>>> l10n_info()
>> $MBCS
>> [1] FALSE
>> $`UTF-8`
>> [1] FALSE
>> $`Latin-1`
>> [1] FALSE
>> $codepage
>> [1] 1251
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


