[R] Encoding issue
Sebastien Bihorel
Mon Nov 5 14:36:13 CET 2018
Hi,
I cannot get the same output when processing the same markdown files on two different Linux systems (a laptop running Linux Mint 18.3 and a production server running CentOS 7). I think this boils down to an encoding issue, but I am not sure whether it is a system-wide issue or an R issue. Here is what I have so far.
I have a very small dummy HTML file (with the same md5sum on both systems) that contains only 3 characters followed by a newline. An "od -cx" call gives the same output on both systems:
0000000   r 342 200 231   s  \n
           e272    9980    0a73
The middle character is a right single quotation mark (U+2019; UTF-8 bytes E2 80 99, i.e. octal 342 200 231), produced by the conversion of a ' character from markdown to HTML. Reading the same file on both systems and applying a gsub replacement gives very different results.
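As a cross-check of the od dump, here is a minimal sketch that rebuilds the six bytes in R and decodes them as UTF-8. The byte values are taken from the od output; everything else (variable names, the decision to drop the trailing newline) is only for illustration.

```r
# Bytes reported by od for the test file: r, E2 80 99, s, newline
bytes <- as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73, 0x0a))

# Octal 342 200 231 (the od -c column) is hexadecimal E2 80 99
octal <- as.raw(strtoi(c("342", "200", "231"), base = 8L))
stopifnot(identical(octal, bytes[2:4]))

# Drop the newline, turn the bytes into a string and declare it UTF-8
txt <- rawToChar(bytes[1:5])
Encoding(txt) <- "UTF-8"

# Unicode code points of the decoded string
cp <- utf8ToInt(txt)
print(sprintf("U+%04X", cp))
# [1] "U+0072" "U+2019" "U+0073"
```

The hexadecimal words printed by od -x (e272 9980 0a73) are the same bytes shown as little-endian 16-bit pairs, so both columns of the dump agree with this decoding.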
#### On my laptop
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "r’s"
> gsub('\\s{2,}', ' ', x)
[1] "r’s"
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.4
#### On the server
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "râs"
> gsub('\\s{2,}', ' ', x)
[1] " "
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.3
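Since both sessions report the same locale settings, one way to narrow this down could be to compare how each R session actually sees the bytes. The following is only a sketch under assumptions: inspect_utf8_file is a hypothetical helper (not part of R or of any package), the file is assumed to be UTF-8, and reading with readLines(encoding = "UTF-8") plus gsub(perl = TRUE) is one possible variation to try, not a confirmed fix for the server.

```r
# Hypothetical helper: summarise how the current session sees a file
# that is assumed to be UTF-8 encoded.
inspect_utf8_file <- function(path) {
  # encoding = "UTF-8" only marks the strings as UTF-8; it does not re-encode
  x <- readLines(path, encoding = "UTF-8", warn = FALSE)
  list(
    ctype    = Sys.getlocale("LC_CTYPE"),   # locale R is really running in
    utf8     = l10n_info()[["UTF-8"]],      # does R consider it a UTF-8 locale?
    encoding = Encoding(x),                 # declared encoding of each line
    valid    = validUTF8(x),                # are the bytes valid UTF-8?
    bytes    = lapply(x, charToRaw),        # raw bytes as read
    nchars   = nchar(x, type = "chars", allowNA = TRUE)
  )
}

# Recreate the 6-byte test file from the od output
tmp <- tempfile(fileext = ".html")
writeBin(as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73, 0x0a)), tmp)

info <- inspect_utf8_file(tmp)
str(info)

# Same substitution as in the sessions above, but on strings marked as
# UTF-8 and with the PCRE engine instead of the default one
x <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
y <- gsub("\\s{2,}", " ", x, perl = TRUE)
```

Running the helper on both machines and comparing the ctype, utf8 and nchars entries should show whether the two sessions disagree about the locale or only about how the string is interpreted. The "râs" printed on the server is at least consistent with the three bytes E2 80 99 being treated as single-byte characters, but that remains to be confirmed.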
(The overarching constraint is that I must use the production server for SOP reasons, so I cannot simply ignore the problem and work on my laptop.)
I would appreciate any suggestions on how to approach this issue.