[Rd] `basename` and `dirname` change the encoding to "UTF-8"
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Tue Jun 30 09:25:27 CEST 2020
On 6/29/20 4:39 PM, Johannes Rauh wrote:
> Dear R Developers,
>
> I noticed that `basename` and `dirname` always return "UTF-8" on Windows (tested with R-4.0.0 and R-3.6.3):
>
>> p <- "Föö/Bär"
>> Encoding(p)
> [1] "latin1"
>> Encoding(dirname(p))
> [1] "UTF-8"
>> Encoding(basename(p))
> [1] "UTF-8"
>
> Is this on purpose? At least I did not find any relevant comment in the documentation of `dirname`/`basename`.
> Background: I'm currently struggling with a directory name containing a latin1 character. (I know that this is a bad idea, but I did not create the directory and I cannot rename it.) I now want to pass a latin1 directory name to a function, which internally uses `tools::makeLazyLoadDB`. At that point, internally, `dirname` is called, which changes the encoding, and things break. If I use `debug` to halt the processing and "fix" the encoding, things work as expected.
>
> So, if possible, I would prefer that `dirname` and `basename` preserve the encoding.
Please try to always submit a minimal reproducible example with your
reports and test with at least the latest released version of R, ideally
also with R-devel.
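For the case quoted above, a minimal sketch of such an example could
look like the following. The path is the made-up one from the report,
written with \x escapes so that the script does not depend on the
encoding of the file it is saved in. The output is deliberately not
shown, as it depends on the platform, locale and R version.

```r
# Example path from the report; the bytes are given as latin1 escapes
# so that the script itself stays pure ASCII.
p <- "F\xf6\xf6/B\xe4r"
Encoding(p) <- "latin1"

# Declared encoding of the input and of the results
print(Encoding(p))
print(Encoding(dirname(p)))
print(Encoding(basename(p)))

# Include the R version and the locale in the report
print(R.version.string)
print(Sys.getlocale("LC_CTYPE"))
```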
As you have not sent a reproducible example, it is hard to tell for
sure, but most likely, as Kevin wrote, you have run into a real bug,
which has however already been fixed in 4.0.2 and in R-devel (17833).
The lazy loading cache did not work with file names in non-native
encoding.
That real bug has been uncovered by legitimate and correct changes like
the ones you report, where file operations started returning non-ASCII
strings in UTF-8. Historically in R such functions would instead return
native strings with misrepresented characters, and we were reluctant to
change that, expecting it to wake up bugs in code silently assuming
native encoding. Still, as people were increasingly running into
problems with non-representable characters, we made that change in
several functions anyway, and yes, it started waking up bugs.
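To illustrate what "silently assuming native encoding" can mean, here
is a small sketch at the R level (the affected code is often C code,
which is not shown here). The string is a made-up example. The same
text compares equal when R takes the declared encodings into account,
but its bytes differ, so any code working on the raw bytes sees two
different strings.

```r
# The same text, once marked as latin1 and once converted to UTF-8
x_latin1 <- "B\xe4r"
Encoding(x_latin1) <- "latin1"
x_utf8 <- enc2utf8(x_latin1)

# Encoding-aware comparison treats them as the same string
same_text <- x_latin1 == x_utf8

# The underlying bytes differ, which is what byte-level code sees
bytes_latin1 <- charToRaw(x_latin1)  # 42 e4 72
bytes_utf8 <- charToRaw(x_utf8)      # 42 c3 a4 72
same_bytes <- identical(bytes_latin1, bytes_utf8)

print(c(same_text = same_text, same_bytes = same_bytes))
```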
We could preferentially return results in native encoding, and in UTF-8
only when they include non-representable characters. That would increase
code complexity and performance overhead, but would wake up existing
bugs with smaller probability. Note that some code that previously
relied on best-fit conversions done by Windows will have been broken
anyway. We would have to bypass win_iconv/iconv for that (adding more
complexity). Bugs in code not handling encodings properly would still be
triggered via non-representable characters. I've recently changed
file.path() in R-devel to be slightly more conservative again, along
these lines. We can still do it more widely, but it is not high on the
priority list.
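As a rough R-level sketch of the "native when representable, UTF-8
otherwise" policy (this is not how R implements it internally, and
prefer_native is a hypothetical helper name used only for illustration):
each string is converted to the native encoding and back, and the native
form is kept only if that round trip reproduces the original text.

```r
# Hypothetical helper: return each element in native encoding when it
# survives a round trip through that encoding, and in UTF-8 otherwise.
prefer_native <- function(x) {
  stopifnot(is.character(x))
  u <- enc2utf8(x)

  # Convert to the native encoding; elements that cannot be converted
  # become NA (sub = NA).
  n <- iconv(u, from = "UTF-8", to = "", sub = NA)

  # Convert back and compare, to also catch lossy conversions
  back <- iconv(n, from = "", to = "UTF-8", sub = NA)
  ok <- (!is.na(n)) & (!is.na(back)) & (back == u)

  out <- u
  out[ok] <- enc2native(u[ok])
  out
}

res <- prefer_native(c("abc", "B\u00e4r", NA))
print(Encoding(res))
```

The extra conversions for every element are a small-scale version of
the overhead and complexity mentioned above.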
The way to fix all of these problems is switching to UTF-8 as native
encoding on Windows, and every day spent on tuning the existing behavior
postpones that real solution.
Best
Tomas
>
> Best regards
> Johannes