[Rd] `basename` and `dirname` change the encoding to "UTF-8"
Johannes Rauh
JARauh using web.de
Tue Jun 30 16:50:55 CEST 2020
Hello, everyone,
thank you for your quick and helpful responses and the detailed information.
Sorry for not providing a reproducible example for the (potential) bug in `tools::makeLazyLoadDB`. The main point of my mail was the surprising behaviour of `basename` and `dirname`. Fixing those functions would probably solve my problem for me (as a workaround, probably hiding some underlying problem, and likely leading to a failure for someone else fighting with encodings).
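A minimal sketch of such a workaround, for illustration only: re-encode the result of `dirname`/`basename` to the native encoding whenever that round-trips without loss, and keep UTF-8 otherwise. The helpers `prefer_native`, `dirname_native` and `basename_native` are hypothetical names, not part of R, and this has not been tested against `tools::makeLazyLoadDB`:

```r
# Hypothetical helper (not part of base R): return strings in the native
# encoding when that is lossless, and in UTF-8 otherwise.
prefer_native <- function(x) {
  stopifnot(is.character(x))
  utf8 <- enc2utf8(x)
  native <- enc2native(utf8)
  # The conversion is lossless only if converting back gives the same string;
  # non-representable characters are escaped or substituted and fail this check.
  lossless <- !is.na(utf8) & enc2utf8(native) == utf8
  out <- utf8
  out[lossless] <- native[lossless]
  out
}

# Hypothetical wrappers around the base functions.
dirname_native  <- function(path) prefer_native(dirname(path))
basename_native <- function(path) prefer_native(basename(path))

# Usage: compare the declared encodings before and after the wrapper.
# The result depends on the platform and locale.
p <- "F\u00f6\u00f6/B\u00e4r"
Encoding(dirname(p))
Encoding(dirname_native(p))
```

As noted above, this only hides the problem: code that does not handle encodings properly can still fail on paths with characters that the native encoding cannot represent.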
Concerning my underlying direct problem with `tools::makeLazyLoadDB`, I'm having difficulty making my example reproducible. I'm trying to use a directory with a non-ASCII name for a knitr cache. My R-4.0.0 here behaves differently from my R-3.6.3, but when I filed a bug report with knitr, Yihui could not reproduce this difference (https://github.com/yihui/knitr/issues/1840). So I'll try R-4.0.2 next, let's see what happens.
Cheers
Johannes
> Sent: Tuesday, 30 June 2020 at 09:25
> From: "Tomas Kalibera" <tomas.kalibera using gmail.com>
> To: "Johannes Rauh" <JARauh using web.de>, "r-devel" <r-devel using r-project.org>
> Subject: Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"
>
> On 6/29/20 4:39 PM, Johannes Rauh wrote:
> > Dear R Developers,
> >
> > I noticed that `basename` and `dirname` always return "UTF-8" on Windows (tested with R-4.0.0 and R-3.6.3):
> >
> > > p <- "Föö/Bär"
> > > Encoding(p)
> > [1] "latin1"
> > > Encoding(dirname(p))
> > [1] "UTF-8"
> > > Encoding(basename(p))
> > [1] "UTF-8"
> >
> > Is this on purpose? At least I did not find any relevant comment in the documentation of `dirname`/`basename`.
> > Background: I'm currently struggling with a directory name containing a latin1 character. (I know that this is a bad idea, but I did not create the directory and I cannot rename it.) I now want to pass a latin1 directory name to a function which internally uses `tools::makeLazyLoadDB`. At that point, internally, `dirname` is called, which changes the encoding, and things break. If I use `debug` to halt the processing and "fix" the encoding, things work as expected.
> >
> > So, if possible, I would prefer that `dirname` and `basename` preserve the encoding.
>
> Please try to always submit a minimal reproducible example with your
> reports and test with at least the latest released version of R, ideally
> also with R-devel.
>
> As you have not sent a reproducible example, it is hard to tell for
> sure, but most likely as Kevin wrote you have run into a real bug, which
> was however already fixed in 4.0.2 and in R-devel (17833). The lazy
> loading cache did not work with file names in non-native encoding.
>
> That real bug has been uncovered by legitimate and correct changes like
> the ones you report, where file operations started returning non-ASCII
> strings in UTF-8. Historically in R such functions would instead return
> native strings with misrepresented characters, and we were reluctant to
> change that, expecting it to wake up bugs in code that silently assumes
> native encoding. Still, as people were increasingly running into problems
> with non-representable characters, we made that change in several
> functions anyway, and yes, it started waking up bugs.
>
> We could preferentially return results in native encoding, and in UTF-8
> only when they include non-representable characters. That would
> increase code complexity and performance overhead, but would wake up
> existing bugs with smaller probability. Note that some code that
> previously relied on best-fit conversions done by Windows will have been
> broken anyway. We would have to bypass win_iconv/iconv for that (adding
> more complexity).
> Bugs in code not handling encodings properly would still be triggered
> via non-representable characters. I've recently changed file.path() in
> R-devel to be slightly more conservative again, along these lines.
>
> We can still do it more widely, but it is not high on the priority list.
> The way to fix all of these problems is switching to UTF-8 as native
> encoding on Windows and every day spent on tuning the existing behavior
> postpones that real solution.
>
> Best
> Tomas
>
>
> >
> > Best regards
> > Johannes
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel