[Rd] `basename` and `dirname` change the encoding to "UTF-8"

Tue Jun 30 16:50:55 CEST 2020

Hello, everyone,

thank you for your quick and helpful responses and the detailed information.

Sorry for not providing a reproducible example for the (potential) bug in `tools::makeLazyLoadDB`.  The main point of my mail was the surprising behaviour of `basename` and `dirname`.  Fixing those functions would probably solve my problem for me (as a workaround, probably hiding some underlying problem, and likely leading to a failure for someone else fighting with encodings).
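
For reference, the behaviour can be checked without depending on the encoding of the script file or on the session locale. The following is only a minimal sketch: `restore_encoding` is a hypothetical helper that is not part of R, and the marks reported for `dirname`/`basename` will differ between platforms and R versions.

```r
# Build the path from escapes and mark it explicitly, so the result does not
# depend on the encoding of the script file or on the session locale.
p <- "F\xf6\xf6/B\xe4r"
Encoding(p) <- "latin1"

# Compare the declared encoding before and after the path functions.
enc_report <- c(
  input    = Encoding(p),
  dirname  = Encoding(dirname(p)),
  basename = Encoding(basename(p))
)
print(enc_report)

# Hypothetical helper (not part of R): re-encode a single string `x` so that
# it carries the same declared encoding as `template`.
restore_encoding <- function(x, template) {
  stopifnot(length(x) == 1L, length(template) == 1L)
  x_utf8 <- enc2utf8(x)
  switch(Encoding(template),
    "UTF-8"  = x_utf8,
    "latin1" = iconv(x_utf8, from = "UTF-8", to = "latin1"),
    x  # "unknown" or "bytes": leave the input untouched
  )
}

d <- restore_encoding(dirname(p), p)
```

Such a helper only works where I call `dirname` myself; it cannot help when `dirname` is called internally, as in `tools::makeLazyLoadDB`.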

Concerning my underlying problem with `tools::makeLazyLoadDB`, I'm having difficulty making my example reproducible.  I'm trying to use a directory with a non-ASCII name for a knitr cache.  My R-4.0.0 here behaves differently from my R-3.6.3, but when I filed a bug report with knitr, Yihui could not reproduce this difference (https://github.com/yihui/knitr/issues/1840).  So I'll try R-4.0.2 next and see what happens.

Cheers
Johannes

> Sent: Tuesday, 30 June 2020 at 09:25
> From: "Tomas Kalibera" <tomas.kalibera using gmail.com>
> To: "Johannes Rauh" <JARauh using web.de>, "r-devel" <r-devel using r-project.org>
> Subject: Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"
>
> On 6/29/20 4:39 PM, Johannes Rauh wrote:
> > Dear R Developers,
> >
> > I noticed that `basename` and `dirname` always return strings marked as "UTF-8" on Windows (tested with R-4.0.0 and R-3.6.3):
> >
> >> p <- "Föö/Bär"
> >> Encoding(p)
> > [1] "latin1"
> >> Encoding(dirname(p))
> > [1] "UTF-8"
> >> Encoding(basename(p))
> > [1] "UTF-8"
> >
> > Is this on purpose?  At least I did not find any relevant comment in the documentation of `dirname`/`basename`.
> > Background: I'm currently struggling with a directory name containing a latin1 character.  (I know that this is a bad idea, but I did not create the directory and I cannot rename it.)  I now want to pass a latin1 directory name to a function which internally uses `tools::makeLazyLoadDB`.  At that point, internally, `dirname` is called, which changes the encoding, and things break.  If I use `debug` to halt the processing and "fix" the encoding, things work as expected.
> >
> > So, if possible, I would prefer that `dirname` and `basename` preserve the encoding.
> 
> Please try to always submit a minimal reproducible example with your 
> reports and test with at least the latest released version of R, ideally 
> also with R-devel.
> 
> As you have not sent a reproducible example, it is hard to tell for 
> sure, but most likely, as Kevin wrote, you have run into a real bug, 
> which was however already fixed in 4.0.2 and in R-devel (17833). The lazy 
> loading cache did not work with file names in non-native encoding.
> 
> That real bug has been uncovered by legitimate and correct changes like 
> the ones you report, where file operations started returning non-ASCII 
> strings in UTF-8. Historically in R such functions would instead return 
> native strings with misrepresented characters, and we were reluctant to 
> change that, expecting it to wake up bugs in code silently assuming native 
> encoding. Still, as people were increasingly running into problems with 
> non-representable characters, we made that change in several functions 
> anyway, and yes, it started waking up bugs.
> 
> We could instead preferentially return results in native encoding, and 
> use UTF-8 only when they include non-representable characters. That 
> would add code complexity and performance overhead, but it would wake 
> up existing bugs with smaller probability.  Note that some code that 
> previously relied on best-fit conversions done by Windows will have 
> been broken anyway. We would have to bypass win_iconv/iconv for that 
> (adding more complexity). 
> Bugs in code not handling encodings properly would still be triggered 
> via non-representable characters. I've recently changed file.path() in 
> R-devel to be slightly more conservative again, along these lines.
> 
> We can still do it more widely, but it is not high on the priority list. 
> The way to fix all of these problems is to switch to UTF-8 as the native 
> encoding on Windows, and every day spent on tuning the existing behavior 
> postpones that real solution.
> 
> Best
> Tomas
> 
> 
> >
> > Best regards
> > Johannes