[Rd] NEWS item for bugfix in normalizePath and file.exists?
Tomas Kalibera
tomas.kalibera at gmail.com
Wed Apr 28 17:32:04 CEST 2021
Hi Toby,
On 4/28/21 4:21 PM, Toby Hocking wrote:
> Hi Tomas, thanks for the thoughtful reply. That makes sense about the
> problems with C locale on Windows. Actually I did not choose to use C
> locale, but instead it was invoked automatically during a package check.
I see. As long as the tests contain only ASCII strings, the encoding
does not matter, but once other characters are involved, I think we
should be running with some real encoding, one in which those
characters can be represented.
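
One rough way to check whether test strings stay within ASCII, in which case the encoding cannot matter, is sketched below. It uses base R only, and is_ascii is a made-up helper name, not an existing R function.

```r
# Hypothetical helper: TRUE for each element that consists only of
# ASCII bytes (all byte values below 128), FALSE otherwise.
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# The second string is the UTF-8 byte sequence of an emoji.
ascii_flags <- is_ascii(c("abc", "\360\237\247\222"))
print(ascii_flags)
```

If any element comes back FALSE, l10n_info() can be used to see whether the session is currently running in a UTF-8 or Latin-1 locale.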
Best,
Tomas
> To be clear, I do NOT have a file with that name, but I do want
> file.exists to return a reasonable value, FALSE (with no error). If
> that behavior is unspecified, then should I use something like
> tryCatch(file.exists(x), error=function(e)FALSE) instead of assuming
> that file.exists will always return a logical vector without error?
> For my particular application that work-around should probably be
> sufficient, but one may imagine a situation where you want to do
>
> x <- paste0(
>   "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| ",
>   "\360\237\247\222\360\237\217\274\n| ",
>   "\360\237\247\222\360\237\217\275\n| ",
>   "\360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n")
> Encoding(x) <- "unknown"
> Sys.setlocale(locale="C")
> f <- tempfile()
> cat("", file = f)
> two <- c(x, f)
> file.exists(two)
>
> and in that case the correct response from R, in my opinion, would be
> c(FALSE, TRUE) -- not an error.
> Toby
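
A minimal sketch of the tryCatch work-around mentioned above, applied element by element so that one problematic name does not hide the result for the others. The name file_exists_safe is hypothetical, and treating an error as FALSE is an assumption about what the caller wants, not documented behavior of file.exists.

```r
# Hypothetical wrapper: call file.exists() on one path at a time and
# map any error to FALSE, so the result is always a logical vector.
file_exists_safe <- function(paths) {
  vapply(paths, function(p) {
    tryCatch(file.exists(p), error = function(e) FALSE)
  }, logical(1), USE.NAMES = FALSE)
}

f <- tempfile()
cat("", file = f)                      # create an empty file
x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n"
res <- file_exists_safe(c(x, f))
print(res)
unlink(f)                              # clean up the temporary file
```

The per-element loop is slower than a single vectorized call, so this is only worth it where strange input names are expected.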
>
> On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera
> <tomas.kalibera at gmail.com> wrote:
>
> Hi Toby,
>
> a defensive, portable approach would be to use only file names
> regarded as portable by POSIX, that is, names made up of ASCII
> letters, digits, underscore, dot, and hyphen (but hyphen should not
> be the first character). That would always work on all systems and
> this is what I would use.
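
The rule described above can be sketched as a small check. The function name is_portable_filename is made up for illustration, and the check looks at a single file name component, not a full path.

```r
# Hypothetical helper: TRUE if a file name uses only ASCII letters,
# digits, underscore, dot and hyphen, and does not start with a hyphen.
# perl = TRUE keeps the character ranges independent of the locale;
# \z anchors at the very end, so a trailing newline is rejected.
is_portable_filename <- function(name) {
  grepl("^[A-Za-z0-9._][A-Za-z0-9._-]*\\z", name, perl = TRUE)
}

portable <- is_portable_filename(c("data_01.csv", "-rf", "my file.txt", ""))
print(portable)
```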
>
> Individual operating systems and file systems and their
> configurations differ in which additional characters they support
> and how. On some, file names are just sequences of bytes; on some,
> they have to be valid strings in a certain encoding (and then with
> certain exceptions).
>
> On Windows, file names are at the lowest level in UTF-16LE encoding
> (and admitting unpaired surrogates for historical reasons). R stores
> strings in other encodings (UTF-8, native, Latin-1), so file names
> have to be translated to/from UTF-16LE, either directly by R or by
> Windows.
>
> But there is no way to convert (non-ASCII) strings in "C" encoding
> to UTF-16LE, so the examples cannot be made to work on Windows.
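
To illustrate the translation step, here is a sketch using iconv() from base R. It is not the code R uses internally for file names. Declaring the source encoding as "ASCII" is used here as a stand-in for the "C" encoding, which is an assumption made for the example.

```r
# UTF-8 bytes of a single emoji (U+1F9D2), written as octal escapes.
x <- "\360\237\247\222"

# With the bytes declared as UTF-8, conversion to UTF-16LE succeeds
# and yields a surrogate pair (two 16-bit units, four bytes).
good <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(good)

# With the same bytes declared as plain ASCII (stand-in for "C"),
# there is nothing valid to convert from, so no result is produced.
bad <- iconv(x, from = "ASCII", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(is.null(bad))
```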
>
> When the translation is left to Windows, it assumes the non-UTF-16LE
> strings are in the Active Code Page encoding (shown as "system
> encoding" in sessionInfo() in R, Latin-1 in your example) instead of
> the current C library encoding ("C" in your example). So, file names
> coming from Windows will be either the bytes of their UTF-16LE
> representation or the bytes of their Latin-1 representation, but
> which one is subject to implementation details, so the result is
> really unusable.
>
> I would say using "C" as encoding in R is not a good idea, and
> particularly not on Windows.
>
> I would say that what happens with such file names in "C" encoding is
> unspecified behavior, which is subject to change at any time without
> notice, and that both the R 4.0.5 and R-devel behavior you are
> observing are acceptable. I don't think it should be mentioned in the
> NEWS. Personally, I would prefer some stricter checks of string
> validity and perhaps disallowing the "C" encoding in R, so yet
> another behavior where it would be clearer that this cannot really
> work, but that would require more thought and effort.
>
> Best
> Tomas
>
>
> On 4/27/21 9:53 PM, Toby Hocking wrote:
>
> > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed
> > in R-devel already. I checked on
> > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there
> > is no mention of these changes, so I'm wondering if they are
> > intentional? If so, could someone please add a mention of the
> > bugfix in the NEWS?
> >
> > The problem involves file.exists, on Windows, when a long/strange
> > input file name Encoding is unknown, in C locale. I expected that
> > FALSE should be returned (and it is on R-devel), but I got an error
> > in R-4.0.5. Code to reproduce is:
> >
> > x <- paste0(
> >   "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| ",
> >   "\360\237\247\222\360\237\217\274\n| ",
> >   "\360\237\247\222\360\237\217\275\n| ",
> >   "\360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n")
> > Encoding(x) <- "unknown"
> > Sys.setlocale(locale="C")
> > sessionInfo()
> > file.exists(x)
> >
> > Output I got from R-4.0.5 was
> >
> >> sessionInfo()
> > R version 4.0.5 (2021-03-31)
> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > Running under: Windows 10 x64 (build 19042)
> >
> > Matrix products: default
> >
> > locale:
> > [1] C
> > system code page: 1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_4.0.5
> >> file.exists(x)
> > Error in file.exists(x) : file name conversion problem -- name too long?
> > Execution halted
> >
> > Output I got from R-devel was
> >
> >> sessionInfo()
> > R Under development (unstable) (2021-04-26 r80229)
> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > Running under: Windows 10 x64 (build 19042)
> >
> > Matrix products: default
> >
> > locale:
> > [1] C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_4.2.0
> >> file.exists(x)
> > [1] FALSE
> >
> > I also observed similar results when using normalizePath instead of
> > file.exists (error in R-4.0.5, no error in R-devel).
> >
> >> normalizePath(x) #R-4.0.5
> > Error in path.expand(path) : unable to translate 'p'
> > | p'p;
> > | p'p<
> > | p'p=
> > | p'p>
> > | p'p<bf>
> > ' to UTF-8
> > Calls: normalizePath -> path.expand
> > Execution halted
> >
> >> normalizePath(x) #R-devel
> > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> > Warning message:
> > In normalizePath(path.expand(path), winslash, mustWork) :
> >   path[1]="🧒
> > | 🧒🏻
> > | 🧒🏼
> > | 🧒🏽
> > | 🧒🏾
> > | 🧒🏿
> > ": The filename, directory name, or volume label syntax is incorrect
> >