[Rd] NEWS item for bugfix in normalizePath and file.exists?

Tomas Kalibera tomas.kalibera using gmail.com
Wed Apr 28 17:32:04 CEST 2021


Hi Toby,

On 4/28/21 4:21 PM, Toby Hocking wrote:
> Hi Tomas, thanks for the thoughtful reply. That makes sense about the 
> problems with C locale on windows. Actually I did not choose to use C 
> locale, but instead it was invoked automatically during a package check.

I see. As long as the tests contain only ASCII strings, the encoding does 
not matter, but once there are other characters, I think we should 
be running with some real encoding, one in which those characters can be 
represented.
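
A minimal sketch of how a test script might guard against this, assuming it is acceptable to skip the non-ASCII cases when the session runs in the C locale. The helper names `has_non_ascii` and `in_c_locale` are hypothetical, not part of R:

```r
# Hypothetical helper: TRUE for each string that contains a byte above 0x7F.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical helper: TRUE when the character-type locale is C/POSIX.
in_c_locale <- function() {
  Sys.getlocale("LC_CTYPE") %in% c("C", "POSIX")
}

test_strings <- c("plain_ascii.txt", "\360\237\247\222")
if (in_c_locale() && any(has_non_ascii(test_strings))) {
  message("Skipping non-ASCII file name tests: LC_CTYPE is C/POSIX")
}
```

The check is done on bytes, so it does not depend on how the strings are marked by Encoding().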

Best,
Tomas

> To be clear, I do NOT have a file with that name, but I do want 
> file.exists to return a reasonable value, FALSE (with no error). If 
> that behavior is unspecified, then should I use something like 
> tryCatch(file.exists(x), error=function(e)FALSE) instead of assuming 
> that file.exists will always return a logical vector without error? 
> For my particular application that work-around should probably be 
> sufficient, but one may imagine a situation where you want to do
>
> x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| 
> \360\237\247\222\360\237\217\274\n| 
> \360\237\247\222\360\237\217\275\n| 
> \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> Encoding(x) <- "unknown"

> Sys.setlocale(locale="C")
> f <- tempfile()
> cat("", file = f)
> two <- c(x, f)
> file.exists(two)
>
> and in that case the correct response from R, in my opinion, would be 
> c(FALSE, TRUE) -- not an error.
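
Note that wrapping the whole call, as in tryCatch(file.exists(two), error=function(e)FALSE), would collapse the result to a single FALSE rather than c(FALSE, TRUE). A rough sketch of an element-wise work-around follows; `safe_file_exists` is a hypothetical name, and treating any error as "does not exist" is an assumption that may not suit every application:

```r
# Hypothetical wrapper: test each path separately so that an error for one
# element does not discard the answers for the others.
safe_file_exists <- function(paths) {
  vapply(paths, function(p) {
    tryCatch(file.exists(p), error = function(e) FALSE)
  }, logical(1), USE.NAMES = FALSE)
}

f <- tempfile()
file.create(f)
missing_path <- file.path(tempdir(), "no_such_file_for_this_example")
safe_file_exists(c(f, missing_path))
```

This always returns a logical vector of the same length as its input, at the cost of one file.exists call per element.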


> Toby
>
> On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera 
> <tomas.kalibera using gmail.com> wrote:
>
>     Hi Toby,
>
>     a defensive, portable approach would be to use only file names
>     regarded as portable by POSIX, that is, names made up of ASCII
>     letters, digits, underscore, dot and hyphen (but the hyphen should
>     not be the first character).
>     That would always work on all systems and this is what I would use.
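
A small sketch of such a check, assuming the POSIX portable filename character set (A-Z, a-z, 0-9, dot, underscore, hyphen, with no leading hyphen). The function name `is_portable_name` is hypothetical, and the check covers only the characters of a single name component, not length limits or reserved names:

```r
# Hypothetical helper: TRUE for names built only from the POSIX portable
# filename character set, with no leading hyphen.
is_portable_name <- function(x) {
  # perl = TRUE with useBytes = TRUE keeps the ranges byte-based, and
  # \A ... \z anchor the whole string (a trailing newline is rejected).
  grepl("\\A[A-Za-z0-9._][A-Za-z0-9._-]*\\z", x, perl = TRUE, useBytes = TRUE)
}

is_portable_name(c("data_01.csv", "-rf", "na\303\257ve.txt", "a b", ""))
```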
>
>     Individual operating systems and file systems and their
>     configurations
>     differ in which additional characters they support and how. On some,
>     file names are just sequences of bytes; on others, they have to be
>     valid strings in a certain encoding (and then with certain exceptions).
>
>     On Windows, file names are at the lowest level in UTF-16LE
>     encoding (and
>     admitting unpaired surrogates for historical reasons). R stores
>     strings
>     in other encodings (UTF-8, native, Latin-1), so file names have to be
>     translated to/from UTF-16LE, either directly by R or by Windows.
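
To illustrate the kind of translation meant here (this is only a sketch at the R level using iconv(); it is not how R's Windows file functions are implemented, and it assumes the platform's iconv supports the UTF-16LE name):

```r
# Convert one UTF-8 string to the bytes of its UTF-16LE representation.
x <- "\360\237\247\222"   # UTF-8 bytes of U+1F9D2, taken from the example
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(utf16)

# A plain ASCII letter becomes two bytes, low byte first.
ascii16 <- iconv("A", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(ascii16)
```

The conversion needs a known source encoding ("UTF-8" here); bytes above 0x7F with no declared encoding give iconv nothing to translate from.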
>
>     But there is no way to convert (non-ASCII) strings in "C" encoding
>     to UTF-16LE, so the examples cannot be made to work on Windows.
>
>     When the translation is left to Windows, it assumes the non-UTF-16LE
>     strings are in the Active Code Page encoding (shown as "system
>     encoding"
>     in sessionInfo() in R, Latin-1 in your example) instead of the
>     current C
>     library encoding ("C" in your example). So, file names coming from
>     Windows will be either the bytes of their UTF-16LE representation
>     or the
>     bytes of their Latin-1 representation, but which one is subject to
>     the
>     implementation details, so the result is really unusable.
>
>     I would say using "C" as encoding in R is not a good idea, and
>     particularly not on Windows.
>
>     I would say that what happens with such file names in "C" encoding is
>     unspecified behavior, which is subject to change at any time without
>     notice, and that both the R 4.0.5 and R-devel behavior you are
>     observing
>     are acceptable. I don't think it should be mentioned in the NEWS.
>     Personally, I would prefer some stricter checks of string
>     validity and
>     perhaps disallowing the "C" encoding in R, so yet another behavior
>     where
>     it would be clearer that this cannot really work, but that would
>     require
>     more thought and effort.
>
>     Best
>     Tomas
>
>
>     On 4/27/21 9:53 PM, Toby Hocking wrote:
>
>     > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be
>     fixed in
>     > R-devel already. I checked on
>     > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and
>     there is no
>     > mention of these changes, so I'm wondering if they are
>     intentional? If so,
>     > could someone please add a mention of the bugfix in the NEWS?
>     >
>     > The problem involves file.exists, on windows, when a
>     long/strange input
>     > file name Encoding is unknown, in C locale. I expected that
>     FALSE should be
>     > returned (and it is on R-devel), but I got an error in R-4.0.5.
>     Code to
>     > reproduce is:
>     >
>     > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
>     > \360\237\247\222\360\237\217\274\n|
>     \360\237\247\222\360\237\217\275\n|
>     > \360\237\247\222\360\237\217\276\n|
>     \360\237\247\222\360\237\217\277\n"
>     > Encoding(x) <- "unknown"
>     > Sys.setlocale(locale="C")
>     > sessionInfo()
>     > file.exists(x)
>     >
>     > Output I got from R-4.0.5 was
>     >
>     >> sessionInfo()
>     > R version 4.0.5 (2021-03-31)
>     > Platform: x86_64-w64-mingw32/x64 (64-bit)
>     > Running under: Windows 10 x64 (build 19042)
>     >
>     > Matrix products: default
>     >
>     > locale:
>     > [1] C
>     > system code page: 1252
>     >
>     > attached base packages:
>     > [1] stats     graphics  grDevices utils     datasets methods   base
>     >
>     > loaded via a namespace (and not attached):
>     > [1] compiler_4.0.5
>     >> file.exists(x)
>     > Error in file.exists(x) : file name conversion problem -- name
>     too long?
>     > Execution halted
>     >
>     > Output I got from R-devel was
>     >
>     >> sessionInfo()
>     > R Under development (unstable) (2021-04-26 r80229)
>     > Platform: x86_64-w64-mingw32/x64 (64-bit)
>     > Running under: Windows 10 x64 (build 19042)
>     >
>     > Matrix products: default
>     >
>     > locale:
>     > [1] C
>     >
>     > attached base packages:
>     > [1] stats     graphics  grDevices utils     datasets methods   base
>     >
>     > loaded via a namespace (and not attached):
>     > [1] compiler_4.2.0
>     >> file.exists(x)
>     > [1] FALSE
>     >
>     > I also observed similar results when using normalizePath instead of
>     > file.exists (error in R-4.0.5, no error in R-devel).
>     >
>     >> normalizePath(x) #R-4.0.5
>     > Error in path.expand(path) : unable to translate 'p'
>     > | p'p;
>     > | p'p<
>     > | p'p=
>     > | p'p>
>     > | p'p<bf>
>     > ' to UTF-8
>     > Calls: normalizePath -> path.expand
>     > Execution halted
>     >
>     >> normalizePath(x) #R-devel
>     > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n|
>     > \360\237\247\222\360\237\217\273\n|
>     \360\237\247\222\360\237\217\274\n|
>     > \360\237\247\222\360\237\217\275\n|
>     \360\237\247\222\360\237\217\276\n|
>     > \360\237\247\222\360\237\217\277\n"
>     > Warning message:
>     > In normalizePath(path.expand(path), winslash, mustWork) :
>     path[1]="🧒
>     > | 🧒🏻
>     > | 🧒🏼
>     > | 🧒🏽
>     > | 🧒🏾
>     > | 🧒🏿
>     > ": The filename, directory name, or volume label syntax is incorrect
>     >
>
