[Rd] R-4.3 version list.files function could not work correctly in chinese
Tomas Kalibera
tomas.kalibera using gmail.com
Wed Aug 16 14:59:15 CEST 2023
On 8/16/23 13:11, yu gong wrote:
> A little more information on this issue.
> Searching the MS website today, I found a doc about "Maximum Path Length
> Limitation": Maximum Path Length Limitation - Win32 apps | Microsoft
> Learn
> <https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry> .
> According to the doc, two things are needed to avoid this issue on Windows
> 10 and later:
> 1 edit the registry or group policy to set
> [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
> "LongPathsEnabled"=dword:00000001
>
> 2 app manifest (R has already done it)
These settings are for long paths (meaning a full path consisting of
multiple elements separated by backslashes); more about that is also in
[1].
But the problem that Ivan reported (it is not clear whether it is the
same problem as the one originally reported on this thread) is about
the limit for a single file/directory name, that is, for a single
element of a path. Having long paths enabled in the registry
wouldn't help with this.
These two limits are not directly related, except for the obvious: by
choosing rather long names for individual files, one usually soon runs
into the limit for the full path.
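
To make the distinction concrete, here is a minimal sketch in R. The
helper path_lengths() is hypothetical (not part of R), and its default
limits are simply the numbers mentioned in this thread, so treat them as
assumptions rather than authoritative values. It reports the length of
the full path separately from the length of each element, the latter
both in characters and in UTF-8 bytes.

```r
# Hypothetical helper, not part of R. The defaults are assumptions taken
# from the numbers quoted in this thread (260 for the full path, 260 bytes
# for the file-name buffer of the narrow FindFirstFileA/FindNextFileA API).
path_lengths <- function(path, path_limit = 260L, buf_bytes = 260L) {
  path <- enc2utf8(path)
  # Treat both separators alike, then split into path elements.
  parts <- strsplit(chartr("\\", "/", path), "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  list(
    # Limit on the whole path (assumed to include the terminator).
    path_chars    = nchar(path, type = "chars"),
    path_too_long = nchar(path, type = "chars") >= path_limit,
    # Limit on a single element: characters and UTF-8 bytes can differ.
    name_chars     = nchar(parts, type = "chars"),
    name_bytes     = nchar(parts, type = "bytes"),
    # The byte buffer is assumed to also hold the terminating NUL.
    name_overflows = nchar(parts, type = "bytes") >= buf_bytes
  )
}

# Ivan's example name: 140 times a two-byte character, in a short path.
res <- path_lengths(paste0("C:\\tmp\\", strrep("\uf8", 140)))
res$path_too_long   # FALSE: the full path is well below the path limit
res$name_overflows  # TRUE only for the last element (280 bytes)
```

In this sketch the full path is short, so the long-path setting is
irrelevant, yet the last element alone does not fit a 260-byte buffer.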
Best
Tomas
[1] -
https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html
>
> Regards,
> yu
>
> ------------------------------------------------------------------------
> *From:* R-devel <r-devel-bounces using r-project.org> on behalf of Tomas
> Kalibera <tomas.kalibera using gmail.com>
> *Sent:* Wednesday, August 16, 2023 15:42
> *To:* Ivan Krylov <krylov.r00t using gmail.com>
> *Cc:* r-devel using r-project.org <r-devel using r-project.org>
> *Subject:* Re: [Rd] R-4.3 version list.files function could not work
> correctly in chinese
>
> On 8/15/23 16:00, Tomas Kalibera wrote:
> >
> > On 8/15/23 09:04, Ivan Krylov wrote:
> >> On Tue, 15 Aug 2023 08:38:11 +0200
> >> Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
> >>
> >>> As this was reported to be a regression in 4.3, it is entirely possible
> >>> this change came with a regression (though a bit surprising we didn't
> >>> catch it earlier by testing), so it would be a great help if I could
> >>> have the example and debug it.
> >> Sorry, let me try to be more clear.
> >>
> >> The Windows filename length limit is 255(?) wide characters. The
> >> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
> >> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
> >> takes more than one byte to be represented in UTF-8, it may overflow
> >> the 260 byte limit in the WIN32_FIND_DATAA structure despite being
> >> below the 260 wide character limit. When such an overflow happens,
> >> FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
> >> which results in R_readdir() returning NULL and makes list_files() stop
> >> before listing the rest of the directory.
> >>
> >> This is easier to make happen by accident with Chinese characters,
> >> because they take three UTF-8 bytes per character.
> >>
> >> Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
> >> Create a file with a name consisting of this symbol repeated 140 times.
> >> When you run list.files() on the resulting directory on Windows with a
> >> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
> >> 260-byte buffer, which doesn't work. I'm afraid the only way to avoid
> >> such a failure is to rewrite R_readdir using the wide character API and
> >> convert the file names on the fly. (Just like mingw readdir() did in
> >> the past?)
> >>
> >> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
> >> # any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
> >> # any number >260/2 should do
> >> file.create(strrep('\uf8', 140))
> >> list.files()
> >>
> >> Does this work? I don't have access to a UTF-8 Windows machine right
> >> now.
> >
> > Thanks, yes, I can reproduce the problem. Some Windows functions
> > impose a limit of 260 wide characters, but others a limit of 260 bytes,
> > so one can create a file with a name too long to be found by
> > FindNextFileA.
> >
> > In R 4.2, we used readdir() from mingw-w64, which itself used
> > findnext. That function had the same problem: it used a buffer of
> > 260 bytes, and judging from the code of mingw-w64 and the Windows
> > documentation, it should have behaved the same way and stopped
> > the search on such a long file name. However, in my use case, R 4.2.3
> > crashed inside findnext due to a stack overrun. R 4.1.3 worked, but
> > clearly it would require a different use case to overrun this buffer
> > there, as it didn't use UTF-8. This suggests that findnext didn't have a
> > check for this and hence caused memory corruption, which can lead to a
> > crash or appear to work by coincidence. That could have been the case
> > for the user reporting this as a regression compared to R 4.2. But it
> > is not a regression; the problem has existed for a long time.
> >
> > So, yes, we'd probably have to use wide variants of
> > FindNext/FindFirst. I'll fix.
>
> Fixed in R-devel (84960). Please let me know if you see any problem with
> the fix.
>
> Thanks,
> Tomas
>
> >
> > Thanks for debugging this,
> > Tomas
> >
> >
> >
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> <https://stat.ethz.ch/mailman/listinfo/r-devel>