[Rd] R-4.3 version list.files function could not work correctly in chinese

yu gong yugong @end|ng |rom out|ook@com
Wed Aug 16 13:11:39 CEST 2023


A little more information on this issue.
I searched the Microsoft website today and found the documentation on "Maximum Path Length Limitation": Maximum Path Length Limitation - Win32 apps | Microsoft Learn<https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=registry> .
According to the documentation, two things are needed to avoid this issue on Windows 10 and later:
1. Edit the registry or group policy to set [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem] "LongPathsEnabled"=dword:00000001

2. Opt in through the application manifest (R has already done this).

Regards,
yu

________________________________
From: R-devel <r-devel-bounces using r-project.org> on behalf of Tomas Kalibera <tomas.kalibera using gmail.com>
Sent: Wednesday, August 16, 2023 15:42
To: Ivan Krylov <krylov.r00t using gmail.com>
Cc: r-devel using r-project.org <r-devel using r-project.org>
Subject: Re: [Rd] R-4.3 version list.files function could not work correctly in chinese


On 8/15/23 16:00, Tomas Kalibera wrote:
>
> On 8/15/23 09:04, Ivan Krylov wrote:
>> On Tue, 15 Aug 2023 08:38:11 +0200
>> Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>>
>>> As this was reported to be regression in 4.3, it is entirely possible
>>> this change came with a regression (though a bit surprising we didn't
>>> catch it earlier by testing), so it would be a great help if I could
>>> have the example and debug it.
>> Sorry, let me try to be more clear.
>>
>> The Windows filename length limit is 255(?) wide characters. The
>> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
>> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
>> takes more than one byte to be represented in UTF-8, it may overflow
>> the 260 byte limit in the WIN32_FIND_DATAA structure despite being
>> below the 260 wide character limit. When such an overflow happens,
>> FindNextFile() returns FALSE with GetLastError() == ERROR_MORE_DATA,
>> which results in R_readdir() returning NULL and makes list_files() stop
>> before listing the rest of the directory.
>>
>> This is easier to make happen by accident with Chinese characters,
>> because they take three UTF-8 bytes per character.
>>
>> Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
>> Create a file with a name consisting of this symbol repeated 140 times.
>> When you run list.files() on the resulting directory on Windows with a
>> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
>> 260-byte buffer, which doesn't work. I'm afraid the only way to avoid
>> such a failure is to rewrite R_readdir using the wide character API and
>> convert the file names on the fly. (Just like mingw readdir() did in
>> the past?)
>>
>> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
>> # any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
>> # any number >260/2 should do
>> file.create(strrep('\uf8', 140))
>> list.files()
>>
>> Does this work? I don't have access to a UTF-8 Windows machine right
>> now.
>
> Thanks, yes, I can reproduce the problem. Some Windows functions
> impose a limit of 260 wide characters, but others a limit of 260
> bytes, so one can create a file with a name too long to be found by
> FindNextFileA.
>
> In R 4.2, we used readdir() from mingw-w64, which itself used
> findnext. That function had the same problem: it used a buffer of
> 260 bytes, and judging from the mingw-w64 code and the Windows
> documentation, it should have behaved the same way and stopped the
> search on such a long file name. However, in my use case, R 4.2.3
> crashed inside findnext due to a stack overrun. R 4.1.3 worked, but
> it clearly would require a different use case to overrun this buffer,
> as it didn't use UTF-8. This suggests that findnext didn't have a
> check for this and hence caused memory corruption, which can lead to
> a crash or work by coincidence. That could have been the case for
> the user reporting this as a regression compared to R 4.2. But it is
> not a regression; the problem has existed for a long time.
>
> So, yes, we'd probably have to use wide variants of
> FindNext/FindFirst. I'll fix.

Fixed in R-devel (84960). Please let me know if you see any problem with
the fix.

Thanks,
Tomas

>
> Thanks for debugging this,
> Tomas
>
>
>
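For readers following the byte arithmetic discussed above, here is a minimal portable C sketch. It is not the code used in R or in the fix; the helper names utf8_len_from_utf16 and fits_in_find_data are hypothetical, and it assumes that the cFileName buffer of WIN32_FIND_DATAA holds 260 bytes including the terminating NUL. It only counts how many UTF-8 bytes a UTF-16 file name needs and checks whether that fits.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed size in bytes of the cFileName buffer in WIN32_FIND_DATAA. */
#define FIND_DATA_NAME_BYTES 260

/* Hypothetical helper: number of UTF-8 bytes needed to encode n UTF-16
   code units, excluding the terminating NUL. A high surrogate followed
   by another unit is counted as one 4-byte sequence; the low surrogate
   is not validated. */
static size_t utf8_len_from_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;                 /* ASCII */
        } else if (c < 0x800) {
            bytes += 2;                 /* e.g. U+00F8 */
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n) {
            bytes += 4;                 /* surrogate pair */
            i++;                        /* skip the low surrogate */
        } else {
            bytes += 3;                 /* e.g. most Chinese characters */
        }
    }
    return bytes;
}

/* Hypothetical helper: 1 if the UTF-8 form of the name plus its NUL
   terminator fits in the assumed 260-byte buffer, 0 otherwise. */
static int fits_in_find_data(const uint16_t *s, size_t n)
{
    return utf8_len_from_utf16(s, n) + 1 <= FIND_DATA_NAME_BYTES;
}
```

Under these assumptions, a name made of U+00F8 repeated 140 times needs 280 bytes and does not fit, and a name of 87 characters that each take three bytes needs 261 bytes and does not fit either, although both are well below 260 wide characters.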

______________________________________________
R-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
