[Rd] R-4.3 version list.files function could not work correctly in chinese

Wed Aug 16 09:42:09 CEST 2023

On 8/15/23 16:00, Tomas Kalibera wrote:
>
> On 8/15/23 09:04, Ivan Krylov wrote:
>> On Tue, 15 Aug 2023 08:38:11 +0200
>> Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>>
>>> As this was reported to be a regression in 4.3, it is entirely possible
>>> this change came with a regression (though it is a bit surprising we
>>> didn't catch it earlier by testing), so it would be a great help if I
>>> could have the example and debug it.
>> Sorry, let me try to be more clear.
>>
>> The Windows filename length limit is 255(?) wide characters. The
>> WIN32_FIND_DATAA structure contains a 260-byte buffer for the filename
>> to be returned by FindFirstFileA()/FindNextFileA(). If a wide character
>> takes more than one byte to be represented in UTF-8, the name may
>> overflow the 260-byte buffer in the WIN32_FIND_DATAA structure despite
>> being below the wide character limit. When such an overflow happens,
>> FindNextFileA() returns FALSE with GetLastError() == ERROR_MORE_DATA,
>> which results in R_readdir() returning NULL and makes list_files() stop
>> before listing the rest of the directory.
>>
>> This is easier to make happen by accident with Chinese characters,
>> because they take three UTF-8 bytes per character.
>>
>> Take the ø (\uf8) letter. It takes two bytes to represent in UTF-8.
>> Create a file with a name consisting of this symbol repeated 140 times.
>> When you run list.files() on the resulting directory on Windows with a
>> UTF-8 locale, Windows tries to fit (0xc3 0xb8) times 140 into a
>> 260-byte buffer, which doesn't work. I'm afraid the only way to avoid
>> such a failure is to rewrite R_readdir using the wide character API and
>> convert the file names on the fly. (Just like mingw readdir() did in
>> the past?)
>>
>> stopifnot(.Platform$OS.type == 'windows', l10n_info()$`UTF-8`)
>> # any character for which nchar(enc2utf8(.), 'bytes') > 1 will do
>> # any number >260/2 should do
>> file.create(strrep('\uf8', 140))
>> list.files()
>>
>> Does this work? I don't have access to a UTF-8 Windows machine right
>> now.
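
The arithmetic behind the reproducer above can be sketched in portable C.
This is only an illustration, not code from R or mingw-w64: utf8_bytes()
and name_fits() are hypothetical helpers, and the sketch assumes the
260-byte cFileName buffer must also hold the terminating NUL byte.

```c
#include <stddef.h>

/* Size in bytes of the WIN32_FIND_DATAA file name buffer discussed above. */
#define FIND_DATA_NAME_BYTES 260

/* Number of bytes UTF-8 needs to encode one Unicode code point. */
static size_t utf8_bytes(unsigned long cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Hypothetical helper: returns 1 if a file name made of `count` copies of
   code point `cp` fits in the buffer, assuming one extra byte is needed
   for the terminating NUL, and 0 otherwise. */
static int name_fits(unsigned long cp, size_t count)
{
    return utf8_bytes(cp) * count + 1 <= FIND_DATA_NAME_BYTES;
}
```

Under these assumptions, name_fits(0xF8, 140) is 0, because 140 copies of
a two-byte character need 280 bytes, while 140 ASCII characters fit
easily. For a three-byte character such as U+4E2D, 87 copies (261 bytes)
are already too many.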
>
> Thanks, yes, I can reproduce the problem. Some Windows functions
> impose a limit of 260 wide characters, while others impose a limit of
> 260 bytes, so one can create a file with a name too long to be found by
> FindNextFileA.
>
> In R 4.2, we used readdir() from mingw-w64, which itself used
> findnext. That had the same problem: it used a buffer of 260 bytes,
> and judging from the mingw-w64 code and the Windows documentation, it
> should have behaved the same way and stopped the search on such a long
> file name. However, in my example, R 4.2.3 crashed inside findnext due
> to a stack overrun. R 4.1.3 worked, but it would clearly require a
> different example to overrun this buffer, as it didn't use UTF-8.
> This suggests that findnext didn't have a check for this and hence
> caused memory corruption, which can lead to a crash or appear to work
> by coincidence. That could have been the case for the user reporting
> this as a regression compared to R 4.2. But it is not a regression;
> the problem has existed for a long time.
>
> So, yes, we'd probably have to use the wide variants of
> FindFirstFile/FindNextFile. I'll fix it.

Fixed in R-devel (84960). Please let me know if you see any problem with 
the fix.

Thanks,
Tomas

>
> Thanks for debugging this,
> Tomas
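
To illustrate the wide-character approach discussed in this thread, here
is a minimal portable sketch of converting a UTF-16 file name to UTF-8
with a buffer sized for the worst case. It is not the code committed to
R-devel. utf16_to_utf8() and the two macros are hypothetical names, and
on Windows the conversion itself would more likely be done with
WideCharToMultiByte() and CP_UTF8. The sizing assumes the wide API
returns at most 260 UTF-16 code units, each of which expands to at most
3 UTF-8 bytes (a surrogate pair is 2 units and becomes 4 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed maximum number of UTF-16 code units in a returned file name. */
#define WIDE_NAME_UNITS 260
/* Worst case: 3 UTF-8 bytes per UTF-16 code unit, plus the NUL byte. */
#define UTF8_NAME_BYTES (3 * WIDE_NAME_UNITS + 1)

/* Convert a NUL-terminated UTF-16 string to NUL-terminated UTF-8.
   Returns the number of bytes written (excluding the NUL), or (size_t)-1
   if `out` is too small. Unpaired surrogates become U+FFFD. */
static size_t utf16_to_utf8(const uint16_t *in, char *out, size_t outsize)
{
    size_t n = 0;
    if (outsize == 0) return (size_t)-1;
    while (*in) {
        uint32_t cp = *in++;
        if (cp >= 0xD800 && cp <= 0xDBFF && *in >= 0xDC00 && *in <= 0xDFFF) {
            /* Combine a high and a low surrogate into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate */
        }
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need + 1 > outsize) return (size_t)-1;
        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

With a 260-byte output buffer, the 140-character name from the
reproducer is rejected, while a buffer of UTF8_NAME_BYTES holds it in
280 bytes.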