[Rd] R-4.3 version list.files function could not work correctly in chinese

Ivan Krylov
Sat Aug 12 17:33:15 CEST 2023


Dear Yihui,

Thanks a lot for your help!

Unfortunately, I was not able to reproduce this. I've tried creating
files with Chinese characters in their names and populating them
with valid UTF-8 and valid non-UTF-8 text, but R seems to be able to
list them all in my case.
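
A test setup of this kind can be sketched roughly as follows. This is
an illustration only: the temporary directory, the file names, and the
file contents are assumptions, not the exact files used in this test.

```r
# Hypothetical reproduction setup; directory and file names are illustrative.
d <- file.path(tempdir(), "\u6d4b\u8bd5\u6587\u4ef6")
dir.create(d, showWarnings = FALSE)

utf8_name  <- "\u6d4b\u8bd5\u4e2d\u6587-utf-8.txt"
other_name <- "\u6d4b\u8bd5\u4e2d\u6587-non-utf8.txt"

# The same four Chinese characters, once as UTF-8 bytes (12 bytes) ...
txt <- "\u6d4b\u8bd5\u4e2d\u6587"
writeBin(charToRaw(enc2utf8(txt)), file.path(d, utf8_name))

# ... and once as GBK bytes (8 bytes), which are not valid UTF-8.
gbk <- as.raw(c(0xB2, 0xE2, 0xCA, 0xD4, 0xD6, 0xD0, 0xCE, 0xC4))
writeBin(gbk, file.path(d, other_name))

# Both files should appear here regardless of their contents.
listed <- list.files(d)
print(listed)
```

If either file is missing from the output on an affected system, that
would already be a smaller reproducible example than the original
report.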

I'm running a US English evaluation ISO image of a slightly newer build
of Windows 10, and I also compiled R-4.3.1 from source, anticipating
having to single-step through the list.files() implementation:

sessionInfo()
# R version 4.3.1 (2023-06-16 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
# 
# Matrix products: default
# 
# 
# locale:
# [1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
# [3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
# [5] LC_TIME=English_United States.utf8
# 
# time zone: America/Los_Angeles
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.3.1
dir("测试文件")
# [1] "测试中文-non-utf8-ЪЪЪЪЪ.txt" "测试中文-utf-8.txt"         
system('cmd /c dir /s *.txt')
#  Volume in drive C has no label.
#  Volume Serial Number is A85A-AA74
# 
#  Directory of C:\R\R-4.3.1\bin\x64\????
# 
# 08/12/2023  07:57 AM                22 ????-non-utf8-?????.txt
# 08/12/2023  07:56 AM                18 ????-utf-8.txt
#                2 File(s)             40 bytes
# 
#      Total Files Listed:
#                2 File(s)             40 bytes
#                0 Dir(s)  29,538,418,688 bytes free
# [1] 0

(The OEM codepage cannot represent the characters I used in the file
names, but all the files are present in both lists.)

To find out what's wrong, it will be necessary to download the R
source code and compile it [*], install gdb using pacman (part of
Rtools), then set a breakpoint on the list_files function from
src/main/platform.c and step through it [**], paying attention to the
R_readdir calls. Do the missing file names not even come out of
FindNextFile()? Are they somehow skipped around the time of the regex
match?

(I could help with the details of this, maybe off-list, if there's
interest.)
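
Before reaching for gdb, a rough R-level check might help narrow down
which of the two questions applies. The sketch below uses a
hypothetical helper, compare_listings() (not part of R), and the
default pattern is only an assumption about which files are expected to
match. It compares list.files() with and without a pattern against
names obtained through grepl() and Sys.glob():

```r
# Hypothetical helper (not part of R): cross-check list.files() against
# other ways of obtaining the same names.
compare_listings <- function(path, pattern = "\\.txt$") {
  all_names <- list.files(path)                      # no regex involved
  filtered  <- list.files(path, pattern = pattern)   # regex applied inside list.files()
  via_grepl <- all_names[grepl(pattern, all_names)]  # regex applied separately via grepl()
  via_glob  <- basename(Sys.glob(file.path(path, "*")))  # separate globbing code
  list(
    # names the glob sees but the plain directory listing does not
    missing_from_readdir = setdiff(via_glob, all_names),
    # names that match under grepl() but vanish when list.files() filters
    dropped_by_pattern   = setdiff(via_grepl, filtered)
  )
}

str(compare_listings("."))
```

Two empty results would not prove anything, but a non-empty
dropped_by_pattern would point towards the regex step, and a non-empty
missing_from_readdir towards the directory enumeration itself.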

Unless Tomas Kalibera is able to deduce the root cause from the
observed symptoms, someone who can reproduce the problem will have to
investigate further.

-- 
Best regards,
Ivan

[*] https://cran.r-project.org/bin/windows/base/howto-R-devel.html

[**] https://beej.us/guide/bggdb/


