[Rd] readlines() truncates text file with Codepage 437 encoding
Martin Maechler
maechler at stat.math.ethz.ch
Wed Jun 8 10:50:49 CEST 2016
Appended is the file. You need to tell your e-mail software to use
one of the MIME types that R-devel does accept; I chose text/plain.
((Yes, as R mailing list server "operator", with a bit of detective work,
I was able to find the "uncleaned" e-mail and extract the
attachment from it))
Martin Maechler
ETH Zurich
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 437__characters.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20160608/d0b88325/attachment.txt>
-------------- next part --------------
>>>>> Adam Obeng <adam.obeng at columbia.edu>
>>>>> on Mon, 6 Jun 2016 11:11:21 +0100 writes:
> Hello r-devel, The attached Code page 437-encoded file
> contains 245 characters (including the final newline), but
> readLines only reads 242 of them:
>> test_text <- readLines(file('437__characters.txt', encoding='437'))
> Warning message:
> In readLines(file("437__characters.txt", :
>   incomplete final line found on '437__characters.txt'
>> test_text
> [1]
> "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037
> !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177
> ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????"
>> nchar(test_text)
> [1] 242
> You'll note that readLines hasn't read the final
> characters "??\n".
> # Diagnostics
> My best guess is that this is something to do with how
> readLines() determines when it has reached EOF, because of
> the following:
> - The file is terminated with an ASCII LF (0x0a), but R
>   gives an 'incomplete final line found' warning. Note that
>   in some implementations of Code page 437, 0x0a is
>   interpreted as a graphical character rather than a control
>   character, but this does not seem to be the problem here.
>   The same problem occurs if the file ends with 0x0d or 0x0d
>   0x0a.
> - Adding seven or more characters to the end of the
>   file makes it read correctly.
> - Similarly, the file is read correctly if you remove three
>   characters from anywhere in the file.
> - The same issue seems to occur when reading files encoded
>   in other DOS code pages.
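One way to sidestep the end-of-file question raised in the diagnostics above is to read the raw bytes and convert them afterwards, so that no re-encoding connection is involved. The following is only a minimal sketch under stated assumptions: the three-byte sample file is hypothetical (it is not the attached file), and "CP437" is an assumed encoding name whose availability depends on the platform's iconv, hence the `iconvlist()` guard. It does not address the underlying readLines() behaviour.

```r
# Hypothetical three-byte sample: "A", 0x81 (u-umlaut in Code page 437), LF.
sample_bytes <- as.raw(c(0x41, 0x81, 0x0a))
path <- tempfile(fileext = ".txt")
writeBin(sample_bytes, path)

# Read exactly as many bytes as the file holds; no line or EOF detection
# happens at this stage.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))

# "CP437" is an assumption: supported encoding names vary by platform,
# so only convert when iconvlist() reports it.
if ("CP437" %in% toupper(iconvlist())) {
  txt <- iconv(rawToChar(raw_bytes), from = "CP437", to = "UTF-8")
  # Split into lines only after the conversion to UTF-8.
  text_lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  print(nchar(text_lines))
}
```

Note that `rawToChar()` fails on embedded nul bytes, so this approach assumes the file contains none, which holds for the hexdump in this report.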
> # Additional information
>> sessionInfo()
> R version 3.2.3 (2015-12-10)
> Platform: x86_64-apple-darwin14.5.0 (64-bit)
> Running under: OS X 10.10.5 (Yosemite)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> The same behaviour occurs under R 2.15.1 on a Linux
> server.
> In case the attached file is somehow corrupted, here is a
> hexdump:
> 00000000: 0b0c 0e0f 1011 1213 1415 1617 1819 1a1b  ................
> 00000010: 1c1d 1e1f 2021 2223 2425 2627 2829 2a2b  .... !"#$%&'()*+
> 00000020: 2c2d 2e2f 3031 3233 3435 3637 3839 3a3b  ,-./0123456789:;
> 00000030: 3c3d 3e3f 4041 4243 4445 4647 4849 4a4b  <=>?@ABCDEFGHIJK
> 00000040: 4c4d 4e4f 5051 5253 5455 5657 5859 5a5b  LMNOPQRSTUVWXYZ[
> 00000050: 5c5d 5e5f 6061 6263 6465 6667 6869 6a6b  \]^_`abcdefghijk
> 00000060: 6c6d 6e6f 7071 7273 7475 7677 7879 7a7b  lmnopqrstuvwxyz{
> 00000070: 7c7d 7e7f ffad 9b9c 9da6 aeaa f8f1 fde6  |}~.............
> 00000080: faa7 afac aba8 8e8f 9280 90a5 999a e185  ................
> 00000090: a083 8486 9187 8a82 8889 8da1 8c8b a495  ................
> 000000a0: a293 94f6 97a3 9681 989f e2e9 e4e8 eae0  ................
> 000000b0: ebee e3e5 e7ed fc9e f9fb ecef f7f0 f3f2  ................
> 000000c0: a9f4 f5c4 b3da bfc0 d9c3 b4c2 c1c5 cdba  ................
> 000000d0: d5d6 c9b8 b7bb d4d3 c8be bdbc c6c7 ccb5  ................
> 000000e0: b6b9 d1d2 cbcf d0ca d8d7 cedf dcdb ddde  ................
> 000000f0: b0b1 b2fe 0a                             .....
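For anyone who wants to reproduce the report without the attachment, the hexdump above can be turned back into a file. This is a minimal sketch: the hex digits are copied from the dump, while the variable names and the use of a temporary directory are illustrative choices, not part of the original report.

```r
# Hex digits copied row by row from the hexdump (spaces removed).
hex <- paste0(
  "0b0c0e0f101112131415161718191a1b",
  "1c1d1e1f202122232425262728292a2b",
  "2c2d2e2f303132333435363738393a3b",
  "3c3d3e3f404142434445464748494a4b",
  "4c4d4e4f505152535455565758595a5b",
  "5c5d5e5f606162636465666768696a6b",
  "6c6d6e6f707172737475767778797a7b",
  "7c7d7e7fffad9b9c9da6aeaaf8f1fde6",
  "faa7afacaba88e8f928090a5999ae185",
  "a083848691878a8288898da18c8ba495",
  "a29394f697a39681989fe2e9e4e8eae0",
  "ebeee3e5e7edfc9ef9fbeceff7f0f3f2",
  "a9f4f5c4b3dabfc0d9c3b4c2c1c5cdba",
  "d5d6c9b8b7bbd4d3c8bebdbcc6c7ccb5",
  "b6b9d1d2cbcfd0cad8d7cedfdcdbddde",
  "b0b1b2fe0a"
)

# Convert each pair of hex digits into one raw byte.
starts <- seq(1, nchar(hex), by = 2)
bytes <- as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))

# Write the bytes unchanged to a file in the session's temporary directory.
path <- file.path(tempdir(), "437__characters.txt")
writeBin(bytes, path)

file.size(path)  # 245, matching the size stated in the report
```

The `readLines(file(path, encoding='437'))` call from the report can then be run against `path` to check whether the truncation shows up on a given platform and R version.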
> Has anyone encountered something similar?
> Kind regards,
> Adam Obeng