[R] Potential R bug in identical

Ivan Krylov kry|ov@r00t @end|ng |rom gm@||@com
Thu Jan 17 21:40:32 CET 2019


On Thu, 17 Jan 2019 14:55:18 +0000
Layik Hama <L.Hama using leeds.ac.uk> wrote:

> There seems to be some weird and unidentifiable (to me) characters in
> front of the `Accidents_Index` column name there causing the length
> to be 17 rather than 14 characters.

Repeating the reproduction steps described at the linked pull request,

$ curl -o acc2017.zip
http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
$ unzip acc2017.zip
$ head -n 1 Acc.csv | hd | head -n 2
00000000  ef bb bf 41 63 63 69 64  65 6e 74 5f 49 6e 64 65  |...Accident_Inde|
00000010  78 2c 4c 6f 63 61 74 69  6f 6e 5f 45 61 73 74 69  |x,Location_Easti|

The document begins with a U+FEFF BYTE ORDER MARK, encoded in UTF-8.
Not sure which encoding R chooses on Windows by default, but
explicitly passing encoding="UTF-8" (or is it fileEncoding?) might
help decode it as such. (Sorry, cannot test my advice on Windows right
now.)

-- 
Best regards,
Ivan



More information about the R-help mailing list