[R] Strange characters that block import

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Oct 14 17:02:38 CEST 2009


On Wed, 14 Oct 2009, Duncan Murdoch wrote:

> On 10/14/2009 8:25 AM, arnaud Mosnier wrote:
>> Dear useRs,
>> 
>> I am trying to import a text file that contains some strange characters
>> coming from the misinterpretation of foreign-language characters by
>> another program (see below).
>> 
>> ----------------------------------------
>> Here is an example of text with a line containing characters that break
>> the import
>>
>> name;number
>> zdsfbg;2
>> ^Z^Z ^Z^Z;3
>> dtryjh;4
>> 
>> ----------------------------------------
>> 
>> R does not import the lines after those strange characters (i.e. it
>> imports only the first two lines: the header and the first line of
>> data).
>> 
>> I have already tried importing with other encodings such as latin1 or
>> UTF-8, but that does not solve the problem.

If these are control characters (that is, if ^Z is Ctrl-Z; we have no 
real information) then they are the same in every encoding that uses 
bytes (or at least in those known to iconv).
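
One way to get that real information is to look at the raw bytes rather
than at what a mail client or editor displays. What follows is only a
minimal sketch: the helper name control_bytes is hypothetical (it is not
from this thread), and the temporary file is made up to mimic the
example in the original posting.

```r
# Hypothetical helper (not from the thread): tabulate the control bytes
# found in a file, ignoring tab, LF and CR.
control_bytes <- function(path) {
  # Read the whole file as raw bytes, so no encoding or EOF handling applies
  bytes <- readBin(path, what = "raw", n = file.size(path))
  codes <- as.integer(bytes)
  # Control characters are 0-31 and 127; skip tab (9), LF (10) and CR (13)
  is_ctrl <- (codes < 32L | codes == 127L) & !(codes %in% c(9L, 10L, 13L))
  table(sprintf("0x%02X", codes[is_ctrl]))
}

# Made-up file resembling the posted example: four Ctrl-Z (octal 032) bytes
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;number", "zdsfbg;2", "\032\032 \032\032;3", "dtryjh;4"), tmp)
found <- control_bytes(tmp)
print(found)
unlink(tmp)
```

A count under 0x1A would confirm Ctrl-Z; other codes (0x04, 0x00) would
point to a different culprit.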

>> Replacing those characters in a text editor before importing solves the
>> problem, but I do not want the users of my script to have to edit the
>> text before the analysis in R.
>> 
>> Any hints?
>
> Those funny characters are octal 032, Ctrl-Z.  Years ago that was defined on 
> DOS/Windows as an end of file marker, and I guess our code still honours 
> that.

More to the point, the Windows C run-time does (AFAIK Ctrl-Z is still 
current as EOF under Windows, and Wikipedia says so too). But nothing 
in the original posting mentioned that this was on Windows, and Ctrl-Z 
has no effect on the two other OSes I tried, both of which read such a 
file successfully.

So without a single piece of the 'at a minimum' information requested 
in the posting guide, we are guessing (and I am guessing your example 
was done under Windows, too).

> You can work around it by stating that you're reading from a binary file, not 
> a text file:
>
> f <- file("text.txt", "rb")
>
> Then read.csv2(f) fails, but readLines(f) succeeds, so this works:
>
>> f <- file("c:/temp/test.txt", "rb")
>> read.csv2(textConnection(readLines(f)))
>               name number
> 1            zdsfbg      2
> 2 \032\032 \032\032      3
> 3            dtryjh      4
>
>> close(f)
>
> I don't know if there are any characters that would cause readLines to fail, 
> but there might be, so I'd suggest replacing the buggy software that caused 
> all the problems in the first place.

This is all a function of the OS's C runtime: I suspect Ctrl-D (eot) 
is interpreted as end-of-file on some OSes.  Nul (\0) will terminate 
strings (that's standard in C, and enforced in recent versions of R).
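
For a script whose users should not have to edit the file by hand, the
binary-read workaround quoted above could be wrapped up roughly as
follows. This is a sketch under assumptions, not a tested recipe for
every file: the helper name read_csv2_binary and its strip_ctrl_z
argument are hypothetical, the demonstration file is made up, and
embedded nuls would still need separate handling.

```r
# Hypothetical helper (not from the thread): read a semicolon-separated
# file through a binary-mode connection, so that the C runtime's text-mode
# handling of Ctrl-Z as end-of-file does not apply.
read_csv2_binary <- function(path, strip_ctrl_z = FALSE) {
  con <- file(path, "rb")
  on.exit(close(con))
  # readLines accepts LF, CRLF or CR line endings whatever the open mode
  lines <- readLines(con, warn = FALSE)
  if (strip_ctrl_z) {
    # Optionally drop the Ctrl-Z (octal 032) bytes before parsing
    lines <- gsub("\032", "", lines, fixed = TRUE, useBytes = TRUE)
  }
  read.csv2(text = lines, stringsAsFactors = FALSE)
}

# Made-up file resembling the posted example
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;number", "zdsfbg;2", "\032\032 \032\032;3", "dtryjh;4"), tmp)
dat <- read_csv2_binary(tmp)
print(nrow(dat))
unlink(tmp)
```

By default the Ctrl-Z bytes are kept in the data, as in the quoted
output; strip_ctrl_z = TRUE removes them, which is a choice about the
data and not something the thread prescribes.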

> Duncan Murdoch

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



