[R-sig-Debian] invalid multibyte string at '<a0>'

Paul Johnson pauljohn32 at gmail.com
Tue Aug 16 19:53:46 CEST 2011


On Tue, Aug 16, 2011 at 9:30 AM, Matthieu Stigler
<matthieu.stigler at gmail.com> wrote:
> Thanks a lot Anne!
>
> I'm somehow glad to see the problem is confirmed... indeed one can
> manipulate the file, but at best I would have wished to have a more direct
> "R solution"... (there are quite many files).
>
> Anyone would have other suggestion?

I suggest you clean up the most obvious sources of trouble first: get
rid of spaces in the variable names, and put quotation marks around the
text values in the dataset where spaces are present.
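
A rough sketch of that cleanup in R, in case it helps. The column names
and the two data rows below are made up for illustration (I am guessing
at what the original header looked like), so adapt them to the real
files:

```r
# Hypothetical header as it might appear in the raw file (an assumption).
raw_names <- c("Location", "Time Period", "Low birth weight newborns (%)")

# make.names() turns each invalid character into a dot.
safe_names <- make.names(raw_names, unique = TRUE)

# Optional tidy-up: drop trailing dots, then use underscores instead of dots.
short_names <- gsub("\\.+$", "", safe_names)
short_names <- gsub("\\.+", "_", short_names)

# Made-up rows, only to show the quoting behaviour of write.csv().
dat <- data.frame(a = c("Antigua and Barbuda", "Albania"),
                  b = c(2006, 2009),
                  c = c(5, 7),
                  stringsAsFactors = FALSE)
names(dat) <- short_names

# write.csv() quotes character columns and the header by default,
# so text values containing spaces stay in one field.
out <- tempfile(fileext = ".csv")
write.csv(dat, out, row.names = FALSE)
written <- readLines(out)
```

Reading the rewritten file back with read.csv(out) should then give
names that R does not have to mangle.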

I don't think the problem in that file is line endings. It is
multibyte characters. The output from "file" cited earlier is just
wrong, I think. There are definitely other characters in there.

I fiddled around with that file for a while. If you open it in Emacs,
you see lots of trouble signs. The last character on many of the lines
shows as a long underscore. It is not a regular underscore, it is a
multi-byte character.  If you replace that character by NA, then you
can open the file with read.table("prob.csv", header=T,sep=",").
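
Since there are many files, the same replacement can probably be done
inside R instead of in an editor. The sketch below assumes the odd
character is the single byte 0xA0 (the one named in the error message)
and that the files are Latin-1; it builds a tiny stand-in file rather
than using the real prob.csv, so treat it as a starting point only:

```r
# Stand-in for prob.csv: Latin-1 bytes, with a lone 0xA0 as the last field
# of one row. The contents are invented for illustration.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Location,X,Time.Period,Value\nAfghanistan,,2004,"),
           as.raw(0xa0),
           charToRaw("\nAlgeria,,2006,6\n")),
         path)

# Read the lines as bytes and mark them as Latin-1 (no conversion yet).
raw_lines <- readLines(path, encoding = "latin1")

# Re-encode to UTF-8 so the byte 0xA0 becomes the character U+00A0.
utf8_lines <- iconv(raw_lines, from = "latin1", to = "UTF-8")

# Replace that character by NA, as was done by hand in Emacs.
clean_lines <- gsub("\u00a0", "NA", utf8_lines, fixed = TRUE)

# Parse the cleaned text without touching the original file.
dat <- read.csv(text = clean_lines, header = TRUE)
```

Wrapping the last four steps in a function and calling it with lapply()
over list.files() would cover a whole directory, provided all files have
the same problem.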

So that makes me wonder what is special about your Windows system that
it will overlook these problems.

If you change that last symbol to NA, you will get quite a different
response from R.

Note that R still can't make sense of your variable names, but it is
no longer an outright error:

> read.csv("prob2.csv", header=T, sep=",")
                                     Location  X Time.Period
1                                 Afghanistan NA        2004
2                                     Albania NA        2009
3                                     Algeria NA        2006
4                                      Angola NA        2001
5                         Antigua and Barbuda NA        2006
6                                   Argentina NA        2008
7                                     Armenia NA        2007
8                                   Australia NA        2000
9                                     Austria NA        2006
10                                 Azerbaijan NA        2006
11                                    Bahamas NA        2006
12                                 Bangladesh NA        2007
13                                   Barbados NA        2006


I don't think this is an R-sig-Debian question. This is a general R
question. I'm certain that if you can get Brian Ripley's attention in
r-help, you are going to get the best information. He did the encoding
work in R 2.0.

I'm pretty sure his first thought will be "did you read the manual?"

?read.table


fileEncoding: character string: if non-empty declares the encoding used
          on a file (not a connection) so the character data can be
          re-encoded.  See the ‘Encoding’ section of the help for
          ‘file’, the ‘R Data Import/Export Manual’ and ‘Note’.

encoding: encoding to be assumed for input strings.  It is used to mark
          character strings as known to be in Latin-1 or UTF-8 (see
          ‘Encoding’): it is not used to re-encode the input, but
          allows R to handle encoded strings in their native encoding
          (if one of those two).  See ‘Value’.

But if you demonstrate effort to understand ?Encoding, I expect he
will help you.


read.table("prob.csv", header=T, sep=",", fileEncoding="latin1")
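
The difference between the two arguments in the help text above is
"re-encode" (fileEncoding) versus "only mark" (encoding). A small
illustration on a single string, not on the real file; the string is
an arbitrary example and the variable names are mine:

```r
# "caf" followed by the Latin-1 byte 0xE9 (e with acute accent).
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# As read, R knows nothing about the encoding, and the bytes are not
# valid UTF-8.
enc_before <- Encoding(x)
valid_before <- validUTF8(x)

# Marking (what encoding= does): the bytes are left untouched.
marked <- x
Encoding(marked) <- "latin1"
same_bytes <- identical(charToRaw(marked), charToRaw(x))

# Re-encoding (what fileEncoding= does, via iconv): the bytes change.
reencoded <- iconv(x, from = "latin1", to = "UTF-8")
n_bytes <- length(charToRaw(reencoded))
valid_after <- validUTF8(reencoded)
```

In a UTF-8 locale, marking alone leaves bytes in memory that are not
valid UTF-8, which is why fileEncoding is the argument I tried here.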

I tried several variations. I expect you will find that the file itself
is flawed, and that the encoding is also obscuring the problem.
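
One way to check whether the rows themselves are ragged, separately
from the encoding question, is count.fields(). The file below is an
invented plain-ASCII example with one short row; I have not verified
that this is what is wrong with prob.csv, and on the real file the
stray bytes may need to be dealt with first:

```r
# Invented example: the second data row is missing its last field.
path <- tempfile(fileext = ".csv")
writeLines(c("Location,X,Time.Period,Value",
             "Afghanistan,,2004",
             "Algeria,,2006,6"),
           path)

# Number of comma-separated fields on each line of the file.
n_fields <- count.fields(path, sep = ",", quote = "\"")

# Lines whose field count differs from the header line.
bad_lines <- which(n_fields != n_fields[1])
```

Here bad_lines points at line 2 of the file, which is the row to
inspect by hand.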

Here's a splat:

> dat <- read.table("prob.csv", header=T, sep=",", fileEncoding="latin1")
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  number of items read is not a multiple of the number of columns
>  head(dat)
             Location  X Time.Period Low.birth.weight.newborns....
1         Afghanistan NA        2004
2             Albania NA        2009
3             Algeria NA        2006                             6
4              Angola NA        2001
5 Antigua and Barbuda NA        2006                             5
6           Argentina NA        2008                             7
>


-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas


