[R] Multibyte characters in (row) names

Richard R. Liu richard.liu at pueo-owl.ch
Tue Aug 3 15:56:08 CEST 2010


David, 


Thanks.  It turns out that, once you've set up your locales properly, it's
almost impossible to create an example for the problem.


I'm working with scientific text that contains a fair number of symbols:
degrees, plus-or-minus, etc.  When I read the text in, I specified UTF-8.  My
locale is UTF-8.  Everything would have been all right had wordStem, the stemmer
in Rstem, processed UTF-8 properly.  In fact, though, it did not.  It apparently
broke up the bytes in the multibyte encoding of these symbols, so I usually ended
up with \xc2 or \xc3.
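
To illustrate why a stray \xc2 appears, here is a minimal sketch in R.  It
does not call wordStem; it only assumes that a byte-oriented routine might
cut a string between the two bytes of a symbol.  The variable names and the
sample string "20°" are made up for illustration.

```r
# The degree sign (U+00B0) and the plus-or-minus sign (U+00B1) each take
# two bytes in UTF-8, and both begin with the lead byte 0xc2.
deg <- enc2utf8("\u00b0")
pm  <- enc2utf8("\u00b1")
charToRaw(deg)   # c2 b0
charToRaw(pm)    # c2 b1

# Hypothetical illustration: drop the last byte of "20°", as a routine
# that treats every byte as one character might do.
bytes     <- charToRaw(enc2utf8("20\u00b0"))
truncated <- rawToChar(bytes[-length(bytes)])

# The result ends in a lone 0xc2 lead byte and is no longer valid UTF-8.
validUTF8(truncated)   # FALSE
```

validUTF8 requires R 3.3.0 or later.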


So the problem is clear, and I will circumvent it by removing symbols before
stemming.
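
One possible way to remove the symbols, sketched under the assumption that
dropping every non-ASCII character is acceptable for the text at hand.  The
sample sentence and variable names are hypothetical, and the exact handling
of unconvertible bytes by iconv can vary between platforms.

```r
# Hypothetical input containing the degree and plus-or-minus signs.
text <- enc2utf8("incubated at 25 \u00b0C \u00b1 0.5")

# Convert to ASCII; sub = "" asks iconv to drop bytes it cannot convert.
ascii_only <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")

# Split on whitespace and discard empty tokens.
tokens <- strsplit(ascii_only, "[[:space:]]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# The ASCII-only tokens could then be handed to the stemmer.
tokens
```

If some symbols carry meaning, they could instead be replaced with words
(for example "deg") by gsub before this step.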


Regards,
Richard

On August 2, 2010 at 7:13 PM David Winsemius <dwinsemius at comcast.net> wrote:

>
> On Aug 2, 2010, at 12:56 PM, Richard R. Liu wrote:
>
> > I have an array with names which contain multibyte characters.  When I try to
> > write the array to a file using write.table and row.names = T I receive an error
> > message when the first such name is encountered, saying that I have not
> > specified the option to generate NA instead.  I really would be satisfied if the
> > row name in the file were exactly what is displayed when I print the array on
> > the console, e.g., "en.\xc2".  The only way I have found to avoid this is to create
> > a new array containing in one column a deparse of the original row name and in
> > the other the value.  This "solution" is ugly; "en.\xc2" becomes
> > "\"en.\\xc2\"".
> >
>
> > Is there a more straightforward way of dealing with multibyte characters?
>
> Do you want to provide a worked example that produces the error? I am
> not getting such an error:
>
>  > mtx <-  matrix(1, nrow=1)
>  > rownames(mtx) <- "en.\xc2"
>  > mtx
>          [,1]
> en.\xc2    1
>  > write.table(mtx, file="test.txt")
>
> What I see in that file is
>
> "V1"
> "en.¬" 1
>
> (The character following the period is a logical negation symbol (or 
> an IBM keyboard carriage return) on my display.)
> --
> David Winsemius, MD
> West Hartford, CT
>


Richard R. Liu
richard.liu at pueo-owl.ch


