[R] Russian language in R

Duncan Murdoch murdoch.duncan at gmail.com
Mon May 16 14:41:26 CEST 2011


On 16/05/2011 8:33 AM, Lyolya wrote:
> Dear Duncan,
>
> Thank you very much for your reply!
>
> I have tried what you have suggested. R was definitely assuming a different
> text encoding, and after trying the l10n_info() command, I got the
> following:
>
> l10n_info()
> $MBCS
> [1] TRUE
>
> $`UTF-8`
> [1] TRUE
>
> $`Latin-1`
> [1] FALSE
>
> My data is a dataframe (stored both in .xls and .dbf files) that represents
> the secondary housing market for Moscow for a given period of time. The
> problem is that the factors are given by Russian strings (those like general
> condition of the dwelling and the material the house is built of), and R
> does not read them correctly. This makes the analysis really complicated.
>
> In order to read the file, I do the following:
>
> require(foreign)
> MSL_1010 <- read.dbf("MSL_1010.dbf") # I tried both as.is=TRUE and FALSE
>
> and then when it comes to strings it reads something like: \x96\x80\x8e.

I'm not familiar with Russian encodings.  If you know what encoding is
used in the file, you may be able to use iconv() to convert it to UTF-8,
which l10n_info() reports as the native encoding on your system.  To
simplify things, use

MSL_1010 <- read.dbf("MSL_1010.dbf", as.is = TRUE)

so that you don't have to worry about factors and factor names.  Then try

iconv(x, from="KOI8-R", to="UTF-8")

where x is one of the character vectors with bad characters.  If that 
doesn't work, try a different possible encoding (e.g. cp1251).
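
As a rough sketch of that trial-and-error approach (not a tested recipe
for this particular file): the code below takes the three bytes quoted
above, \x96\x80\x8e, and decodes them under a few candidate encodings so
the results can be compared by eye. The candidate list is an assumption;
CP866 (the DOS Cyrillic code page) is included only because older .dbf
files often use it. convert_columns() is a hypothetical helper, not part
of the foreign package, and the toy data frame stands in for the result
of read.dbf(..., as.is = TRUE).

```r
# The bytes reported in the message, rebuilt as a raw string.
x <- rawToChar(as.raw(c(0x96, 0x80, 0x8e)))

# Candidate source encodings (an assumed, non-exhaustive list).
candidates <- c("KOI8-R", "CP1251", "CP866")

# Decode the same bytes under each candidate; iconv() returns NA when
# the bytes are not valid in the given encoding.
decoded <- vapply(
  candidates,
  function(enc) iconv(x, from = enc, to = "UTF-8"),
  character(1)
)
print(decoded)

# Hypothetical helper: convert every character column of a data frame
# from one encoding to another, leaving other columns untouched.
convert_columns <- function(df, from, to = "UTF-8") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = to)
  df
}

# Toy stand-in for the data frame returned by read.dbf(..., as.is = TRUE).
toy <- data.frame(district = x, price = 100, stringsAsFactors = FALSE)
fixed <- convert_columns(toy, from = "CP866")
```

All three conversions succeed here, so a missing NA does not identify
the right encoding; the output has to be read. Under CP866 the three
bytes decode to the Cyrillic letters ЦАО, which would fit data on Moscow
housing (it is the usual abbreviation for the Central Administrative
Okrug), while the other two candidates give symbols or unrelated letters.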

Duncan Murdoch

>
> On 14 May 2011 01:08, Duncan Murdoch<murdoch.duncan at gmail.com>  wrote:
>
> >  On 13/05/2011 4:57 PM, lyolya wrote:
> >
> >>  Hello,
> >>
> >>  I am experiencing a problem in reading a database in Russian. The problem
> >>  appears when it comes to char variables. I have already tried changing the
> >>  encoding, i.e.
> >>
> >>  options(encoding="UTF-8")
> >>
> >>  and
> >>
> >>  options(encoding="KOI8-R")
> >>
> >>  but every time there appears to be something unreadable in the data frame,
> >>  like \x82\xa2\xae\xef etc.
> >>
> >>  Could you please answer whether it is possible to operate with Russian
> >>  strings in R, and, if yes, how to get to do that. Thank you, in advance.
> >>
> >
> >  Yes, it is possible.  You can test it using a text editor that supports
> >  Russian.  Just put
> >
> >  x <- " some Russian text "
> >
> >  into the file, then use source() to read the file.  Two outcomes are
> >  likely:
> >
> >  x will be defined to be a string holding Russian text, and it will display
> >  properly.
> >
> >  OR
> >
> >  it will be defined to be a string with lots of escapes or mis-displayed
> >  characters in it.  In the latter case, the problem is that R is assuming a
> >  different encoding than your text editor.  The l10n_info() function will
> >  display information about what R is expecting.
> >
> >  If none of the above helps you to get your code working, then you'll have
> >  to give details on exactly what you're doing to read the file, and exactly
> >  what is in the file.
> >
> >  Duncan Murdoch
> >
>
>
>


