[R] Russian language in R
Duncan Murdoch
murdoch.duncan at gmail.com
Mon May 16 14:41:26 CEST 2011
On 16/05/2011 8:33 AM, Lyolya wrote:
> Dear Duncan,
>
> Thank you very much for your reply!
>
> I have tried what you have suggested. R was definitely assuming a different
> text encoding, and after trying the l10n_info() command, I got the
> following:
>
> l10n_info()
> $MBCS
> [1] TRUE
>
> $`UTF-8`
> [1] TRUE
>
> $`Latin-1`
> [1] FALSE
>
> My data is a dataframe (stored both in .xls and .dbf files) that represents
> the secondary housing market for Moscow for a given period of time. The
> problem is that the factors are given by Russian strings (those like general
> condition of the dwelling and the material the house is built of), and R
> does not read them correctly. This makes the analysis really complicated.
>
> In order to read the file, I do the following:
>
> require(foreign)
> MSL_1010<- read.dbf("MSL_1010.dbf") # I tried both as.is=TRUE and FALSE
>
> and then when it comes to strings it reads something like: \x96\x80\x8e.
I'm not familiar with Russian encodings. If you know which encoding the
file uses, you may be able to use iconv() to convert it to UTF-8, which
l10n_info() reports as the native encoding on your system. To simplify
things, read the file with

    MSL_1010 <- read.dbf("MSL_1010.dbf", as.is = TRUE)

so that strings stay as character vectors and you don't have to worry
about factors and factor levels. Then try

    iconv(x, from = "KOI8-R", to = "UTF-8")

where x is one of the character vectors with bad characters. If that
doesn't work, try another candidate encoding (e.g. "CP1251").
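
The trial-and-error approach above can be sketched roughly as follows. This
is only an illustration, not something tested on the actual file: it uses the
three bytes quoted in the message ("\x96\x80\x8e") as sample input, the
candidate list is a guess, and "CP866" (the DOS Cyrillic code page) is an
added assumption that does not come from this thread. The helper
convert_columns() is hypothetical. Encoding names accepted by iconv() vary by
platform, so check iconvlist() if a name is rejected.

```r
# Sample input: the raw bytes quoted in the message, "\x96\x80\x8e".
x <- rawToChar(as.raw(c(0x96, 0x80, 0x8e)))

# Candidate source encodings. KOI8-R and CP1251 are the ones mentioned in
# the thread; CP866 is an assumption added for illustration.
candidates <- c("KOI8-R", "CP1251", "CP866")

# Decode the same bytes under each candidate. A wrong guess usually gives
# nonsense characters (or NA if the bytes are invalid in that encoding),
# so the results have to be judged by eye.
decoded <- vapply(
  candidates,
  function(enc) iconv(x, from = enc, to = "UTF-8"),
  character(1)
)
print(decoded)

# Hypothetical helper: once an encoding looks right, convert every
# character column of a data frame, leaving other columns untouched.
convert_columns <- function(df, from, to = "UTF-8") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = to)
  df
}

# Small stand-in for the data frame returned by read.dbf(..., as.is = TRUE).
demo <- data.frame(district = x, price = 100, stringsAsFactors = FALSE)
demo_utf8 <- convert_columns(demo, from = "CP866")
```

For these particular bytes, CP866 yields the three capital letters "ЦАО",
which at least looks like plausible Russian text, whereas the other two
candidates give symbols. That is a hint about the file's encoding, not proof.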
Duncan Murdoch
>
> On 14 May 2011 01:08, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
> > On 13/05/2011 4:57 PM, lyolya wrote:
> >
> >> Hello,
> >>
> >> I am experiencing a problem in reading a database in Russian. The problem
> >> appears when it comes to char variables. I have already tried changing the
> >> encoding, i.e.
> >>
> >> options(encoding="UTF-8")
> >>
> >> and
> >>
> >> options(encoding="KOI8-R")
> >>
> >> but every time there appears to be something unreadable in the data frame,
> >> like \x82\xa2\xae\xef etc.
> >>
> >> Could you please answer whether it is possible to operate with Russian
> >> strings in R, and, if yes, how to get to do that. Thank you, in advance.
> >>
> >
> > Yes, it is possible. You can test it using a text editor that supports
> > Russian. Just put
> >
> > x<- " some Russian text "
> >
> > into the file, then use source() to read the file. Two outcomes are
> > likely:
> >
> > x will be defined to be a string holding Russian text, and it will display
> > properly.
> >
> > OR
> >
> > it will be defined to be a string with lots of escapes or mis-displayed
> > characters in it. In the latter case, the problem is that R is assuming a
> > different encoding than your text editor. The l10n_info() function will
> > display information about what R is expecting.
> >
> > If none of the above helps you to get your code working, then you'll have
> > to give details on exactly what you're doing to read the file, and exactly
> > what is in the file.
> >
> > Duncan Murdoch
> >
>
>
>