[R] Encoding problem - I fails to read Hebrew text from online
Matt Shotwell
shotwelm at musc.edu
Thu Dec 9 23:00:15 CET 2010
Tal,
OK, let me clarify my understanding. The original and decoded files are
both text, encoded in UTF-8. In the original file, there are HTML
`entities' (numeric character references) that stand for Hebrew
characters. In the decoded file, the entities have been converted to the
UTF-8 characters themselves. The question is how to convert these
entities within R. This is not the same as converting between character
encodings; otherwise iconv() might offer a solution.
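For what it's worth, here is a minimal sketch of how decimal entities
could be decoded in base R. The function name decode_decimal_entities
and the regular expression are my own assumptions, not an existing R
function, and it only handles decimal references of the form &#NNNN;
(no hex or named entities).

```r
# Hypothetical helper (not part of base R or any package): replace decimal
# references such as "&#1513;" with the characters they stand for.
decode_decimal_entities <- function(x) {
  # Locate every "&#<digits>;" reference in each element of x.
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    # Drop the "&#" and ";" delimiters, keeping only the code point number.
    code_points <- as.integer(gsub("[^0-9]", "", refs))
    # Convert each code point to a one-character UTF-8 string.
    vapply(code_points, intToUtf8, character(1))
  })
  x
}

encoded <- "<suggestion data=\"&#1513;&#1500;&#1493;&#1501;\"/>"
decoded <- decode_decimal_entities(encoded)
```

Strings without any entities pass through unchanged, so it should be
safe to apply to every line returned by readLines().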
I'll have a look around to find a solution, and I hope others will too.
My first idea is to check RCurl, XML, and the related utils::URLdecode.
If there really is no existing solution, I think it might be worthwhile
to look at how PHP and Python do it (and maybe borrow some code :) ).
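A quick check of the utils::URLdecode idea, as a sketch only: it decodes
the percent-escapes used in the query part of the URL, but a decimal
entity contains no "%" escapes, so I would expect it to come back
unchanged. The variable names are just for illustration, and marking the
result as UTF-8 is an assumption about how the decoded bytes should be
interpreted.

```r
# The percent-encoded query from the original URL (UTF-8 bytes of the word).
query <- "%d7%a9%d7%9c%d7%95%d7%9d"
word <- utils::URLdecode(query)
# Declare the decoded bytes to be UTF-8, in case the encoding is left unknown.
Encoding(word) <- "UTF-8"
code_points <- utf8ToInt(word)   # 1513 1500 1493 1501

# A decimal HTML entity has no "%" escapes, so URLdecode leaves it alone.
entity <- "&#1513;"
unchanged <- all(utils::URLdecode(entity) == entity)
```

So URLdecode covers the request side (the query string), while the
entities in the response still need a separate decoding step.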
-Matt
On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:
> Hi Matt,
> Thanks for having a look at this.
> I just spent some time looking around and couldn't find any R function
> to decode decimal HTML code.
>
>
> Do you (or someone else on the list) know how to program this sort of
> thing? (Is there a formula for the translation?)
>
>
>
>
> p.s:
> For it to work on my end I added the encoding parameter:
> readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
> encoding= "UTF-8")
>
>
> p.p.s: The Hebrew word I used means "peace"
>
>
> Cheers,
> Tal
>
>
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew)
> | www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
>
>
>
>
> On Thu, Dec 9, 2010 at 8:38 PM, Matt Shotwell <shotwelm at musc.edu>
> wrote:
> Tal,
>
> It looks like the data you received has HTML special hex characters.
> That is, '&#1513;' is just an ASCII HTML representation of a hex
> character. It's not encoded in a special manner.
>
> The trick is to substitute the HTML encoded hex character for its
> binary representation, or "decode" the character. I don't know of any
> R function that does this, but there are web services, for example:
> http://www.hashemian.com/tools/html-url-encode-decode.php
>
> I decoded your file using this service and posted it on my website.
> You can see the difference by running:
>
> readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)
>
> readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)
>
> The second should display the Hebrew characters correctly (it does in
> my terminal). The next thing to think about is how to automate this in
> R without using the web service... We may need to write an HTMLDecode
> function if there isn't one already.
>
> By the way, what's the Hebrew text in English?
>
> Best,
> Matt
>
>
>
> On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
> > I am bumping this question in the hopes that someone might be able to
> > advise.
> > This Hebrew and R business is not as smooth as I had hoped...
> >
> > Thanks,
> > Tal
> >
> > Older message:
> >
> > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili <tal.galili at gmail.com> wrote:
> >
> > > Hello all,
> > >
> > > # I am trying to read the text in this URL:
> > > u <- "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
> > > # By using this command:
> > > readLines(u)
> > >
> > > And no matter what variation I tried, I keep getting this output:
> > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
> > > data=\"&#1513;&#1500;&#1493;&#1501;\"/>< (etc...)
> > >
> >
> >
> > > Instead of this output:
> > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion
> > > data="שלום"/><num_queries int="16800000"/></CompleteSuggestion>
> > > <CompleteSuggestion><suggestion data="שלום חנוך"/><num_queries
> > > int="232000"/></CompleteSuggestion>
> > > <CompleteSuggestion><suggestion data="שלום עליכם"/
> > > (etc....)
> > >
> > >
> >
> > > I tried:
> > > readLines(u, encoding= "latin1")
> > > readLines(u, encoding= "UTF-8")
> > > And also changing Sys.setlocale:
> > > Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
> > > Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
> > >
> > > Are there any more options I could try to get this text properly
> > > encoded?
> > >
> > > Thanks!
> > > Tal
> > >
> > >
> > >
> > >
> > >
> > >
> >
>
> >
>
> --
> Matthew S. Shotwell
> Graduate Student
> Division of Biostatistics and Epidemiology
> Medical University of South Carolina
>
>
>
--
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina