[R] Encoding problem - I fails to read Hebrew text from online

Matt Shotwell shotwelm at musc.edu
Thu Dec 9 23:00:15 CET 2010


Tal, 

OK, let me clarify my understanding. Both the original and the decoded
file are UTF-8 encoded text. In the original file, the Hebrew characters
appear as HTML numeric character references (`entities') such as
'&#x5E9;', which give each character's Unicode code point in hexadecimal.
In the decoded file, those entities have been replaced by the actual
UTF-8 characters. The question is how to do that replacement within R.
It's not the same as converting between character encodings; otherwise
iconv() might offer a solution.
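
To illustrate with one entity from your output: a numeric reference like
'&#x5E9;' just carries a Unicode code point written in hexadecimal, so
decoding it amounts to parsing that number and asking R for the matching
character. A minimal sketch in base R (the variable names are only for
illustration):

```r
# One entity taken from the output in the original post
ref  <- "&#x5E9;"

# Strip the "&#x" prefix and the ";" suffix, leaving the hex digits
hex  <- sub("^&#x([0-9A-Fa-f]+);$", "\\1", ref)

# Parse the hex digits as an integer code point
code <- strtoi(hex, base = 16L)

# Build the corresponding character as a UTF-8 string (U+05E9)
ch   <- intToUtf8(code)
```

Whether `ch` then displays as a Hebrew letter still depends on the
console and locale.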

I'll have a look around to find a solution, and I hope others will too.
My first idea is to check RCurl, XML, and the related utils::URLdecode.
If there really is no existing solution, I think it might be worthwhile
to look at how PHP and Python do it (and maybe borrow some code :) ).
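
As a starting point, here is a rough sketch of what such a decoder might
look like in base R. The function name html_decode_numeric is
hypothetical (it is not part of base R or of any package I know of), and
it assumes only numeric references (hex '&#x...;' or decimal '&#...;'),
not named entities such as '&amp;'.

```r
# Hypothetical helper: replace numeric HTML character references
# ("&#x5E9;" or "&#1513;") in a character vector with the characters
# they stand for. Named entities such as "&amp;" are left untouched.
html_decode_numeric <- function(x) {
  pattern <- "&#([xX][0-9A-Fa-f]+|[0-9]+);"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    vapply(refs, function(ref) {
      body <- substr(ref, 3L, nchar(ref) - 1L)   # drop "&#" and ";"
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2L), base = 16L)  # hexadecimal reference
      } else {
        strtoi(body, base = 10L)                 # decimal reference
      }
      intToUtf8(code)                            # code point -> UTF-8 string
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- html_decode_numeric("data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"")
```

Applied to the suggestion string from Tal's output, this should leave the
surrounding text alone and turn the four references into the four Hebrew
letters of the query.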

-Matt


On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:
> Hi Matt,
> Thanks for having a look at this.
> I just spent some time looking around and couldn't find any R function
> to decode numeric HTML character codes.
> 
> 
> Do you (or someone else on the list) know how to program this sort of
> thing? (Is there a formula for the translation?)
> 
> 
> 
> 
> p.s:
> For it to work on my end I added the encoding parameter:
> readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
> encoding= "UTF-8")
> 
> 
> p.p.s: The Hebrew word I used means "peace" 
> 
> 
> Cheers,
> Tal
> 
> 
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com |  972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew)
> | www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
> 
> 
> 
> 
> On Thu, Dec 9, 2010 at 8:38 PM, Matt Shotwell <shotwelm at musc.edu>
> wrote:
>         Tal,
>         
>         It looks like the data you received contains HTML hexadecimal
>         character references. That is, '&#x5E9;' is just an ASCII HTML
>         representation of the character with hexadecimal code point
>         5E9. It's not encoded in a special manner.
>         
>         The trick is to substitute each HTML-encoded reference with the
>         character it stands for, that is, to "decode" it. I don't know
>         of any R function that does this, but there are web services,
>         for example:
>         http://www.hashemian.com/tools/html-url-encode-decode.php
>         
>         I decoded your file using this service and posted it on my
>         website. You can see the difference by running:
>         
>         readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)
>         
>         readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)
>         
>         The second should display the Hebrew characters correctly (it
>         does in my terminal). The next thing to think about is how to
>         automate this in R without using the web service... We may need
>         to write an HTMLDecode function if there isn't one already.
>         
>         By the way, what's the Hebrew text in English?
>         
>         Best,
>         Matt
>         
>         
>         
>         On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
>         > I am bumping this question in the hopes that someone might be
>         > able to advise.
>         > This Hebrew and R business is not as smooth as I had hoped...
>         >
>         > Thanks,
>         > Tal
>         >
>         > Older message:
>         >
>         > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili <tal.galili at gmail.com> wrote:
>         >
>         > > Hello all,
>         > >
>         > > # I am trying to read the text in this URL:
>         > > u <- "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
>         > > # By using this command:
>         > > readLines(u)
>         > >
>         > > And no matter what variation I tried, I keep getting this output:
>         > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
>         > > data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"/><   (etc...)
>         > >
>         >
>         >
>         > > Instead of this output:
>         > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion data="שלום"/><num_queries int="16800000"/></CompleteSuggestion>
>         > > <CompleteSuggestion><suggestion data="שלום חנוך"/><num_queries int="232000"/></CompleteSuggestion>
>         > > <CompleteSuggestion><suggestion data="שלום עליכם"/
>         > > (etc....)
>         > >
>         > >
>         >
>         > > I tried:
>         > >   readLines(u, encoding= "latin1")
>         > >   readLines(u, encoding= "UTF-8")
>         > > And also changing Sys.setlocale:
>         > >   Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
>         > >   Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
>         > >
>         > > Are there any more options I could try to get this text properly
>         > > encoded?
>         > >
>         > > Thanks!
>         > > Tal
>         > >
>         > >
>         > >
>         > >
>         > >
>         > >
>         >
>         
>         >
>         
>         --
>         Matthew S. Shotwell
>         Graduate Student
>         Division of Biostatistics and Epidemiology
>         Medical University of South Carolina
>         
> 
> 

-- 
Matthew S. Shotwell
Graduate Student 
Division of Biostatistics and Epidemiology
Medical University of South Carolina


