[R] reading a character translation table into R

Michael Friendly friendly at yorku.ca
Sat Jun 8 22:31:57 CEST 2013


I have a txt file (attached) that defines equivalents among characters
in latin1 (ISO-8859-1), numeric &#xxx; codes, HTML entities,
and LaTeX equivalents.  A portion of the file is shown inline below, but
may not be rendered well in this email.

I'd like to read this into R to use as a character translation table, 
but am stuck on two things:
- The 5 fields in the file are column-aligned and separated by 2 or
more whitespace characters.
In Perl this is trivial to read and parse via something like
         my @entries = split /\n/, $charTable;
         foreach (@entries) {
                 # split each line on runs of 2+ whitespace characters
                 my ($desc, $char, $code, $html, $tex) = split /\s\s+/;
         }
AFAIK, the only function for reading such data is utils::read.fwf, but
it requires me to specify the field widths.
I don't know of any function that accepts even a simple regex like this
as a sep= argument.

- The TeX field contains many backslashed codes that need to be escaped
in R. Is it necessary to manually edit the file to change
'\pounds' --> '\\pounds', '\S' --> '\\S', etc., or is there something
like a raw input mode that would do this where necessary?
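
On the backslash question, a minimal sketch may help. It assumes the
table is read from a file with readLines() rather than typed into R
source code; in that case the backslashes arrive as ordinary characters,
and doubling is needed only inside string literals in R code. The
temporary file and the single \pounds entry are stand-ins for the real
file.

```r
# Write one TeX entry to a temporary file. The doubled backslash is needed
# only here, because this is an R string literal in source code.
tmp <- tempfile(fileext = ".txt")
writeLines("\\pounds", tmp)

# Read it back: the file holds a single backslash, and so does the string.
tex <- readLines(tmp)

nchar(tex)                 # 7: one backslash plus the six letters of "pounds"
print(tex)                 # print() shows the escaped form, "\\pounds"
cat(tex, "\n")             # cat() shows the string as stored, \pounds
substr(tex, 1, 1) == "\\"  # TRUE: the first character is a single backslash

unlink(tmp)
```

The doubled backslash that print() shows is only a display convention;
cat() or writeLines() shows what is actually stored.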

Description                          Char Code      HTML        TeX
double quote                         "    "    "
ampersand                            &    &    &amp        \&
apostrophe                           '    '    '
less than                            <    <    <        $<$
greater than                         >    >    >        $>$
non-breaking space                   .                ~
inverted exclamation                 ¡    ¡    ¡     !'
cent sign                            ¢    ¢    ¢
pound sterling                       £    £    £     \pounds
general currency sign                ¤    ¤    ¤
yen sign                             ¥    ¥    ¥
broken vertical bar                  ¦    ¦    ¦
section sign                         §    §    §      \S
umlaut (dieresis)                    ¨    ¨    ¨       \"{}
copyright                            ©    ©    ©      \copyright
feminine ordinal                     ª    ª    ª      $^a$
left angle quote, guillemotleft      «    «    «     \guillemotleft
not sign                             ¬    ¬    ¬
soft hyphen                          ­    ­    ­
registered trademark                 ®    ®    ®       \textregistered
macron accent                        ¯    ¯    ¯
degree sign                          °    °    °       $^o$
plus or minus                        ±    ±    ±    $\pm$
superscript two                      ²    ²    &sup2;      $^2$
superscript three                    ³    ³    &sup3;      $^3$
acute accent                         ´    ´    ´     \'{}
micro sign                           µ    µ    µ     $\mu$
paragraph sign                       ¶    ¶    ¶      \P
middle dot                           ·    ·    ·    $\cdot$
cedilla                              ¸    ¸    ¸     \c{}
superscript one                      ¹    ¹    &sup1;      $^1$
masculine ordinal                    º    º    º      $^o$
right angle quote, guillemotright    »    »    »     \guillemotright
fraction one-fourth                  ¼    ¼    &frac14;    $\frac14$
fraction one-half                    ½    ½    &frac12;    $\frac12$
fraction three-fourths               ¾    ¾    &frac34;    $\frac34$
inverted question mark               ¿    ¿    ¿    ?'
capital A, grave accent              À    À    À    \`A
capital A, acute accent              Á    Á    Á    \'A
capital A, circumflex accent         Â    Â    Â     \^A
capital A, tilde                     Ã    Ã    Ã    \~A
capital A, dieresis or umlaut mark   Ä    Ä    Ä      \"A
capital A, ring                      Å    Å    Å     \AA
capital AE diphthong (ligature)      Æ    Æ    Æ     \AE
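
The Perl split above can be approximated in base R with readLines() and
strsplit(). The following is only a sketch under stated assumptions: the
three sample lines are hypothetical stand-ins for the attached file (the
Code and HTML values in particular are guesses at its contents), and the
file name in the comment is made up.

```r
# Parse column-aligned text whose fields are separated by 2+ whitespace
# characters. The sample lines are hypothetical stand-ins for the file;
# the real data would come from something like
#   lines <- readLines("chartable.txt", encoding = "latin1")
lines <- c(
  "Description                          Char Code      HTML        TeX",
  "double quote                         \"    &#34;     &quot;",
  "ampersand                            &    &#38;     &amp;       \\&",
  "less than                            <    &#60;     &lt;        $<$"
)

col_names <- c("Description", "Char", "Code", "HTML", "TeX")

# Drop the header: "Char Code" is separated by a single space there,
# so the header would not split into five fields.
body <- lines[-1]

# Split each line on runs of two or more whitespace characters.
fields <- strsplit(trimws(body), "\\s{2,}", perl = TRUE)

# Force every row to exactly five fields: short rows (for example, a
# missing TeX entry) are padded with NA; longer rows would be truncated.
fields <- lapply(fields, function(x) { length(x) <- 5; x })

char_table <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(char_table) <- col_names

print(char_table)
```

Single spaces inside a field, as in "double quote", survive because only
runs of two or more whitespace characters act as separators. A row whose
middle fields are blank, such as the non-breaking space entry, would
still need special handling.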

-- 
Michael Friendly     Email: friendly at yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    http://datavis.ca
Toronto, ONT  M3J 1P3 CANADA

-------------- next part --------------
Description                          Char Code      HTML        TeX
double quote                         "    "    " 
ampersand                            &    &    &amp        \&     
apostrophe                           '    '    '
less than                            <    <    <        $<$
greater than                         >    >    >        $>$  
non-breaking space                   .                ~
inverted exclamation                 ¡    ¡    ¡     !'
cent sign                            ¢    ¢    ¢
pound sterling                       £    £    £     \pounds
general currency sign                ¤    ¤    ¤
yen sign                             ¥    ¥    ¥
broken vertical bar                  ¦    ¦    ¦
section sign                         §    §    §      \S
umlaut (dieresis)                    ¨    ¨    ¨       \"{}
copyright                            ©    ©    ©      \copyright
feminine ordinal                     ª    ª    ª      $^a$
left angle quote, guillemotleft      «    «    «     \guillemotleft
not sign                             ¬    ¬    ¬
soft hyphen                          ­    ­    ­
registered trademark                 ®    ®    ®       \textregistered
macron accent                        ¯    ¯    ¯
degree sign                          °    °    °       $^o$
plus or minus                        ±    ±    ±    $\pm$
superscript two                      ²    ²    &sup2;      $^2$
superscript three                    ³    ³    &sup3;      $^3$
acute accent                         ´    ´    ´     \'{}
micro sign                           µ    µ    µ     $\mu$
paragraph sign                       ¶    ¶    ¶      \P
middle dot                           ·    ·    ·    $\cdot$
cedilla                              ¸    ¸    ¸     \c{}
superscript one                      ¹    ¹    &sup1;      $^1$
masculine ordinal                    º    º    º      $^o$
right angle quote, guillemotright    »    »    »     \guillemotright
fraction one-fourth                  ¼    ¼    &frac14;    $\frac14$
fraction one-half                    ½    ½    &frac12;    $\frac12$
fraction three-fourths               ¾    ¾    &frac34;    $\frac34$
inverted question mark               ¿    ¿    ¿    ?'
capital A, grave accent              À    À    À    \`A
capital A, acute accent              Á    Á    Á    \'A
capital A, circumflex accent         Â    Â    Â     \^A
capital A, tilde                     Ã    Ã    Ã    \~A
capital A, dieresis or umlaut mark   Ä    Ä    Ä      \"A
capital A, ring                      Å    Å    Å     \AA
capital AE diphthong (ligature)      Æ    Æ    Æ     \AE
capital C, cedilla                   Ç    Ç    Ç    \c{C}
capital E, grave accent              È    È    È    \`E
capital E, acute accent              É    É    É    \'E
capital E, circumflex accent         Ê    Ê    Ê     \^E
capital E, dieresis or umlaut mark   Ë    Ë    Ë      \"E
capital I, grave accent              Ì    Ì    Ì    \`I
capital I, acute accent              Í    Í    Í    \'I
capital I, circumflex accent         Î    Î    Î     \^I
capital I, dieresis or umlaut mark   Ï    Ï    Ï      \"I
capital Eth, Icelandic               Ð    Ð    Ð
capital N, tilde                     Ñ    Ñ    Ñ    \~N
capital O, grave accent              Ò    Ò    Ò    \`O
capital O, acute accent              Ó    Ó    Ó    \'O
capital O, circumflex accent         Ô    Ô    Ô     \^O
capital O, tilde                     Õ    Õ    Õ    \~O
capital O, dieresis or umlaut mark   Ö    Ö    Ö      \"O
multiply sign                        ×    ×    ×     $\times$
capital O, slash                     Ø    Ø    Ø    {\O}
capital U, grave accent              Ù    Ù    Ù    \`U
capital U, acute accent              Ú    Ú    Ú    \'U
capital U, circumflex accent         Û    Û    Û     \^U
capital U, dieresis or umlaut mark   Ü    Ü    Ü      \"U
capital Y, acute accent              Ý    Ý    Ý    \'Y
capital THORN, Icelandic             Þ    Þ    Þ     \TH
small sharp s, German (sz ligature)  ß    ß    ß     \ss
small a, grave accent                à    à    à    \`a
small a, acute accent                á    á    á    \'a
small a, circumflex accent           â    â    â     \^a
small a, tilde                       ã    ã    ã    \~a
small a, dieresis or umlaut mark     ä    ä    ä      \"a
small a, ring                        å    å    å     \aa
small ae diphthong (ligature)        æ    æ    æ     \ae
small c, cedilla                     ç    ç    ç    \c{c}
small e, grave accent                è    è    è    \`e
small e, acute accent                é    é    é    \'e
small e, circumflex accent           ê    ê    ê     \^e
small e, dieresis or umlaut mark     ë    ë    ë      \"e
small i, grave accent                ì    ì    ì    \`i
small i, acute accent                í    í    í    \'i
small i, circumflex accent           î    î    î     \^i
small i, dieresis or umlaut mark     ï    ï    ï      \"i
small eth, Icelandic                 ð    ð    ð
small n, tilde                       ñ    ñ    ñ    \~n
small o, grave accent                ò    ò    ò    \`o
small o, acute accent                ó    ó    ó    \'o
small o, circumflex accent           ô    ô    ô     \^o
small o, tilde                       õ    õ    õ    \~o
small o, dieresis or umlaut mark     ö    ö    ö      \"o
division sign                        ÷    ÷    ÷    $\div$
small o, slash                       ø    ø    ø    {\o}
small u, grave accent                ù    ù    ù    \`u
small u, acute accent                ú    ú    ú    \'u
small u, circumflex accent           û    û    û     \^u
small u, dieresis or umlaut mark     ü    ü    ü      \"u
small y, acute accent                ý    ý    ý    \'y
small thorn, Icelandic               þ    þ    þ     \th
small y, dieresis or umlaut mark     ÿ    ÿ    ÿ      \"y
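
Once the table is in R, using it as a translation table could look
roughly like the sketch below. The three-entry map is a hypothetical
excerpt of the Char and TeX columns, and to_tex() is an illustrative
helper, not an existing function; it handles a single string and looks
characters up one at a time.

```r
# Hypothetical excerpt of the table: characters (as \u escapes, to keep
# this code ASCII-only) mapped to their TeX equivalents.
tex_map <- setNames(
  c("\\pounds", "\\'e", "\\\"u"),
  c("\u00a3", "\u00e9", "\u00fc")
)

# Replace each character of x that appears in the map; leave others as-is.
to_tex <- function(x, map) {
  chars <- strsplit(x, "")[[1]]          # split the string into characters
  hit <- match(chars, names(map))        # position in the map, or NA
  chars[!is.na(hit)] <- map[hit[!is.na(hit)]]
  paste(chars, collapse = "")
}

out <- to_tex("caf\u00e9 \u00a35", tex_map)
cat(out, "\n")
```

Looking characters up with match() avoids regular-expression
replacement, where backslashes in the TeX codes would need further
escaping.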

