[R] reading a character translation table into R

Duncan Murdoch murdoch.duncan at gmail.com
Sat Jun 8 22:51:33 CEST 2013


On 13-06-08 4:31 PM, Michael Friendly wrote:
> I have a txt file (attached) that defines equivalents among characters
> in latin1 (or iso-8859-1), numeric &#xxx; codes, HTML entities
> and latex equivalents.  A portion of the file is shown inline below, but
> may not be rendered well in this email.
>
> I'd like to read this into R to use as a character translation table,
> but am stuck on two things:
> - The 5 fields in the file are column-aligned and are separated by 2+
> white space characters.
> In perl this is trivial to read and parse via something like
>           @entries = split("\n", $charTable);
>           foreach (@entries) {
>                   ($desc, $char, $code, $html, $tex) = split(/\s\s+/);
>           }
> AFAIK, the only function for reading such data is utils::read.fwf, but I
> have to specify the field widths.
> I don't know of any function that allows even a simple regex like this
> as a sep= argument.

I see two ways to do this: work out the column widths and use
read.fwf(), or read whole lines and use sub() to extract the columns.  The
latter is pretty close to the spirit of the Perl method, e.g.

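lines <- readLines( filename )
# five groups of non-space characters, each followed by optional whitespace
regex <- paste(rep("([^[:space:]]*)[[:space:]]*", 5), collapse="")
desc <- sub(regex, "\\1", lines)
char <- sub(regex, "\\2", lines)
code <- sub(regex, "\\3", lines)
html <- sub(regex, "\\4", lines)
tex  <- sub(regex, "\\5", lines)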

(Actually, this doesn't work, because the desc field contains embedded 
spaces; I don't think the Perl would work either.  But if you can work 
out the regexp to match the first field, or just extract it using 
substr(), you're good.)
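
One way to cope with the embedded spaces, sketched here under the
assumption that fields are always separated by at least two whitespace
characters and that descriptions contain only single spaces, is to split
each line with strsplit().  The sample lines and the split_fields() helper
are hypothetical stand-ins for the real file, not taken from it.

```r
# Hypothetical sample lines standing in for readLines(filename).
# Note the doubled backslash: that is only R source syntax for one backslash.
lines <- c(
  "ampersand                            &    &#038;    &amp;       \\&",
  "less than                            <    &#060;    &lt;        $<$",
  "apostrophe                           '    &#039;    &apos;"
)

# Split each line on runs of two or more whitespace characters.
split_fields <- function(lines, n = 5) {
  parts <- strsplit(trimws(lines), "[[:space:]]{2,}")
  # Pad short rows with NA (or truncate long ones) so each has n fields.
  parts <- lapply(parts, function(p) { length(p) <- n; p })
  tab <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(tab) <- c("desc", "char", "code", "html", "tex")
  tab
}

tab <- split_fields(lines)
print(tab)
```

The padding assumes that any missing fields are the trailing ones.  A row
with an empty field in the middle would be misaligned, and for such rows
substr() or read.fwf() with fixed positions is the safer route.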

Duncan Murdoch

>
> - The TeX field contains many backslashed codes that need to be escaped
> in R. Is it necessary
> to manually edit the file to change '\pounds' --> '\\pounds', '\S' -->
> '\\S', etc. or is there something
> like raw mode input that would do this where necessary?
>
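
On the backslash question: doubling is only needed in string literals
typed into R source code.  Text read from a file with readLines() is taken
as-is, so the file should not need editing.  A minimal sketch, using a
temporary file and a made-up line rather than the real table:

```r
# Write a line containing one literal backslash to a temporary file.
# The backslash is doubled here only because this is R source code.
tmp <- tempfile(fileext = ".txt")
writeLines("pound sterling    \\pounds", tmp)

x <- readLines(tmp)
unlink(tmp)

# Take everything after the last run of two or more whitespace characters.
tex <- sub("^.*[[:space:]]{2,}", "", x)

print(tex)      # print() displays the backslash in escaped form
cat(tex, "\n")  # cat() displays the string's actual contents
nchar(tex)      # the backslash counts as a single character
```

The escaped form shown by print() is only a display convention; the string
itself holds a single backslash.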
> Description                         Char    Code      HTML        TeX
> double quote                         "    " "
> ampersand                            &    & &amp        \&
> apostrophe                           '    ' '
> less than                            <    < <        $<$
> greater than                         >    > >        $>$
> non-breaking space                   .             ~
> inverted exclamation                 ¡    ¡ ¡     !'
> cent sign                            ¢    ¢ ¢
> pound sterling                       £    £ £     \pounds
> general currency sign                ¤    ¤ ¤
> yen sign                             ¥    ¥ ¥
> broken vertical bar                  ¦    ¦ ¦
> section sign                         §    § §      \S
> umlaut (dieresis)                    ¨    ¨ ¨       \"{}
> copyright                            ©    © ©      \copyright
> feminine ordinal                     ª    ª ª      $^a$
> left angle quote, guillemotleft      «    « «     \guillemotleft
> not sign                             ¬    ¬ ¬
> soft hyphen                          ­    ­ ­
> registered trademark                 ®    ® ®       \textregistered
> macron accent                        ¯    ¯ ¯
> degree sign                          °    ° °       $^o$
> plus or minus                        ±    ± ±    $\pm$
> superscript two                      ²    ² &sup2;      $^2$
> superscript three                    ³    ³ &sup3;      $^3$
> acute accent                         ´    ´ ´     \'{}
> micro sign                           µ    µ µ     $\mu$
> paragraph sign                       ¶    ¶ ¶      \P
> middle dot                           ·    · ·    $\cdot$
> cedilla                              ¸    ¸ ¸     \c{}
> superscript one                      ¹    ¹ &sup1;      $^1$
> masculine ordinal                    º    º º      $^o$
> right angle quote, guillemotright    »    » »     \guillemotright
> fraction one-fourth                  ¼    ¼ &frac14;    $\frac14$
> fraction one-half                    ½    ½ &frac12;    $\frac12$
> fraction three-fourths               ¾    ¾ &frac34;    $\frac34$
> inverted question mark               ¿    ¿ ¿    ?'
> capital A, grave accent              À    À À    \`A
> capital A, acute accent              Á    Á Á    \'A
> capital A, circumflex accent         Â    Â Â     \^A
> capital A, tilde                     Ã    Ã Ã    \~A
> capital A, dieresis or umlaut mark   Ä    Ä Ä      \"A
> capital A, ring                      Å    Å Å     \AA
> capital AE diphthong (ligature)      Æ    Æ Æ     \AE


