[R] reading a character translation table into R
Michael Friendly
friendly at yorku.ca
Sat Jun 8 22:31:57 CEST 2013
I have a text file (attached) that defines equivalents among characters
in latin1 (or iso-8859-1), numeric &#xxx; codes, HTML entities
and LaTeX equivalents. A portion of the file is shown inline below, but
may not be rendered well in this email.
I'd like to read this into R to use as a character translation table,
but am stuck on two things:
- The 5 fields in the file are column-aligned and are separated by 2+
white space characters.
In Perl this is trivial to read and parse via something like
    @entries = split("\n", $charTable);
    foreach (@entries) {
        ($desc, $char, $code, $html, $tex) = split(/\s\s+/);
    }
AFAIK, the only function for reading such data is utils::read.fwf, but
that requires me to specify the field widths.
I don't know of any function that allows even a simple regex like this
as a sep= argument.
- The TeX field contains many backslashed codes that need to be escaped
in R. Is it necessary to manually edit the file to change
'\pounds' --> '\\pounds', '\S' --> '\\S', etc., or is there something
like raw mode input that would do this where necessary?
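
One possible approach to both points, offered as a minimal sketch rather than a definitive solution: read the lines as plain text and split each one with strsplit() on a regular expression, much like the Perl split above. The sample rows, the variable names (lines, fields, tab) and the file name in the comment are assumptions for illustration. Backslashes are doubled below only because the sample is typed as R source code; text returned by readLines() from a file keeps its backslashes literally, so the file itself should not need editing.

```r
# Hypothetical sample in the same layout as the file: fields separated by
# two or more spaces. For the real file, something like
#   lines <- readLines("chartable.txt", encoding = "latin1")
# would replace this vector (the file name is an assumption).
lines <- c(
  "Description  Char  Code   HTML    TeX",
  "ampersand    &     &#38;  &amp;   \\&",
  "apostrophe   '     &#39;  &apos;",
  "less than    <     &#60;  &lt;    $<$"
)

# Split each line on runs of 2+ whitespace characters,
# the counterpart of split(/\s\s+/) in Perl
fields <- strsplit(lines, "\\s{2,}", perl = TRUE)

# Pad short rows (e.g. a missing TeX field) with NA so every row has 5 entries
fields <- lapply(fields, function(x) c(x, rep(NA_character_, 5 - length(x))))

# The first row is the header; the remaining rows become the table
tab <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(tab) <- fields[[1]]

print(tab)              # print() displays backslashes in escaped form
cat(tab$TeX[1], "\n")   # cat() displays the string as actually stored
```

Rows whose Char field is blank or invisible (the non-breaking space and soft hyphen entries, for instance) may split into fewer fields than expected and would need checking by hand.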
Description Char Code HTML TeX
double quote " " "
ampersand & & & \&
apostrophe ' ' '
less than < < < $<$
greater than > > > $>$
non-breaking space . ~
inverted exclamation ¡ ¡ ¡ !'
cent sign ¢ ¢ ¢
pound sterling £ £ £ \pounds
general currency sign ¤ ¤ ¤
yen sign ¥ ¥ ¥
broken vertical bar ¦ ¦ ¦
section sign § § § \S
umlaut (dieresis) ¨ ¨ ¨ \"{}
copyright © © © \copyright
feminine ordinal ª ª ª $^a$
left angle quote, guillemotleft « « « \guillemotleft
not sign ¬ ¬ ¬
soft hyphen
registered trademark ® ® ® \textregistered
macron accent ¯ ¯ ¯
degree sign ° ° ° $^o$
plus or minus ± ± ± $\pm$
superscript two ² ² ² $^2$
superscript three ³ ³ ³ $^3$
acute accent ´ ´ ´ \'{}
micro sign µ µ µ $\mu$
paragraph sign ¶ ¶ ¶ \P
middle dot · · · $\cdot$
cedilla ¸ ¸ ¸ \c{}
superscript one ¹ ¹ ¹ $^1$
masculine ordinal º º º $^o$
right angle quote, guillemotright » » » \guillemotright
fraction one-fourth ¼ ¼ ¼ $\frac14$
fraction one-half ½ ½ ½ $\frac12$
fraction three-fourths ¾ ¾ ¾ $\frac34$
inverted question mark ¿ ¿ ¿ ?'
capital A, grave accent À À À \`A
capital A, acute accent Á Á Á \'A
capital A, circumflex accent Â Â Â \^A
capital A, tilde à à à \~A
capital A, dieresis or umlaut mark Ä Ä Ä \"A
capital A, ring Å Å Å \AA
capital AE diphthong (ligature) Æ Æ Æ \AE
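
As a hedged illustration of the intended use as a translation table, the sketch below maps characters to their TeX equivalents through a named character vector. The three entries are taken from the rows above; the names map and to_tex are hypothetical, and with the whole file in a data frame the vector could presumably be built from its Char and TeX columns instead.

```r
# Named vector: names are the source characters, values the TeX equivalents.
# Backslashes are doubled only because this is R source code.
map <- c("&" = "\\&", "<" = "$<$", ">" = "$>$")

# Translate each string character by character, so that text already
# substituted is never scanned again (unlike repeated gsub() calls).
to_tex <- function(x, map) {
  vapply(strsplit(x, ""), function(chars) {
    hit <- chars %in% names(map)
    chars[hit] <- map[chars[hit]]
    paste(chars, collapse = "")
  }, character(1))
}

cat(to_tex("R & D < 5", map), "\n")
```

Looking up one character at a time keeps the result independent of the order of the table rows, which matters because several TeX equivalents contain characters such as $ that could themselves appear in the table.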
--
Michael Friendly Email: friendly at yorku.ca
Professor, Psychology Dept.
York University Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street http://datavis.ca
Toronto, ONT M3J 1P3 CANADA
-------------- next part --------------
Description Char Code HTML TeX
double quote " " "
ampersand & & & \&
apostrophe ' ' '
less than < < < $<$
greater than > > > $>$
non-breaking space . ~
inverted exclamation ¡ ¡ ¡ !'
cent sign ¢ ¢ ¢
pound sterling £ £ £ \pounds
general currency sign ¤ ¤ ¤
yen sign ¥ ¥ ¥
broken vertical bar ¦ ¦ ¦
section sign § § § \S
umlaut (dieresis) ¨ ¨ ¨ \"{}
copyright © © © \copyright
feminine ordinal ª ª ª $^a$
left angle quote, guillemotleft « « « \guillemotleft
not sign ¬ ¬ ¬
soft hyphen
registered trademark ® ® ® \textregistered
macron accent ¯ ¯ ¯
degree sign ° ° ° $^o$
plus or minus ± ± ± $\pm$
superscript two ² ² ² $^2$
superscript three ³ ³ ³ $^3$
acute accent ´ ´ ´ \'{}
micro sign µ µ µ $\mu$
paragraph sign ¶ ¶ ¶ \P
middle dot · · · $\cdot$
cedilla ¸ ¸ ¸ \c{}
superscript one ¹ ¹ ¹ $^1$
masculine ordinal º º º $^o$
right angle quote, guillemotright » » » \guillemotright
fraction one-fourth ¼ ¼ ¼ $\frac14$
fraction one-half ½ ½ ½ $\frac12$
fraction three-fourths ¾ ¾ ¾ $\frac34$
inverted question mark ¿ ¿ ¿ ?'
capital A, grave accent À À À \`A
capital A, acute accent Á Á Á \'A
capital A, circumflex accent Â Â Â \^A
capital A, tilde à à à \~A
capital A, dieresis or umlaut mark Ä Ä Ä \"A
capital A, ring Å Å Å \AA
capital AE diphthong (ligature) Æ Æ Æ \AE
capital C, cedilla Ç Ç Ç \c{C}
capital E, grave accent È È È \`E
capital E, acute accent É É É \'E
capital E, circumflex accent Ê Ê Ê \^E
capital E, dieresis or umlaut mark Ë Ë Ë \"E
capital I, grave accent Ì Ì Ì \`I
capital I, acute accent Í Í Í \'I
capital I, circumflex accent Î Î Î \^I
capital I, dieresis or umlaut mark Ï Ï Ï \"I
capital Eth, Icelandic Ð Ð Ð
capital N, tilde Ñ Ñ Ñ \~N
capital O, grave accent Ò Ò Ò \`O
capital O, acute accent Ó Ó Ó \'O
capital O, circumflex accent Ô Ô Ô \^O
capital O, tilde Õ Õ Õ \~O
capital O, dieresis or umlaut mark Ö Ö Ö \"O
multiply sign × × × $\times$
capital O, slash Ø Ø Ø {\O}
capital U, grave accent Ù Ù Ù \`U
capital U, acute accent Ú Ú Ú \'U
capital U, circumflex accent Û Û Û \^U
capital U, dieresis or umlaut mark Ü Ü Ü \"U
capital Y, acute accent Ý Ý Ý \'Y
capital THORN, Icelandic Þ Þ Þ \TH
small sharp s, German (sz ligature) ß ß ß \ss
small a, grave accent à à à \`a
small a, acute accent á á á \'a
small a, circumflex accent â â â \^a
small a, tilde ã ã ã \~a
small a, dieresis or umlaut mark ä ä ä \"a
small a, ring å å å \aa
small ae diphthong (ligature) æ æ æ \ae
small c, cedilla ç ç ç \c{c}
small e, grave accent è è è \`e
small e, acute accent é é é \'e
small e, circumflex accent ê ê ê \^e
small e, dieresis or umlaut mark ë ë ë \"e
small i, grave accent ì ì ì \`i
small i, acute accent í í í \'i
small i, circumflex accent î î î \^i
small i, dieresis or umlaut mark ï ï ï \"i
small eth, Icelandic ð ð ð
small n, tilde ñ ñ ñ \~n
small o, grave accent ò ò ò \`o
small o, acute accent ó ó ó \'o
small o, circumflex accent ô ô ô \^o
small o, tilde õ õ õ \~o
small o, dieresis or umlaut mark ö ö ö \"o
division sign ÷ ÷ ÷ $\div$
small o, slash ø ø ø {\o}
small u, grave accent ù ù ù \`u
small u, acute accent ú ú ú \'u
small u, circumflex accent û û û \^u
small u, dieresis or umlaut mark ü ü ü \"u
small y, acute accent ý ý ý \'y
small thorn, Icelandic þ þ þ \th
small y, dieresis or umlaut mark ÿ ÿ ÿ \"y