[R] TM reader with text
Mickael R problem
clevenot.mickael at gmail.com
Thu Mar 1 16:07:15 CET 2012
Hi Richard,
Clearly there is a problem with Latin ligatures: the terms returned by
findFreqTerms include words such as
"<U+FB01>n" "<U+FB01>nancement" "<U+FB01>nancier" "<U+FB01>nancière"
"<U+FB01>nancières" "<U+FB01>nanciers" "<U+FB01>xe"
where U+FB01 is the code point of the Latin small ligature "fi". So the
problem is now well identified.
Now, how can I treat it? The tau package seems to offer a solution for
character vectors, but not for a corpus.
Quotation from the tau documentation:

translate    Translate Unicode Latin Ligatures

Description
Translate Unicode “Latin ligature” characters to their respective
constituents.

Usage
translate_Unicode_latin_ligatures(x)

Arguments
x    a character vector in UTF-8 encoding.

Details
In typography, a ligature occurs where two or more graphemes are
joined as a single glyph. (See
http://en.wikipedia.org/wiki/Typographic_ligature for more information.)
Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:
Code Name
0132 LATIN CAPITAL LIGATURE IJ
0133 LATIN SMALL LIGATURE IJ
0152 LATIN CAPITAL LIGATURE OE
0153 LATIN SMALL LIGATURE OE
FB00 LATIN SMALL LIGATURE FF
FB01 LATIN SMALL LIGATURE FI
FB02 LATIN SMALL LIGATURE FL
FB03 LATIN SMALL LIGATURE FFI
FB04 LATIN SMALL LIGATURE FFL
FB05 LATIN SMALL LIGATURE LONG S T
FB06 LATIN SMALL LIGATURE ST
translate_Unicode_latin_ligatures translates these to their respective
constituent characters.
I need this kind of function for a corpus, not only for text or character
vectors. Any ideas?
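
One possible direction, as a minimal sketch only: a plain character-vector
function can usually be applied to every document of a corpus. The base R
code below rebuilds the ligature table quoted above and applies it to a list
of strings standing in for a corpus. The names lig_from, lig_to and
expand_ligatures are made up for this illustration, and mapping U+FB05
(long s + t) to "st" is an assumption, not necessarily what tau does.

```r
# Ligature code points and their constituents, following the quoted table.
lig_from <- c("\u0132", "\u0133", "\u0152", "\u0153", "\uFB00", "\uFB01",
              "\uFB02", "\uFB03", "\uFB04", "\uFB05", "\uFB06")
lig_to   <- c("IJ", "ij", "OE", "oe", "ff", "fi",
              "fl", "ffi", "ffl",
              "st",   # assumption: long s + t rendered as plain "st"
              "st")

# Hypothetical helper: replace each ligature in a character vector
# by its constituent letters.
expand_ligatures <- function(x) {
  for (i in seq_along(lig_from)) {
    x <- gsub(lig_from[i], lig_to[i], x, fixed = TRUE)
  }
  x
}

# A list of strings standing in for the documents of a corpus.
docs <- list("\uFB01n", "\uFB01nancement", "\uFB01xe")
cleaned <- lapply(docs, expand_ligatures)

# With tm, the same idea would presumably be a tm_map() call
# (content_transformer() is only available in more recent tm versions):
# corpus <- tm_map(corpus, content_transformer(expand_ligatures))
```

If tau is installed, translate_Unicode_latin_ligatures could presumably be
passed to tm_map in the same way, instead of the hand-written helper.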
Thanks for your comments and answers. We are making progress!
Mickaël
--
View this message in context: http://r.789695.n4.nabble.com/TM-reader-with-text-tp4433394p4435229.html
Sent from the R help mailing list archive at Nabble.com.