[R] TM reader with text

Mickael R problem clevenot.mickael at gmail.com
Thu Mar 1 16:07:15 CET 2012


Hi Richard,
There is clearly a problem with Latin ligatures: the result of my call to
findFreqTerms contains terms such as

"<U+FB01>n"          "<U+FB01>nancement"  "<U+FB01>nancier"
"<U+FB01>nancière"   "<U+FB01>nancières"  "<U+FB01>nanciers"
"<U+FB01>xe"

where U+FB01 is the code point of the Latin small ligature fi. So the
problem is now well identified.

Now, how can I deal with it? The tau package seems to offer a solution for
character vectors, but not for a corpus.

Quotation from the tau manual:

translate               Translate Unicode Latin Ligatures

Description
  Translate Unicode “Latin ligature” characters to their respective
  constituents.

Usage
  translate_Unicode_latin_ligatures(x)

Arguments
  x    a character vector in UTF-8 encoding.

Details
  In typography, a ligature occurs where two or more graphemes are
  joined as a single glyph. (See
  http://en.wikipedia.org/wiki/Typographic_ligature for more information.)
  Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:

  Code  Name
  0132  LATIN CAPITAL LIGATURE IJ
  0133  LATIN SMALL LIGATURE IJ
  0152  LATIN CAPITAL LIGATURE OE
  0153  LATIN SMALL LIGATURE OE
  FB00  LATIN SMALL LIGATURE FF
  FB01  LATIN SMALL LIGATURE FI
  FB02  LATIN SMALL LIGATURE FL
  FB03  LATIN SMALL LIGATURE FFI
  FB04  LATIN SMALL LIGATURE FFL
  FB05  LATIN SMALL LIGATURE LONG S T
  FB06  LATIN SMALL LIGATURE ST

  translate_Unicode_latin_ligatures translates these to their respective
  constituent characters.

I need this kind of function for a corpus, not only for plain text or
character vectors. Any ideas?
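
To make the request concrete, here is a minimal sketch of the kind of
thing I have in mind. It rests on assumptions: translate_ligatures is a
hypothetical base-R stand-in for tau's translate_Unicode_latin_ligatures
(mapping FB05 to "st" is my own simplification), and the corpus step
assumes a tm version that provides content_transformer(). I do not know
whether this is the recommended way.

```r
# Ligature code points from the tau manual, and their constituent letters.
ligatures <- c("\u0132", "\u0133", "\u0152", "\u0153",
               "\ufb00", "\ufb01", "\ufb02", "\ufb03",
               "\ufb04", "\ufb05", "\ufb06")
constituents <- c("IJ", "ij", "OE", "oe",
                  "ff", "fi", "fl", "ffi",
                  "ffl", "st", "st")  # FB05 (long s + t) simplified to "st"

# Hypothetical helper: replace each ligature in a character vector.
translate_ligatures <- function(x) {
  for (i in seq_along(ligatures)) {
    x <- gsub(ligatures[i], constituents[i], x, fixed = TRUE)
  }
  x
}

# On a plain character vector:
translate_ligatures(c("\ufb01n", "\ufb01xe"))  # "fin" "fixe"

# On a corpus (assumption: tm is installed and has content_transformer()).
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(c("le \ufb01nancement",
                                           "un prix \ufb01xe")))
  # Wrap the character-vector function so it applies to each document.
  corpus <- tm::tm_map(corpus, tm::content_transformer(translate_ligatures))
}
```

If tau is installed, translate_Unicode_latin_ligatures could presumably be
passed to content_transformer() in place of the stand-in.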
Thanks for your comments and answers. We are making progress!
Mickaël

--
View this message in context: http://r.789695.n4.nabble.com/TM-reader-with-text-tp4433394p4435229.html
Sent from the R help mailing list archive at Nabble.com.


