[R] TM reader with text

Thu Mar 1 20:23:46 CET 2012

On Thursday, 1 March 2012 at 07:07 -0800, Mickael R problem wrote:
> Hi Richard,
> clearly there is a problem with Latin ligatures, because the result of my
> call to findFreqTerms includes words such as:
> "<U+FB01>n"          "<U+FB01>nancement"
> "<U+FB01>nancier"    "<U+FB01>nancière"   "<U+FB01>nancières"
> "<U+FB01>nanciers"   "<U+FB01>xe"
> where U+FB01 is the code for the Latin small ligature "fi". So the problem
> is well identified.
> 
> Now, how can I treat it? The package TAU seems to offer a solution for
> text but not for a corpus.
> 
> Quotation from the TAU documentation:
>
> translate    Translate Unicode Latin Ligatures
>
> Description
> Translate Unicode “Latin ligature” characters to their respective
> constituents.
>
> Usage
> translate_Unicode_latin_ligatures(x)
>
> Arguments
> x    a character vector in UTF-8 encoding.
>
> Details
> In typography, a ligature occurs where two or more graphemes are
> joined as a single glyph. (See
> http://en.wikipedia.org/wiki/Typographic_ligature for more information.)
> Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:
>
> Code Name
> 0132 LATIN CAPITAL LIGATURE IJ
> 0133 LATIN SMALL LIGATURE IJ
> 0152 LATIN CAPITAL LIGATURE OE
> 0153 LATIN SMALL LIGATURE OE
> FB00 LATIN SMALL LIGATURE FF
> FB01 LATIN SMALL LIGATURE FI
> FB02 LATIN SMALL LIGATURE FL
> FB03 LATIN SMALL LIGATURE FFI
> FB04 LATIN SMALL LIGATURE FFL
> FB05 LATIN SMALL LIGATURE LONG S T
> FB06 LATIN SMALL LIGATURE ST
> 
> translate_Unicode_latin_ligatures translates these to their respective
> constituent characters.
> 
> I need this kind of function for a corpus, not only for text or character
> vectors. Any ideas?
Try:
corpus <- tm_map(corpus, translate_Unicode_latin_ligatures)
(with 'corpus' your corpus, of course ;-)
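
For a self-contained check, here is a minimal sketch of the same idea on a toy corpus. The two sample sentences and the variable names are made up for illustration. It assumes the tm and tau packages are installed, and it wraps the function in content_transformer(), which more recent versions of tm expect for functions that work on plain character vectors; with the tm version current at the time of this message, the direct call above should be enough.

```r
library(tm)   # corpus handling: VCorpus, VectorSource, tm_map, content_transformer
library(tau)  # provides translate_Unicode_latin_ligatures()

# Hypothetical sample documents containing the "fi" ligature (U+FB01),
# written with escapes so the example does not depend on the file encoding.
docs <- c("Le \ufb01nancement est \ufb01xe.",
          "Un plan \ufb01nancier.")

corpus <- VCorpus(VectorSource(docs))

# Apply the ligature translation to every document in the corpus.
# content_transformer() adapts a character-vector function so that
# tm_map() returns documents rather than bare strings.
corpus <- tm_map(corpus,
                 content_transformer(translate_Unicode_latin_ligatures))

# Inspect the cleaned text; the ligature should now be spelled out as "fi".
print(content(corpus[[1]]))
print(content(corpus[[2]]))
```

After this step, findFreqTerms on a term-document matrix built from the cleaned corpus should no longer return terms starting with <U+FB01>.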