[R] TM reader with text

Mickael R problem clevenot.mickael at gmail.com
Sun Mar 4 01:56:37 CET 2012


Hello everybody,
I am not giving up the fight, but it is hard. I have found a solution for
the ligatures: a better converter that translates PDF to plain text more
precisely. But a new problem has occurred. In French particularly, though it
should be the case in English too, apostrophes ('), quotation marks (") and
brackets pollute the word counts. At the moment the punctuation-removal
function (removePunctuation) is not able to handle this "punctuation".

The solution would be a more precise punctuation-removal function that
strips every bracket. The difficulty is that brackets are not separated from
the word by a space; they normally sit just before or after it. So the
punctuation removal treats the bracket as part of the word.
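
To illustrate the idea, here is a minimal sketch in base R. The name
strip_punct is made up for this example and is not part of tm. It assumes
that replacing each punctuation character with a space (rather than deleting
it) is acceptable, so that "l'homme" is counted as "l" and "homme".

```r
# Hypothetical helper (not part of tm): replace punctuation with a space,
# including typographic quotes and guillemets, even when attached to a word.
strip_punct <- function(x) {
  # ASCII punctuation plus the curly quotes and guillemets, given as escapes
  x <- gsub("[[:punct:]\u2018\u2019\u201C\u201D\u00AB\u00BB]", " ", x)
  # Collapse the runs of whitespace left behind, then trim both ends
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

strip_punct("l'homme (dit-il) a dit \"oui\"")
```

With tm, a function like this could probably be applied to every document
through tm_map; the exact wrapper needed depends on the tm version.
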
Another, less important problem is that words are miscounted because of
forms with or without a final s, and so on. Maybe the TermDocumentMatrix
function has an option to ask for only the word itself, but I cannot find it.

For the moment I handle this problem by hand. I open every text in Word and
eliminate all the brackets with a small macro. But that is not an easy way
to work with several hundred texts in my corpus.
For plurals, I take the word with and without the s and sort out the
difference myself. Fortunately, I only want to keep the 40 most meaningful
words of the corpus.
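
The manual plural step could be sketched roughly as below, in base R. This
assumes the goal is to merge the counts of a word with and without a final
s, and it ranks words by raw frequency only, which is a stand-in for "most
meaningful". The name top_words and the rule "drop one trailing s from
words longer than 3 letters" are assumptions for this example; a real
stemmer would be more robust.

```r
# Hypothetical helper: count words after naive plural folding and keep the
# n most frequent ones.
top_words <- function(texts, n = 40) {
  # Lower-case, then split on anything that is not a letter
  words <- unlist(strsplit(tolower(texts), "[^[:alpha:]]+"))
  words <- words[nchar(words) > 0]
  # Naive plural folding: drop one trailing "s" from words longer than 3 letters
  long <- nchar(words) > 3
  words[long] <- sub("s$", "", words[long])
  # Frequency table, most frequent first, truncated to at most n entries
  counts <- sort(table(words), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

top_words(c("the cats and the cat", "dogs dog dogs"), n = 3)
```

This rule also truncates words that simply end in s, so the resulting list
still needs a check by eye.
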
I know what kind of improvement could be made, but I am just a user, not an
engineer. I think small improvements could be made by the magical engineers
who work for the community, as I modestly try to do with my comments.
Thanks for everything,
Mickaël

--
View this message in context: http://r.789695.n4.nabble.com/TM-reader-with-text-tp4433394p4442728.html
Sent from the R help mailing list archive at Nabble.com.
