[R] Removing words and initials with tm

Sun Shine phaedrusv at gmail.com
Fri Apr 10 12:19:51 CEST 2015


Hi list

With the tm package, part of the pre-processing work is removing 
unwanted words from the corpus.

I wish to remove people's names and also their initials, which are 
peppered throughout the corpus. The trouble is that some people's 
initials are the same as parts of common words, so removing them mangles 
those words: e.g. removing 'am' turns 'became' into 'bec e', removing 
'ec' turns 'because' into 'b ause', and removing 'ar' turns 'arrival' 
into 'rival' (which has a completely different meaning).
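
To make the problem concrete, here is a minimal sketch in base R (plain 
gsub(), not tm itself). The words and initials are just the examples 
above, and the naive "replace wherever it occurs" approach is an 
assumption about what is going wrong, not necessarily what tm does 
internally.

```r
# Illustrative examples only: each initial paired with the word it collides with.
words    <- c("became", "because", "arrival")
initials <- c("am", "ec", "ar")

# Naive removal: blank out the initial wherever it occurs, even mid-word.
# fixed = TRUE treats the initial as a literal string, not a regular expression.
mangled <- mapply(function(w, i) gsub(i, " ", w, fixed = TRUE),
                  words, initials, USE.NAMES = FALSE)

print(mangled)
```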

Is there any way of doing this without leaving a trail of nonsense 
half-terms behind? I suspect that it might have something to do with 
regular expressions, but to be honest, I'm (currently) pretty crap with 
those.
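
A sketch of the sort of regular expression that might do this, assuming 
the initials only need removing when they stand alone as whole words. It 
uses base R's gsub() with the \b word-boundary anchor; the sample text 
and the variable names are made up for illustration.

```r
# Hypothetical sample text and initials (not from the real corpus).
txt      <- "am said it became clear because ec saw the arrival of ar"
initials <- c("am", "ec", "ar")

# \\b matches a word boundary, so each initial is matched only when it
# stands alone as a whole word, never inside a longer word.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", txt, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

With this pattern 'became', 'because' and 'arrival' are left intact, 
because the initials inside them are not surrounded by word boundaries.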

Would it make a difference if I removed initials and names *prior* to 
converting all text to lower case? That way I would remove 'AM', and 
because 'became' is lower case, it should remain unaffected.
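
The ordering idea could be sketched like this in base R, assuming the 
initials always appear in upper case in the raw text (which may not hold 
everywhere in a real corpus). The sample text is hypothetical. Matching 
is case-sensitive by default, so the lower-case words are not touched; 
the word boundaries are kept as an extra safeguard against upper-case 
words that merely contain the same letters.

```r
# Hypothetical raw text, before any lower-casing.
txt      <- "AM said it became clear because EC saw the arrival of AR"
initials <- c("AM", "EC", "AR")

# Step 1: remove the upper-case initials as whole words (case-sensitive match).
pattern     <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
no_initials <- gsub(pattern, "", txt, perl = TRUE)

# Step 2: tidy the leftover whitespace, then convert to lower case.
lowered <- tolower(trimws(gsub("\\s+", " ", no_initials)))

print(lowered)
```

One caveat: any genuine upper-case token that happens to equal an 
initial (e.g. 'AM' as in a time of day) would be removed as well.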

Any recommendations on how best to proceed with this?

Thanks as always.
Sun


