[R] Removing words and initials with tm
Sun Shine
phaedrusv at gmail.com
Fri Apr 10 12:19:51 CEST 2015
Hi list
I'm using the tm package, and part of the pre-processing work is to
remove words, etc. from the corpus.
I wish to remove people's names and also their initials, which are
peppered throughout the corpus. But some people's initials are the same
as parts of common words, so removing them mangles those words: e.g.
removing 'am' turns 'became' into 'bec e', removing 'ec' turns 'because'
into 'b ause', and removing 'ar' turns 'arrival' into 'rival' (which has
a completely different meaning).
Is there any way of doing this without leaving a trail of nonsense
half-terms behind? I suspect that it might have something to do with
regular expressions, but to be honest, I'm (currently) pretty crap with
those.
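
For illustration, here is a minimal sketch of the regular-expression idea in base R, assuming the initials only need removing when they stand alone as whole words. The sample text and the `initials` vector are made up for the example, and the code works on a plain character vector rather than a tm corpus.

```r
# Hypothetical sample text and initials, for illustration only.
txt <- c("am became chair because ec resigned",
         "the arrival of ar was late")
initials <- c("am", "ec", "ar")

# \\b matches a word boundary, so each initial is matched only when it
# stands alone as a whole word, never inside a longer word.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

cleaned <- gsub(pattern, "", txt, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

With this pattern 'became', 'because' and 'arrival' are left intact, because the initials inside them are not bounded by word boundaries on both sides.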
Would it make a difference if I removed initials and names *prior* to
converting all text to lower case? That is, if I remove 'AM' while
'became' is still in lower case, should 'became' remain unaffected?
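
A rough sketch of that ordering in base R, assuming the initials appear in upper case in the raw text (the sample sentence and `upper_initials` are hypothetical). `gsub` is case-sensitive by default, so 'AM' does not match the 'am' inside 'became'; the word boundaries are kept anyway, since an upper-case word could still contain the same letters.

```r
# Hypothetical raw text, before any lower-casing.
raw <- "AM became chair because EC resigned"
upper_initials <- c("AM", "EC")

# Case-sensitive, whole-word match on the upper-case initials.
pattern_uc <- paste0("\\b(", paste(upper_initials, collapse = "|"), ")\\b")
no_initials <- gsub(pattern_uc, "", raw, perl = TRUE)

# Tidy the whitespace first, then convert to lower case as the last step.
lowered <- tolower(trimws(gsub("\\s+", " ", no_initials)))
print(lowered)
```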
Any recommendations on how best to proceed with this?
Thanks as always.
Sun