[R] Removing words and initials with tm
Jim Lemon
drjimlemon at gmail.com
Fri Apr 10 12:38:21 CEST 2015
Hi Sun,
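In fact, matching is case sensitive by default in functions like "sub" and
"gsub" (ignore.case = FALSE), so removing upper case initials before
converting to lower case will leave words like "became" untouched. The
problem may then become separating initials from acronyms if they are
present in the corpus: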
gsub("NM","","An NMR was performed on NM Jones")
[1] "An R was performed on  Jones"
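One possible way around this, as a minimal sketch rather than a tested recipe for a whole corpus, is to anchor the pattern with word boundaries ("\\b") so that the initials are only removed when they stand alone. The variable names here are just for illustration:

```r
# Illustrative input taken from the example above
x <- "An NMR was performed on NM Jones"

# \\b matches at a word boundary, so "NM" inside "NMR" is not matched
res <- gsub("\\bNM\\b", "", x, perl = TRUE)

# Collapse the run of spaces left behind by the removal
res <- gsub(" +", " ", res)

print(res)
# [1] "An NMR was performed on Jones"
```

This still assumes that the initials never appear as a standalone acronym with a different meaning.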
Names that also occur in place names, like York, may be tricky as well:
gsub("York","","Reginald York took a holiday in New York.")
[1] "Reginald  took a holiday in New ."
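If the place names to protect are known in advance, a negative lookbehind may help. The following is only a sketch under that assumption: it needs perl = TRUE, it handles the single hard-coded prefix "New " (chosen here purely for illustration), and a real corpus would need a longer list of exceptions:

```r
# Illustrative input taken from the example above
x <- "Reginald York took a holiday in New York."

# (?<!New ) is a negative lookbehind: "York" is skipped when it is
# immediately preceded by "New ". Lookbehind requires perl = TRUE.
res <- gsub("(?<!New )\\bYork\\b", "", x, perl = TRUE)

# Collapse the run of spaces left behind by the removal
res <- gsub(" +", " ", res)

print(res)
# [1] "Reginald took a holiday in New York."
```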
Jim
On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
> Hi list
>
> Using the tm package, part of the pre-processing work is to remove words,
> etc. from the corpus.
>
> I wish to remove people's names and also their initials, which are peppered
> throughout the corpus. But some people's initials are the same as parts of
> common words, so removing them mangles those words - e.g. 'am' turns
> 'became' into 'bec e', 'ec' turns 'because' into 'b ause', and 'ar' turns
> 'arrival' into 'rival' (which has a completely different meaning).
>
> Is there any way of doing this without leaving a trail of nonsense
> half-terms behind? I suspect that it might have something to do with
> regular expressions, but to be honest, I'm (currently) pretty crap with
> those.
>
> Would it make a difference if I removed initials and names *prior* to
> converting all text to lower case, so that when I remove 'AM', 'became'
> (being lower case) remains unaffected?
>
> Any recommendations on how best to proceed with this?
>
> Thanks as always.
> Sun