[R] Removing words and initials with tm

Fri Apr 10 13:17:35 CEST 2015

Hey Jim

So far I've re-run the process, replacing initials and proper names 
with a blank space and changing other names (including acronyms) to 
something less tricky (your first example, NMR, is therefore "NucMagRes", 
etc.) *before* converting to lower case. By and large, that seems to cut 
it, at least for my present purposes.
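
As a rough sketch of that order of operations (the sample sentence and the 
list of initials here are made up for illustration, not taken from the 
actual corpus), matching initials only as whole, upper-case tokens might 
look something like this:

```r
# Hypothetical sample text and initials, for illustration only.
txt <- "AM became chair because EC and AR missed the arrival of the NMR report."
initials <- c("AM", "EC", "AR")

# \\b anchors the match at word boundaries, so only stand-alone tokens match.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# gsub() is case sensitive by default, so this runs *before* tolower().
cleaned <- gsub(pattern, " ", txt)

# Rename the acronym to something that survives lower-casing recognisably.
cleaned <- gsub("NMR", "NucMagRes", cleaned, fixed = TRUE)

# Collapse the runs of spaces left behind, trim, then lower-case.
cleaned <- tolower(trimws(gsub("\\s+", " ", cleaned)))
print(cleaned)
```

Within tm, the same substitutions could presumably be wrapped in 
content_transformer() and applied with tm_map() ahead of the lower-casing 
step, though I have only checked the plain-string version above.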

I don't have a workaround for your second example (names like York), though!
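
One partial approach to the York case, offered only as a sketch, is a 
negative lookbehind that leaves the word alone when a known prefix such as 
"New " comes directly before it. It assumes the contexts to protect can be 
listed in advance, and it cannot tell a surname from a bare "York" used as 
a place name:

```r
txt <- "Reginald York took a holiday in New York."

# (?<!New ) is a negative lookbehind: do not match "York" when it is
# immediately preceded by "New ". Lookbehind requires perl = TRUE.
cleaned <- gsub("(?<!New )\\bYork\\b", "", txt, perl = TRUE)
print(cleaned)
```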

One really has to have a reasonably good handle on the text content and 
the range of variations in it first. I'm not sure how one would do this 
kind of thing effectively on a large, unseen corpus.
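
A first step towards getting that handle might be to tabulate the short 
upper-case tokens before any lower-casing, and then sort initials from 
acronyms by eye. This is a minimal sketch on two invented documents; the 
2-to-4 letter limit is an arbitrary assumption:

```r
# Hypothetical documents, for illustration only.
docs <- c("An NMR was performed on NM Jones",
          "NM Jones and AR Smith reviewed the NMR data")

# Pull out stand-alone runs of 2 to 4 capital letters from each document.
hits <- unlist(regmatches(docs, gregexpr("\\b[A-Z]{2,4}\\b", docs, perl = TRUE)))

# Count how often each candidate occurs, most frequent first.
counts <- sort(table(hits), decreasing = TRUE)
print(counts)
```

The resulting table still needs a human to decide which entries are 
initials and which are acronyms worth keeping.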

Anyway, thanks for your reply and thoughts.

Sun

On 10/04/15 11:38, Jim Lemon wrote:
> Hi Sun,
> In fact, case sensitivity is the default in functions like "sub". The 
> problem may then become separating initials from acronyms if they are 
> present in the corpus:
>
> gsub("NM","","An NMR was performed on NM Jones")
> [1] "An R was performed on  Jones"
>
> How you are going to deal with names like York may also be tricky:
>
> gsub("York","","Reginald York took a holiday in New York.")
> [1] "Reginald  took a holiday in New ."
>
> Jim
>
>
> On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
>     Hi list
>
>     Using the tm package, part of the pre-processing work is to remove
>     words, etc. from the corpus.
>
>     I wish to remove people's names and also their initials, which are
>     peppered throughout the corpus. But some people's initials are the
>     same as parts of common words, so removing them mangles those
>     words: e.g. removing 'am' turns 'became' into 'bec e', 'ec' turns
>     'because' into 'b ause', and 'ar' turns 'arrival' into 'rival'
>     (which has a completely different meaning).
>
>     Is there any way of doing this without leaving a trail of nonsense
>     half-terms behind? I suspect that it might have something to do
>     with regular expressions, but to be honest, I'm (currently) pretty
>     crap with those.
>
>     Would it make a difference if I removed initials and names *prior*
>     to converting all text to lower case? I would then remove 'AM', and
>     because 'became' is lower case, it should remain unaffected.
>
>     Any recommendations on how best to proceed with this?
>
>     Thanks as always.
>     Sun