[R] Removing words and initials with tm

Sun Shine phaedrusv at gmail.com
Fri Apr 10 13:37:48 CEST 2015


Thanks Jim

Can you say more about an R spell checker, or were you thinking of 
opening the parsed documents in a word processor, e.g. LibreOffice?
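
For what it's worth, the crudest version of a check I can imagine doing inside R is a lookup against a word list. This is only a rough sketch in base R: the `vocab` vector is a made-up stand-in for a real dictionary, and the sample text is invented to show the kind of fragments a careless substitution leaves behind.

```r
# Hypothetical reference vocabulary; a real run would load a full word list.
vocab <- c("an", "was", "performed", "on", "people", "became", "because")

# Invented text as it might look after a careless substitution
# (note the fragments "bec" and "e" left over from "became").
txt <- "people bec e an was performed"

# Split on whitespace and flag tokens that are not in the vocabulary.
tokens <- strsplit(txt, "\\s+")[[1]]
suspect <- tokens[!(tokens %in% vocab)]
print(suspect)
```

Anything flagged this way would still need checking by eye, since a small word list will also flag perfectly good words it simply doesn't contain.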

After stemming the documents, most of the words are mangled, e.g. 
'people' becomes 'peopl', so I think the spell checker would go crazy! I 
think a lot of this comes down to the order in which one runs the 
different transformations.
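
To make the ordering point concrete, here is a minimal sketch in base R of the sequence that has worked for me so far: strip the initials while the text is still in its original case, and only then convert to lower case. The sample sentence and the `initials` vector are invented for illustration, and the `\\b` word-boundary anchors are my own addition rather than anything taken from tm. Stemming would come after all of this.

```r
# Invented sample text and initials, for illustration only.
txt <- "AM became chair because NM Jones ordered an NMR."
initials <- c("AM", "NM")

# Build a case-sensitive pattern that only matches whole words,
# so "NM" inside "NMR" and "am" inside "became" are left alone.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Step 1: remove the initials while the original case is intact.
cleaned <- gsub(pattern, "", txt, perl = TRUE)

# Step 2: collapse the leftover runs of whitespace and trim the ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Step 3: only now convert to lower case.
result <- tolower(cleaned)
print(result)
```

Running the steps the other way round (lower case first, then removing 'am' and 'nm') is what produces the half-words, because by then the initials can no longer be told apart from ordinary letters.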

Cheers
Sun

On 10/04/15 12:30, Jim Lemon wrote:
> Hi Sun,
> Good thinking. Looking at your reply, I realized that you may be able 
> to run a spell checker over the output to pick up mangled words.
>
> Jim
>
>
> On Fri, Apr 10, 2015 at 9:17 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
>     Hey Jim
>
>     So far I've re-run the process, replacing initials and proper
>     names with blank space and changing other names (including
>     acronyms) to something less tricky (so the NMR in your example #1
>     becomes "NucMagRes", etc.) *before* converting to lower case. By
>     and large, that seems to cut it, at least for my present purposes.
>
>     I don't have a workaround for your example #2 though!
>
>     One really has to have a relatively decent handle on the scope of
>     the variations and text content first. I'm not sure how one would
>     do this kind of thing effectively on a large and unseen corpus.
>
>     Anyway, thanks for your reply and thoughts.
>
>     Sun
>
>
>     On 10/04/15 11:38, Jim Lemon wrote:
>>     Hi Sun,
>>     In fact, case sensitivity is the default in functions like "sub".
>>     The problem may then become separating initials from acronyms if
>>     they are present in the corpus:
>>
>>     gsub("NM","","An NMR was performed on NM Jones")
>>     [1] "An R was performed on  Jones"
>>
>>     How you are going to deal with names like York may also be tricky:
>>
>>     gsub("York","","Reginald York took a holiday in New York.")
>>     [1] "Reginald  took a holiday in New ."
>>
>>     Jim
>>
>>
>>     On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>>
>>         Hi list
>>
>>         Using the tm package, part of the pre-processing work is to
>>         remove words, etc. from the corpus.
>>
>>         I wish to remove people's names and also their initials, which
>>         are peppered throughout the corpus. The trouble is that some
>>         people's initials are the same as parts of common words, so
>>         removing them mangles those words: e.g. removing 'am' turns
>>         'became' into 'bec e', removing 'ec' turns 'because' into
>>         'b ause', and removing 'ar' turns 'arrival' into 'rival'
>>         (which has a completely different meaning).
>>
>>         Is there any way of doing this without leaving a trail of
>>         nonsense half-terms behind? I suspect that it might have
>>         something to do with regular expressions, but to be honest,
>>         I'm (currently) pretty crap with those.
>>
>>         Would it make a difference if I removed initials and names
>>         *prior* to converting all text to lower case, so I remove
>>         'AM' and because 'became' is lower case, it should remain
>>         unaffected?
>>
>>         Any recommendations on how best to proceed with this?
>>
>>         Thanks as always.
>>         Sun
>>
>>
>>
>
>

