[R] Removing words and initials with tm
Sun Shine
phaedrusv at gmail.com
Fri Apr 10 13:37:48 CEST 2015
Thanks Jim
Can you say more about an R spell checker, or were you thinking of
opening the parsed documents in a word processor, e.g. LibreOffice?
After stemming the documents, most of the words are mangled, e.g.
'people' becomes 'peopl', so I think the spell checker would go crazy! I
think a lot of this comes down to the order in which one runs the
different transformations.
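
To make the ordering point concrete, here is a minimal sketch in base R. It assumes the initials are known in advance and appear in upper case in the raw text; the sample sentences and the `initials` vector are made up for illustration and are not from the real corpus. The idea is to strip initials as whole words (using the `\\b` word-boundary anchor) while the text still has its original case, and only then convert to lower case. The last lines show one possible workaround for the "York" case, using a Perl-style negative lookbehind so that "York" is kept when it follows "New ".

```r
# Hypothetical sample text and initials, for illustration only.
docs <- c("AM became chair after the arrival of AR.",
          "EC left because NM Jones ran an NMR scan.")
initials <- c("AM", "AR", "EC", "NM")

# \\b is a word boundary, so "NM" only matches as a whole word
# and the acronym "NMR" is left alone.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Step 1: remove initials while the text still has its original case,
# so lower-case words such as "became" cannot match.
no_initials <- gsub(pattern, "", docs)

# Step 2: only now convert to lower case.
lowered <- tolower(no_initials)

# Step 3: collapse the leftover runs of whitespace and trim the ends.
cleaned <- trimws(gsub("[[:space:]]+", " ", lowered))
print(cleaned)

# Possible workaround for the surname/place-name clash: drop "York"
# unless it is directly preceded by "New " (needs perl = TRUE).
york <- gsub("(?<!New )\\bYork\\b", "",
             "Reginald York took a holiday in New York.", perl = TRUE)
print(york)
```

Stemming and any other lower-case transformations would come after step 3. The lookbehind only covers the one clash it names; an unseen corpus would still need its own list of exceptions.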
Cheers
Sun
On 10/04/15 12:30, Jim Lemon wrote:
> Hi Sun,
> Good thinking. Looking at your reply, I realized that you may be able
> to run a spell checker over the output to pick up mangled words.
>
> Jim
>
>
> On Fri, Apr 10, 2015 at 9:17 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
> Hey Jim
>
> So far I've re-run the process and sub'bed initials and proper
> names with blank space, and changed other names (including
> acronyms) to something less tricky (your e.g. #1 NMR is therefore
> "NucMagRes", etc.) *before* I converted to lower case. By and
> large, that seems to cut it, at least for my present purposes.
>
> I don't have a workaround for your e.g. #2 though!
>
> One really has to have a relatively decent handle on the scope of
> the variations and text content first. I'm not sure how one would
> do this kind of thing effectively on a large and unseen corpus.
>
> Anyway, thanks for your reply and thoughts.
>
> Sun
>
>
> On 10/04/15 11:38, Jim Lemon wrote:
>> Hi Sun,
>> In fact, case sensitivity is the default in functions like "sub".
>> The problem may then become separating initials from acronyms if
>> they are present in the corpus:
>>
>> gsub("NM","","An NMR was performed on NM Jones")
>> [1] "An R was performed on  Jones"
>>
>> How you are going to deal with names like York may also be tricky:
>>
>> gsub("York","","Reginald York took a holiday in New York.")
>> [1] "Reginald  took a holiday in New ."
>>
>> Jim
>>
>>
>> On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>>
>> Hi list
>>
>> Using the tm package, part of the pre-processing work is to
>> remove words, etc. from the corpus.
>>
>> I wish to remove people's names and also their initials, which
>> are peppered throughout the corpus. But some people's initials
>> are the same as parts of common words, so removing them mangles
>> those words: e.g. removing 'am' turns 'became' into 'bec e',
>> removing 'ec' turns 'because' into 'b ause', and removing 'ar'
>> turns 'arrival' into 'rival' (which has a completely different
>> meaning).
>>
>> Is there any way of doing this without leaving a trail of
>> nonsense half-terms behind? I suspect that it might have
>> something to do with regular expressions, but to be honest,
>> I'm (currently) pretty crap with those.
>>
>> Would it make a difference if I removed initials and names
>> *prior* to converting all text to lower case? That way, when I
>> remove 'AM', 'became' should remain unaffected because it is
>> lower case.
>>
>> Any recommendations on how best to proceed with this?
>>
>> Thanks as always.
>> Sun
>>
>> ______________________________________________
>> R-help at r-project.org <mailto:R-help at r-project.org> mailing
>> list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible
>> code.
>>
>>
>
>