[R] Removing words and initials with tm

Jim Lemon drjimlemon at gmail.com
Sat Apr 11 00:36:22 CEST 2015


Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into the
sort of work that you are doing.

Jim


On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine <phaedrusv at gmail.com> wrote:

> Thanks Jeff.
>
> I'll add that to the ever-growing list my current studies are generating
> daily. :-)
>
> Cheers
> S
>
>
>
> On 10/04/15 14:32, Jeff Newmiller wrote:
>
>> "I suspect that it might have something to do with regular expressions,
>> but to be honest, I'm (currently) pretty crap with those."
>>
>> I cannot think of a better incentive to take action on this hole in your
>> education and buckle down to learn regular expressions. There are many
>> books and tutorials available.
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                        Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
>>
>>> Hi list
>>>
>>> Using the tm package, part of the pre-processing work is to remove
>>> words, etc. from the corpus.
>>>
>>> I wish to remove people's names and also their initials, which are
>>> peppered throughout the corpus. But some people's initials are the same
>>> as parts of common words, so removing them mangles those words: e.g.
>>> removing 'am' turns 'became' into 'bec e', removing 'ec' turns 'because'
>>> into 'b ause', and removing 'ar' turns 'arrival' into 'rival' (which has
>>> a completely different meaning).
>>>
>>> Is there any way of doing this without leaving a trail of nonsense
>>> half-terms behind? I suspect that it might have something to do with
>>> regular expressions, but to be honest, I'm (currently) pretty crap with
>>> those.
>>>
>>> Would it make a difference if I removed initials and names *prior* to
>>> converting all text to lower case? That way I would remove 'AM', and
>>> because 'became' is lower case, it should remain unaffected.
>>>
>>> Any recommendations on how best to proceed with this?
>>>
>>> Thanks as always.
>>> Sun
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>
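
The two ideas raised in the original question (regular expressions, and removing initials before lower-casing) can be sketched in base R as below. This is only an illustrative sketch: the sample sentences, the `initials` vector and the `squish` helper are hypothetical and not taken from the thread, and how the pattern is wired into a tm pipeline (for instance through a custom transformation) is left as an assumption. The key point is the `\\b` word-boundary anchor, which makes 'am' match only as a whole word and not inside 'became'.

```r
# Hypothetical example data; not from the original thread.
docs <- c("AM became chair because EC and AR confirmed the arrival.",
          "am and ec noted that ar became ill.")

# Assumed list of initials to strip.
initials <- c("am", "ec", "ar")

# One pattern matching the initials only as whole words.
# \\b is a word boundary, so "am" inside "became" is left alone.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Option 1: lower-case first, then remove whole-word matches.
lowered <- gsub(pattern, "", tolower(docs), perl = TRUE)

# Option 2: remove the upper-case initials first, then lower-case.
# Initials that were already lower case in the source text survive.
pattern_upper <- paste0("\\b(", paste(toupper(initials), collapse = "|"), ")\\b")
cased <- tolower(gsub(pattern_upper, "", docs, perl = TRUE))

# Hypothetical helper: collapse the spaces left behind by the removals.
squish <- function(x) trimws(gsub("\\s+", " ", x, perl = TRUE))
lowered <- squish(lowered)
cased <- squish(cased)

print(lowered)
print(cased)
```

With word boundaries in place, the order of the two steps matters only when the corpus contains real words that are identical to an initial (such as 'am' itself): option 1 removes those too, while option 2 keeps them as long as they were not written in capitals.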
