[R] Removing words and initials with tm
Sun Shine
phaedrusv at gmail.com
Sat Apr 11 08:21:40 CEST 2015
Hi Jim
The name's come up on my radar, but that's about it. I'll look into it.
Thanks for the reference.
All the best
S
On 10/04/15 23:36, Jim Lemon wrote:
> Hi Sun,
> No, I was thinking of something like hunspell, which seems to fit into
> the sort of work that you are doing.
>
> Jim
>
>
> On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
> Thanks Jeff.
>
> I'll add that to the ever-growing list my current studies are
> generating daily. :-)
>
> Cheers
> S
>
>
>
> On 10/04/15 14:32, Jeff Newmiller wrote:
>
> "I suspect that it might have something to do with regular
> expressions, but to be honest, I'm (currently) pretty crap
> with those."
>
> I cannot think of a better incentive to take action on this
> hole in your education and buckle down to learn regular
> expressions. There are many books and tutorials available.
> ---------------------------------------------------------------------------
> Jeff Newmiller
> DCN: <jdnewmil at dcn.davis.ca.us>
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
>
> Hi list
>
> Using the tm package, part of the pre-processing work is to remove
> words, etc. from the corpus.
>
> I wish to remove people's names and also their initials, which are
> peppered throughout the corpus. But some people's initials are the
> same as parts of common words, so removing them mangles those words:
> e.g. 'am': 'became' => 'bec e', or 'ec': 'because' => 'b ause', or
> 'ar': 'arrival' => 'rival' (which has a completely different meaning).
>
> Is there any way of doing this without leaving a trail of nonsense
> half-terms behind? I suspect that it might have something to do with
> regular expressions, but to be honest, I'm (currently) pretty crap
> with those.
>
> Would it make a difference if I removed initials and names *prior* to
> converting all text to lower case, so that I remove 'AM' and, because
> 'became' is lower case, it remains unaffected?
>
> Any recommendations on how best to proceed with this?
>
> Thanks as always.
> Sun
>
> ______________________________________________
> R-help at r-project.org <mailto:R-help at r-project.org> mailing
> list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained,
> reproducible code.
>
>
>
>
>
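
The whole-word matching asked about in this thread can be sketched with base R regular expressions. This is only a minimal illustration, not a tested recipe for any particular corpus: the initials and the sample sentence below are made up, and it assumes the initials appear in upper case in the raw text, so that removal can run before the text is converted to lower case.

```r
# Hypothetical initials and sample text, invented for illustration only.
initials <- c("AM", "EC", "AR")
text <- "AM became chair because EC noted the arrival of AR yesterday."

# \b marks a word boundary, so the pattern matches whole tokens only
# and never the 'am' inside 'became' or the 'ec' inside 'because'.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Case-sensitive removal, done BEFORE any conversion to lower case.
cleaned <- gsub(pattern, "", text, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Lower-casing afterwards leaves the ordinary words intact.
cleaned <- tolower(cleaned)
print(cleaned)
```

Either safeguard alone (word boundaries, or case-sensitive matching on the original text) protects words such as 'became'. Using both also covers an upper-case initial that happens to sit inside a capitalised word. Within tm, a function like this could presumably be applied to each document through tm_map() and content_transformer(); that wiring is not shown here.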