[R] Removing words and initials with tm
Bob Green
bgreen at dyson.brisnet.org.au
Sat Apr 11 13:04:08 CEST 2015
Hello Sun,
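The order of the tm transformations makes a lot of difference. For
example, removeWords is case-sensitive, so whether you lower-case the
text before or after it changes what gets removed.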
It isn't a shortcut, but if you identify all the names you could
create your own stop word list ("name1" and "name2" below are
placeholders for the names you find):
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "name1", "name2"))
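
As a rough sketch of how the ordering and a custom stop word list
interact, assuming the tm package is installed. The two sample
sentences and the names in my_stopwords are made up for illustration
and are not from your data:

```r
library(tm)

# Made-up sample texts (assumption, for illustration only)
docs <- c("Mr Smith moved to York.",
          "The Duke of York met J. Smith.")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical custom list: standard English stop words plus chosen names
my_stopwords <- c(stopwords("english"), "smith", "duke")

# Lower-case first: removeWords matches case-sensitively, so "Smith"
# would survive if this step came after the removal
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, my_stopwords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
result <- sapply(corpus, content)
print(result)
```

Here "york" is deliberately left out of the list, so it stays in both
texts while "smith" is removed. Swapping the tolower and removeWords
steps would leave "Smith" in place.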
In the case of York, Key Word in Context (KWIC) syntax could be used
to check how certain words are used. You could identify the word
usages you want to remove or retain and rename the relevant
instances accordingly.
This is labour intensive, but Gries, in his Quantitative Corpus
Linguistics with R, notes that sometimes time spent on trying to
refine code might be better spent on manual analysis (p. 164). This
book includes a KWIC-type function (p. 127), but I haven't been able
to work out how to modify it to read more than six words either side
of the specified word. Six should be adequate for your purpose.
Jockers' book also includes a KWIC function, but I don't believe it
searches the entire corpus, only a specified text.
I recently checked and tm doesn't have a KWIC function, but for the
R-talented (which excludes me) it might be possible to write one. For
example, Jim Holtman once wrote a KWIC function to identify word use
in a CSV file.
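
A minimal sketch of what such a function might look like in base R.
This is not the function from Gries, Jockers or Jim Holtman; the name
kwic, its arguments and the sample texts are all assumptions for
illustration. It searches every text in a character vector and takes
the number of context words as a parameter:

```r
# Hypothetical helper: keyword-in-context over a character vector of texts
kwic <- function(texts, keyword, window = 6) {
  rows <- list()
  for (i in seq_along(texts)) {
    # Split on whitespace; tokens keep their punctuation for readability
    tokens <- unlist(strsplit(texts[i], "\\s+"))
    tokens <- tokens[nzchar(tokens)]
    # Match on lower-cased tokens with punctuation stripped
    bare <- tolower(gsub("[[:punct:]]", "", tokens))
    for (h in which(bare == tolower(keyword))) {
      # Up to `window` words on each side of the hit
      left  <- tail(tokens[seq_len(h - 1)], window)
      right <- head(tokens[-seq_len(h)], window)
      rows[[length(rows) + 1]] <- data.frame(
        doc     = i,
        left    = paste(left, collapse = " "),
        keyword = tokens[h],
        right   = paste(right, collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(rows) == 0) {
    # No hits: return an empty data frame with the same columns
    return(data.frame(doc = integer(0), left = character(0),
                      keyword = character(0), right = character(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

# Made-up sample texts (assumption, for illustration only)
texts <- c("She travelled to York by train.",
           "The Duke of York opened the new bridge in New York last week.")
hits <- kwic(texts, "york", window = 3)
print(hits)
```

Each row gives the document index and the words either side of one
hit, so place-name uses ("to York") can be told apart from others
("Duke of York", "New York") by eye. To apply it to a tm corpus, the
texts would first have to be extracted as a character vector.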
Bob