[R] Using a text file as a removeWord dictionary in tm_map

Jim Holtman jholtman at gmail.com
Tue Mar 3 18:04:37 CET 2015


Send me a copy of your file so I can see what it looks like and what the output should be.



-------- Original message --------
From: Sun Shine <phaedrusv at gmail.com>
Date: 03/03/2015 09:43 (GMT-05:00)
To: jim holtman <jholtman at gmail.com>
Cc: r-help <r-help at r-project.org>
Subject: Re: [R] Using a text file as a removeWord dictionary in tm_map

Hi again

I've now had the chance to try this out, and using scan() doesn't seem 
to work either.

This is what I used:

1) I generated a plain text file called stopDict.txt. This file is of 
the format: "a, bunch, of, words, to, use"

2) I invoked scan(), like this:
> userStopList <- scan(text = '~/path/to/stopDict.txt', what = " ", sep = ",")

3) Then I used the externally generated list as stop words:
> docs <- tm_map(docs, removeWords, userStopList)

4) When I go to inspect the document, at least two of the user-defined 
stop words are still in the text

Is there a further argument I should be passing to scan(), or is the 
stopDict.txt file not set up the correct way? I tried each term 
separated by ' ' and ',', (e.g. 'all', 'the', 'text') but that didn't 
work, neither does it seem to work when the whole list is enclosed 
within quotes (e.g. "all, the, text").
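
A possible cause, assuming the call in step 2 is exactly what was run: scan(text = ...) treats its argument as the data to parse rather than as a file name, so the result would be the path string itself. Below is a minimal sketch that uses the file argument instead. The temporary file and the names stop_path, as_text and user_stop_list are made up for illustration, and strip.white = TRUE is added on the assumption that the file has a blank after each comma:

```r
# Hypothetical stop-word file, written to a temporary location for the demo
stop_path <- tempfile(fileext = ".txt")
writeLines("a, bunch, of, words, to, use", stop_path)

# text = treats its argument as the data itself, so the path comes back
as_text <- scan(text = stop_path, what = "", sep = ",", quiet = TRUE)

# file = opens the file; strip.white drops the blank after each comma
user_stop_list <- scan(file = stop_path, what = "", sep = ",",
                       strip.white = TRUE, quiet = TRUE)
print(user_stop_list)
```

The resulting character vector could then be passed on as in step 3, i.e. tm_map(docs, removeWords, user_stop_list).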

While not critical to have the capacity to read in an externally 
generated list, it sure would be helpful.

Thanks.

Sun


On 02/03/15 07:36, Sun Shine wrote:
> Thanks Jim.
>
> I thought that I was passing a vector, not realising I had converted 
> this to a list object.
>
> I haven't come across the scan() function so far, so this is good to 
> know.
>
> Good explanation - I'll give this a go when I can get back to that 
> piece of work later today.
>
> Thanks again.
>
> Regards,
>
> Sun
>
>
> On 01/03/15 21:13, jim holtman wrote:
>> The 'read.table' was creating a data.frame (not a vector) and applying
>> 'c' to it converted it to a list.  You should always look at the object
>> you are creating.  You probably want to use 'scan'.
>>
>> ======================
>>> testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
>>> # reading in with read.table creates a data.frame
>>> df_words <- read.table(text = testFile, sep = ',')
>>> df_words  # not a vector
>>          V1   V2    V3      V4           V5 V6  V7 V8      V9
>> 1 Although this query applies specifically to the tm package
>>> c(df_words)  # this results in a list
>> $V1
>> [1] Although
>> Levels: Although
>> $V2
>> [1] this
>> Levels: this
>> $V3
>> [1] query
>> Levels: query
>> $V4
>> [1] applies
>> Levels: applies
>> $V5
>> [1] specifically
>> Levels: specifically
>> $V6
>> [1] to
>> Levels: to
>> $V7
>> [1] the
>> Levels: the
>> $V8
>> [1] tm
>> Levels: tm
>> $V9
>> [1] package
>> Levels: package
>>> # now read with 'scan'
>>> scan_words <- scan(text = testFile, what = '', sep = ',')
>> Read 9 items
>>> scan_words
>> [1] "Although"     "this"         "query"        "applies"      "specifically" "to"
>> [7] "the"          "tm"           "package"
>>>
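
The advice to look at the object you are creating can be sketched with class(). The snippet below is an illustration and not part of the original transcript: stringsAsFactors = FALSE is set explicitly (it is the default from R 4.0.0 on), and the names test_file and words are made up. It also shows one way to turn the single data.frame row into a plain character vector, should read.table already be in use:

```r
test_file <- "Although,this,query,applies,specifically,to,the,tm,package"

# read.table returns a data.frame with one column per word
df_words <- read.table(text = test_file, sep = ",", stringsAsFactors = FALSE)
class(df_words)      # "data.frame"

# c() on a data.frame yields a list, not a character vector
class(c(df_words))   # "list"

# Flatten the single row into a character vector of words
words <- unlist(df_words[1, ], use.names = FALSE)
is.character(words)
```

A vector of this kind is what the scan() call above produces directly, which is why scan() is the shorter route.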
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>>
>> On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedrusv at gmail.com> wrote:
>>> Hi list
>>>
>>> Although this query applies specifically to the tm package, perhaps it's
>>> something that others might be able to lend a thought to.
>>>
>>> Using tm to do some initial text mining, I want to include an external
>>> (to R) generated dictionary of words that I want removed from the corpus.
>>>
>>> I have created a comma separated list of terms in " " marks in a
>>> stopList.txt plain UTF-8 file. I want to read this into R, so do:
>>>
>>>> stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')
>>> When I want to load it as part of the removeWords function in tm, I do:
>>>
>>>> docs <- tm_map(docs, removeWords, stopDict)
>>> which has no effect. Neither does:
>>>
>>>> docs <- tm_map(docs, removeWords, c(stopDict))
>>> What am I not seeing/ doing?
>>>
>>> How do I pass a text file with pre-defined terms to the removeWords
>>> transform of tm?
>>>
>>> Thanks for any ideas.
>>>
>>> Cheers
>>>
>>> Sun
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>

