[R] Using a text file as a removeWord dictionary in tm_map

Sun Shine phaedrusv at gmail.com
Mon Mar 2 08:36:50 CET 2015


Thanks Jim.

I thought that I was passing a vector, not realising I had converted 
this to a list object.
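
A quick way to confirm this kind of mix-up is to inspect the object before passing it on. The following is only a minimal sketch using the test string from Jim's reply below; the name word_vec is a hypothetical one chosen for illustration, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Same test string as in Jim's example
testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
df_words <- read.table(text = testFile, sep = ",", stringsAsFactors = FALSE)

class(df_words)             # "data.frame"
is.list(c(df_words))        # TRUE: c() on a data.frame gives a plain list
is.character(c(df_words))   # FALSE: so it is not a character vector

# One possible way to flatten the single row into a character vector
# (word_vec is an illustrative name, not from the original thread)
word_vec <- unlist(lapply(df_words, as.character), use.names = FALSE)
is.character(word_vec)
```

Calling str() on the object gives the same information in one step.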

I haven't come across the scan() function so far, so this is good to know.

Good explanation - I'll give this a go when I can get back to that piece 
of work later today.
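
For reference, a rough sketch of how the scan() approach might look with a file of quoted, comma-separated terms. The temporary file, its three terms, and the names stop_file and stop_words are all assumptions made up for illustration (standing in for the real stopList.txt), and the tm part only runs if the package is installed.

```r
# Hypothetical stand-in for stopList.txt: quoted, comma-separated terms
stop_file <- tempfile(fileext = ".txt")
writeLines('"although","query","package"', stop_file)

# scan() returns a plain character vector; with sep = "," the
# surrounding double quotes are stripped by the default 'quote' setting
stop_words <- scan(stop_file, what = "", sep = ",", quiet = TRUE)
stopifnot(is.character(stop_words))

# Only attempted when tm is available
if (requireNamespace("tm", quietly = TRUE)) {
  docs <- tm::VCorpus(tm::VectorSource("this query applies to the tm package"))
  docs <- tm::tm_map(docs, tm::removeWords, stop_words)
  writeLines(as.character(docs[[1]]))
}
```

Note that matching in removeWords is case-sensitive, so the corpus may need lower-casing first if the stop list is all lower case.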

Thanks again.

Regards,

Sun


On 01/03/15 21:13, jim holtman wrote:
> The 'read.table' call was creating a data.frame (not a vector), and applying
> 'c' to it converted it to a list.  You should always look at the object
> you are creating.  You probably want to use 'scan'.
>
> ======================
>> testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
>> # reading with read.table creates a data.frame
>> df_words <- read.table(text = testFile, sep = ',')
>> df_words  # not a vector
>          V1   V2    V3      V4           V5 V6  V7 V8      V9
> 1 Although this query applies specifically to the tm package
>> c(df_words)  # this results in a list
> $V1
> [1] Although
> Levels: Although
> $V2
> [1] this
> Levels: this
> $V3
> [1] query
> Levels: query
> $V4
> [1] applies
> Levels: applies
> $V5
> [1] specifically
> Levels: specifically
> $V6
> [1] to
> Levels: to
> $V7
> [1] the
> Levels: the
> $V8
> [1] tm
> Levels: tm
> $V9
> [1] package
> Levels: package
>> # now read with 'scan'
>> scan_words <- scan(text = testFile, what = '', sep = ',')
> Read 9 items
>> scan_words
> [1] "Although"     "this"         "query"        "applies"      "specifically" "to"
> [7] "the"          "tm"           "package"
>>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedrusv at gmail.com> wrote:
>> Hi list
>>
>> Although this query applies specifically to the tm package, perhaps it's
>> something that others might be able to lend a thought to.
>>
>> Using tm to do some initial text mining, I want to include a dictionary of
>> words, generated outside R, that I want removed from the corpus.
>>
>> I have created a comma-separated list of terms in double quotes in a
>> plain UTF-8 file, stopList.txt. I want to read this into R, so I do:
>>
>>> stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')
>> When I want to load it as part of the removeWords function in tm, I do:
>>
>>> docs <- tm_map(docs, removeWords, stopDict)
>> which has no effect. Neither does:
>>
>>> docs <- tm_map(docs, removeWords, c(stopDict))
>> What am I not seeing/doing?
>>
>> How do I pass a text file with pre-defined terms to the removeWords
>> transform of tm?
>>
>> Thanks for any ideas.
>>
>> Cheers
>>
>> Sun
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.


