[R] Using a text file as a removeWord dictionary in tm_map
Sun Shine
phaedrusv at gmail.com
Mon Mar 2 08:36:50 CET 2015
Thanks Jim.
I thought that I was passing a vector, not realising I had converted
this to a list object.
I haven't come across the scan() function so far, so this is good to know.
Good explanation - I'll give this a go when I can get back to that piece
of work later today.
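
For the archive, a minimal sketch of how scan() might be applied to the original stopList.txt problem. The temporary file and its three terms are made up for illustration (standing in for the real stopList.txt), and the last step assumes the tm package is installed and that removeWords accepts a plain character vector of words:

```r
# Hypothetical stand-in for stopList.txt: quoted, comma-separated terms
stop_file <- tempfile(fileext = ".txt")
writeLines('"although","this","query"', stop_file)

# what = "" asks scan() for character data; with sep = "," the surrounding
# double quotes are treated as quoting characters and dropped
stop_words <- scan(stop_file, what = "", sep = ",", quiet = TRUE)

# Always check what was created: this should be a character vector,
# not a data.frame (read.table) or a list (c() applied to a data.frame)
str(stop_words)
is_vector <- is.character(stop_words) && !is.list(stop_words)

# Only attempted when tm is available; removeWords is applied here to a
# single string rather than to a full corpus
if (requireNamespace("tm", quietly = TRUE)) {
  cleaned <- tm::removeWords("although this query applies to tm", stop_words)
  print(cleaned)
}
```

With a real corpus the same vector would go where stopDict was used before, i.e. tm_map(docs, removeWords, stop_words).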
Thanks again.
Regards,
Sun
On 01/03/15 21:13, jim holtman wrote:
> The 'read.table' was creating a data.frame (not a vector) and applying
> 'c' to it converted it to a list. You should always look at the object
> you are creating. You probably want to use 'scan'.
>
> ======================
>> testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
>> # reading with read.table creates a data.frame
>> df_words <- read.table(text = testFile, sep = ',')
>> df_words # not a vector
>         V1   V2    V3      V4           V5 V6  V7 V8      V9
> 1 Although this query applies specifically to the tm package
>> c(df_words) # this results in a list
> $V1
> [1] Although
> Levels: Although
> $V2
> [1] this
> Levels: this
> $V3
> [1] query
> Levels: query
> $V4
> [1] applies
> Levels: applies
> $V5
> [1] specifically
> Levels: specifically
> $V6
> [1] to
> Levels: to
> $V7
> [1] the
> Levels: the
> $V8
> [1] tm
> Levels: tm
> $V9
> [1] package
> Levels: package
>> # now read with 'scan'
>> scan_words <- scan(text = testFile, what = '', sep = ',')
> Read 9 items
>> scan_words
> [1] "Although"     "this"         "query"        "applies"      "specifically" "to"
> [7] "the"          "tm"           "package"
>>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedrusv at gmail.com> wrote:
>> Hi list
>>
>> Although this query applies specifically to the tm package, perhaps it's
>> something that others might be able to lend a thought to.
>>
>> Using tm to do some initial text mining, I want to include an external (to
>> R) generated dictionary of words that I want removed from the corpus.
>>
>> I have created a comma separated list of terms in " " marks in a
>> stopList.txt plain UTF-8 file. I want to read this into R, so do:
>>
>>> stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')
>> When I want to load it as part of the removeWords function in tm, I do:
>>
>>> docs <- tm_map(docs, removeWords, stopDict)
>> which has no effect. Neither does:
>>
>>> docs <- tm_map(docs, removeWords, c(stopDict))
>> What am I not seeing/ doing?
>>
>> How do I pass a text file with pre-defined terms to the removeWords
>> transform of tm?
>>
>> Thanks for any ideas.
>>
>> Cheers
>>
>> Sun
>>