[R] tm package: problem of TermDocumentMatrix and minWordLength
C.H.
chainsawtiney at gmail.com
Wed May 16 13:22:03 CEST 2012
Dear All,
The following code illustrates the problem.
[R code]
require(tm)
exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc), encoding = "UTF-8")
dtm <- DocumentTermMatrix(examplecorpus, control = list(minWordLength = 1))
as.matrix(dtm)
[/R code]
The terms "R" and "is" were not included in the dtm, even though the
control parameter minWordLength was set to 1:
    Terms
Docs good really
   1    1      0
   2    1      1
Can anyone reproduce this problem?
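One possible explanation, stated as an assumption rather than something confirmed in this post: tm releases around this version may read the term-length bounds from a `wordLengths` control entry (a lower and upper bound, with a default lower bound of 3) instead of `minWordLength`, in which case `minWordLength = 1` would be silently ignored. A minimal sketch under that assumption follows; the name `dtm_all` is only illustrative, and the `encoding` argument is left out to keep the example small.

```r
library(tm)

exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc))

# Assumption: this tm version takes term-length bounds from `wordLengths`
# (lower bound, upper bound) rather than from `minWordLength`.
dtm_all <- DocumentTermMatrix(examplecorpus,
                              control = list(wordLengths = c(1, Inf)))

# List the terms that were kept; case may differ if tm lower-cases terms.
colnames(as.matrix(dtm_all))
```

If the short terms show up in that listing, the option name rather than the matrix construction would be the cause.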
The following is my sessionInfo():
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.5-7.1
loaded via a namespace (and not attached):
[1] compiler_2.15.0 slam_0.1-23 tools_2.15.0
Regards,
CH