[R] tm package: problem of TermDocumentMatrix and minWordLength

C.H. chainsawtiney at gmail.com
Wed May 16 13:22:03 CEST 2012


Dear All,

The following code illustrates the problem.

[R code]
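library(tm)
# Two short documents; "R" and "is" are the short terms of interest
exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc), encoding = "UTF-8")
# Ask for every term of length >= 1 to be kept
dtm <- DocumentTermMatrix(examplecorpus, control = list(minWordLength = 1))
# Inspect the resulting document-term counts
as.matrix(dtm)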
[/R code]

The terms "R" and "is" were not included in the dtm, even though the
control parameter minWordLength was set to 1. This is the output I get:

    Terms
Docs good really
   1    1      0
   2    1      1
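
For comparison, here is a rough base-R sketch of the counts I expected to
see. It does not use tm at all, and it assumes that terms are simply
whitespace-separated tokens with case preserved, which may not match how
tm tokenizes. The names tokens, terms and expected are only for
illustration.

[R code]
# Base-R sketch (no tm): count whitespace-separated tokens per document
exampledoc <- c("R is good", "R is really good")
tokens <- strsplit(exampledoc, " ", fixed = TRUE)
# Vocabulary across both documents (sort order depends on the locale)
terms <- sort(unique(unlist(tokens)))
# One row per document, one column per term
expected <- t(sapply(tokens, function(tok) table(factor(tok, levels = terms))))
dimnames(expected) <- list(Docs = seq_along(exampledoc), Terms = terms)
expected
[/R code]

In this matrix "R" and "is" each appear once in both documents, which is
what I thought minWordLength = 1 would give me.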

Can anyone reproduce this problem?

The following is my sessionInfo():

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tm_0.5-7.1

loaded via a namespace (and not attached):
[1] compiler_2.15.0 slam_0.1-23     tools_2.15.0

Regards,

CH


