[R] findFreqTerms vs minDocFreq in Package 'tm'
Bettina.Gruen at jku.at
Mon Sep 12 15:13:52 CEST 2011
On 09/12/2011 04:28 PM, vioravis wrote:
> I am using the 'tm' package for text mining and am having trouble finding the
> frequently occurring terms. From the definitions it appears that findFreqTerms
> and minDocFreq are equivalent commands and that both try to identify the
> documents with terms appearing more than a specified threshold. However, I
> am getting drastically different results with both. I have given the results
> from both the commands below:
> findFreqTerms identifies 3140 words that appear more than 5 times but
> minDocFreq identifies only 659 terms. Can someone please explain the reason
> for the difference, or whether I have misunderstood their definitions?
From the help page of termFreq:
‘minDocFreq’ An integer value. Words that appear less often
in ‘doc’ than this number are discarded. Defaults to ‘1’
(i.e., every token will be used).
The description for findFreqTerms states:
Find frequent terms in a term-document matrix.
So minDocFreq looks at how often a word appears within a single document in order to decide whether it should be included in the frequency vector of words for that document.
By contrast, findFreqTerms works on the whole document-term matrix and determines how often the word occurs in the matrix. So in fact the whole corpus is used to determine the frequency and to decide whether the word should be included or not.
Because one uses the frequency of words within a single document, while the other uses the frequency of words in the whole document-term matrix, they are not equivalent commands. Your results indicate that 3140 words occur at least 5 times in the whole corpus, i.e., when summing over all documents. By contrast, only 659 words occur at least 5 times within at least one single document.
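
To make the difference between the two thresholds concrete, here is a minimal sketch in plain Python. It does not use 'tm' and is not how the package is implemented; the toy corpus, the threshold of 3, and all variable names are hypothetical and chosen only for illustration. It applies the same threshold once per document and once to the counts summed over all documents.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["data", "data", "data", "model", "text"],
    ["data", "model", "model", "text"],
    ["text", "model", "plot"],
]
threshold = 3  # illustrative value, not taken from the original post

# One frequency table per document.
per_doc = [Counter(doc) for doc in docs]

# Per-document rule (the idea behind a minDocFreq-style filter):
# a term survives in a document only if it reaches the threshold
# within that document. Collect every term that survives somewhere.
kept_per_doc = [
    {term for term, n in counts.items() if n >= threshold}
    for counts in per_doc
]
terms_per_document_rule = set().union(*kept_per_doc)

# Corpus-wide rule (the idea behind a findFreqTerms-style lookup):
# sum the counts over all documents, then apply the threshold once.
corpus_counts = sum(per_doc, Counter())
terms_corpus_rule = {
    term for term, n in corpus_counts.items() if n >= threshold
}

print(sorted(terms_per_document_rule))  # ['data']
print(sorted(terms_corpus_rule))        # ['data', 'model', 'text']
```

In this toy corpus "model" and "text" never reach the threshold inside any single document, but they do once the counts are summed. Every term that passes the per-document rule also passes the corpus-wide rule, so the per-document set can only be the smaller one, which matches 659 versus 3140 above.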
Institut für Angewandte Statistik / IFAS
Johannes Kepler Universität Linz
4040 Linz, Austria
Tel: +43 732 2468-6829
Fax: +43 732 2468-6800
E-Mail: Bettina.Gruen at jku.at