[RsR] Robust estimation of word frequencies

Tue Jul 7 14:06:43 CEST 2015

Hello,

My question is about applying robust location and scale estimates to 
word frequencies.  Some words are prone to frequency spikes in a small 
number of documents: there was a paper showing that if the probability 
of seeing a word like 'gastric' in a document is p, then the 
probability of seeing its second occurrence in a document is close to 
p/2 rather than the expected p^2, so traditional statistics 
overestimate the frequency of such words.
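
To give a sense of the gap between the two figures, here is the 
arithmetic with p taken as the share of documents containing /gastric/ 
in my sample (65 out of 4054, see further down).  Using that share as p 
is only an assumption for illustration, not a number from the paper.

```r
# Assumption: p is approximated by the document frequency of 'gastric'
p <- 65 / 4054

p_squared <- p^2    # figure expected under independence
p_half    <- p / 2  # figure closer to what the paper reports

# How many times larger p/2 is than p^2 (algebraically 1 / (2 * p))
ratio <- p_half / p_squared

print(c(p = p, p_squared = p_squared, p_half = p_half, ratio = ratio))
```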

I want to experiment with robust statistics on word frequency lists, but 
I run into the problem that most words do not occur in most of the 
documents, so their medians and MADs are zero.  In my sample dataset the 
only word with a non-zero median frequency is the word /correct/ (as an 
adjective). Here are some examples:

> load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
> summary(vocab)
    correct_J         correct_V         gastric_J            moon_N
  Min.   :   0.00   Min.   :   0.00   Min.   :   0.000   Min.   :   0.00
  1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   0.000   1st Qu.:   0.00
  Median :  33.30   Median :   0.00   Median :   0.000   Median :   0.00
  Mean   :  79.81   Mean   :  33.94   Mean   :   5.766   Mean   :  27.75
  3rd Qu.:  91.09   3rd Qu.:  27.96   3rd Qu.:   0.000   3rd Qu.:  18.55
  Max.   :3105.59   Max.   :3897.79   Max.   :2957.449   Max.   :4143.39
      moon_V         thoroughly_R      toothbrush_N
  Min.   : 0.0000   Min.   :   0.00   Min.   :   0.000
  1st Qu.: 0.0000   1st Qu.:   0.00   1st Qu.:   0.000
  Median : 0.0000   Median :   0.00   Median :   0.000
  Mean   : 0.3154   Mean   :  28.64   Mean   :   2.894
  3rd Qu.: 0.0000   3rd Qu.:  30.02   3rd Qu.:   0.000
  Max.   :79.3730   Max.   :2028.40   Max.   :1046.025

Each row holds a measure of the word frequencies in one document of the 
collection.
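
A minimal sketch of why both the median and the MAD collapse, using a 
made-up vector rather than the real data: only the counts (65 non-zero 
documents out of 4054) follow my sample, and the non-zero values 
themselves are invented.

```r
# Hypothetical zero-inflated frequency vector (values are invented;
# only the 65-out-of-4054 proportion mirrors the sample dataset)
n_docs    <- 4054
n_nonzero <- 65
a <- c(rep(0, n_docs - n_nonzero), seq(10, by = 5, length.out = n_nonzero))

# More than half of the values are zero, so the median is zero, and so is
# the median of the absolute deviations from it (the MAD)
stats <- c(median = median(a), mad = mad(a), mean = mean(a),
           share_nonzero = mean(a > 0))
print(stats)
```

Any word present in fewer than half of the documents behaves this way, 
whatever its frequency in the documents where it does occur.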

This mailing list had a couple of suggestions on a similar topic:
https://stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
https://stat.ethz.ch/pipermail/r-sig-robust/2011/000318.html
Both suggest using huberM, but the huberM estimates for this dataset 
are still zero, while the use of

huberM(a, s = mean(abs(a - median(a))))

doesn't help either: since the medians are zero and the frequencies are 
non-negative, s simply reduces to mean(a).
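
The reduction of s to mean(a) can be checked on a toy vector (invented 
values, same 65-out-of-4054 proportion as above).  The huberM call at 
the end is optional and only runs if robustbase is installed; its 
default scale is mad(a), which is zero for this kind of data.

```r
# Hypothetical zero-inflated vector; the non-zero values are invented
a <- c(rep(0, 4054 - 65), seq(10, by = 5, length.out = 65))

# Mean absolute deviation around the median
s_alt <- mean(abs(a - median(a)))

# With median(a) == 0 and a >= 0 this is just the plain mean
same_as_mean <- isTRUE(all.equal(s_alt, mean(a)))
print(c(s_alt = s_alt, mean = mean(a)))

# Optional: the default huberM fit, whose scale defaults to mad(a)
if (requireNamespace("robustbase", quietly = TRUE)) {
  est <- robustbase::huberM(a)
  print(c(mu = est$mu, s = est$s))
}
```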

Trimmed means definitely help, but we need a principled way to choose 
the amount of trimming: /gastric/ occurs in just 65 documents out of 
4054 and /moon/ as a verb in only 32, which calls for a very low 
trimming threshold to avoid ignoring such words altogether.
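
To make the trimming problem concrete, here is a small sketch with 
base R's mean(x, trim = ...), again on invented values with 65 non-zero 
entries out of 4054.  The trim fractions tried are arbitrary choices of 
mine.

```r
# Hypothetical data: 65 non-zero documents out of 4054 (values invented)
n_docs    <- 4054
n_nonzero <- 65
a <- c(rep(0, n_docs - n_nonzero), seq(10, by = 5, length.out = n_nonzero))

# mean(x, trim = t) drops floor(n * t) observations from EACH end
trims <- c(0.05, 0.02, 0.01, 0.005)
tm <- sapply(trims, function(tr) mean(a, trim = tr))
names(tm) <- trims
print(tm)

# Number of top observations removed at each trim level; once this
# reaches n_nonzero, every non-zero document has been trimmed away
dropped_top <- floor(n_docs * trims)
print(dropped_top)

# Any trim fraction at or above this share wipes out the word entirely
max_share <- n_nonzero / n_docs
print(max_share)
```

So for a word like this the trimming fraction has to stay below the 
share of documents that contain it, which is different for every word.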

Does robustbase offer a more principled way to estimate location and 
its confidence intervals in such cases?

Best,
Serge
