[RsR] Robust estimation of word frequencies
Serge Sharoff
@@@h@ro|| @end|ng |rom |eed@@@c@uk
Tue Jul 7 14:06:43 CEST 2015
Hello,
My question is about applying robust location and scale estimates to
word frequencies. Some words are prone to frequency spikes in a small
number of documents; there was a paper showing that if the probability
of seeing a word like 'gastric' in a document is p, then the
probability of seeing a second occurrence in the same document is close
to p/2 rather than the expected p^2, so traditional statistics
overestimate the frequency of such words.
I want to experiment with robust statistics on word frequency lists, but
here I run into the problem that most words do not occur in most of the
documents, so their medians and MADs are zero. In my sample dataset the
only word with a non-zero median frequency is /correct/ (as an
adjective). Here are some examples:
> load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
> summary(vocab)
   correct_J          correct_V          gastric_J            moon_N
 Min.   :   0.00   Min.   :   0.00   Min.   :   0.000   Min.   :   0.00
 1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   0.000   1st Qu.:   0.00
 Median :  33.30   Median :   0.00   Median :   0.000   Median :   0.00
 Mean   :  79.81   Mean   :  33.94   Mean   :   5.766   Mean   :  27.75
 3rd Qu.:  91.09   3rd Qu.:  27.96   3rd Qu.:   0.000   3rd Qu.:  18.55
 Max.   :3105.59   Max.   :3897.79   Max.   :2957.449   Max.   :4143.39
     moon_V         thoroughly_R       toothbrush_N
 Min.   : 0.0000   Min.   :   0.00   Min.   :   0.000
 1st Qu.: 0.0000   1st Qu.:   0.00   1st Qu.:   0.000
 Median : 0.0000   Median :   0.00   Median :   0.000
 Mean   : 0.3154   Mean   :  28.64   Mean   :   2.894
 3rd Qu.: 0.0000   3rd Qu.:  30.02   3rd Qu.:   0.000
 Max.   :79.3730   Max.   :2028.40   Max.   :1046.025
Each row corresponds to one document in the collection, and each value
is a measure of the frequency of that word in that document.
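
To illustrate why the medians and MADs collapse to zero, here is a minimal sketch on synthetic data. The vector `freq` and the exponential distribution of the non-zero values are assumptions made purely for illustration; only the counts (65 non-zero documents out of 4054) are taken from this post.

```r
# Synthetic, hypothetical frequencies: 4054 documents, word present in 65.
set.seed(1)
n_docs    <- 4054
n_nonzero <- 65
freq <- numeric(n_docs)                       # all zeros to start with
freq[sample(n_docs, n_nonzero)] <- rexp(n_nonzero, rate = 1 / 300)

median(freq)  # 0: more than half of the values are zero
mad(freq)     # 0: the median absolute deviation from 0 is also 0
mean(freq)    # positive, driven entirely by the 65 non-zero documents
```

Any location or scale estimate that starts from the median and the MAD therefore starts from zero for such a word.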
Two earlier threads on this mailing list discussed a similar topic:
https://stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
https://stat.ethz.ch/pipermail/r-sig-robust/2011/000318.html
They suggest using huberM, but the huberM estimates for this dataset
are still zero. Using
huberM(a, s = mean(abs(a - median(a))))
does not help either: since the medians are zero and the frequencies are
non-negative, s simply reduces to mean(a).
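
The reduction of the scale to mean(a) can be checked with a short sketch. The vector `a` below is synthetic and hypothetical, not the real data; the call to huberM is guarded so the sketch also runs without robustbase installed.

```r
# Synthetic, hypothetical non-negative frequencies with a zero median.
set.seed(2)
a <- numeric(4054)
a[sample(length(a), 65)] <- rexp(65, rate = 1 / 300)

# Mean absolute deviation from the median, as in the huberM call above.
s_alt <- mean(abs(a - median(a)))

# With median(a) == 0 and a >= 0, abs(a - 0) is just a, so s_alt is mean(a).
all.equal(s_alt, mean(a))

# Optional: pass this scale to huberM if robustbase is available.
if (requireNamespace("robustbase", quietly = TRUE)) {
  print(robustbase::huberM(a, s = s_alt)$mu)
}
```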
Trimmed means definitely help, but we need a principled way to choose
the amount of trimming: /gastric/ occurs in just 65 documents out of
4054, and /moon/ as a verb in only 32, so a very low trimming threshold
is needed to avoid ignoring such words altogether.
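
The sensitivity to the trimming fraction can be sketched with base R's mean(x, trim = ...), which drops that fraction of observations from each end of the sorted data. The helper `make_word` and the simulated values are hypothetical; only the document counts come from this post.

```r
# Synthetic, hypothetical data with the same sparsity as the two words above.
set.seed(3)
n_docs <- 4054
make_word <- function(n_nonzero) {
  x <- numeric(n_docs)
  x[sample(n_docs, n_nonzero)] <- rexp(n_nonzero, rate = 1 / 300)
  x
}
gastric_like <- make_word(65)   # non-zero in 65 documents
moon_v_like  <- make_word(32)   # non-zero in 32 documents

65 / n_docs   # about 0.016: share of documents containing the word
32 / n_docs   # about 0.0079

# floor(4054 * 0.02) = 81 values are dropped from each end, which removes
# all 65 non-zero observations, so the trimmed mean is exactly 0.
mean(gastric_like, trim = 0.02)

# floor(4054 * 0.005) = 20 values are dropped from each end, so 45 non-zero
# observations survive and the trimmed mean is positive.
mean(gastric_like, trim = 0.005)

# For the rarer word even trim = 0.01 (40 values per end) removes everything.
mean(moon_v_like, trim = 0.01)
mean(moon_v_like, trim = 0.005)  # 12 non-zero observations survive
```

In this sketch the trimmed mean is non-zero only while the trimming fraction stays below the share of documents that contain the word, which is why a single threshold for all words is hard to justify.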
Does robustbase offer a more principled way to estimate location, and
confidence intervals for it, in such cases?
Best,
Serge