[RsR] Robust estimation of word frequencies
Prof. Dr. Matthias Kohl
Matthias.Kohl at stamats.de
Thu Jul 9 20:02:11 CEST 2015
Dear Serge,
You could try my package RobLox; e.g.,
library(RobLox)
x <- rnorm(20)   # toy example: 20 standard normal observations
est <- roblox(x) # optimally robust estimates of mean and sd
est
confint(est)     # confidence intervals for both estimates
But the idea behind these robust methods is a normal location and scale
model surrounded by a neighborhood (Tukey's gross error model). That is,
your data should stem from a distribution
Q = (1 - eps) * N(mean, sd) + eps * H
where H is an arbitrary (unknown) probability measure (possibly a Dirac
measure at some point). However, you can only get reasonable estimates
of mean and sd if eps < 0.5, i.e. if less than half of the data are
contaminated.
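For illustration, a small simulation from this model (the contamination
rate eps = 0.1 and the Dirac point 10 are arbitrary choices, not from
your data):
library(RobLox)
set.seed(123)
eps <- 0.1                                # 10% contamination
ind <- rbinom(100, size = 1, prob = eps)  # 1 = drawn from H
x <- ifelse(ind == 1, 10, rnorm(100))     # H = Dirac measure at 10
roblox(x, eps = eps)                      # stays close to mean 0, sd 1
c(mean(x), sd(x))                         # classical estimates, pulled up by the outliers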
Therefore, I would propose that you try a different (robust) model. In
particular, I would say that the 0 values in your case are not really
outliers in the above sense and should be modelled as well.
For other robust parametric models (e.g. Binomial or Poisson), you
could try our packages ROptEst (see http://www.stamats.de/RRlong.pdf,
http://arxiv.org/abs/0901.3531) and distrMod (see
http://www.jstatsoft.org/v35/i10/).
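As a minimal sketch of how this could look for a Poisson model (the
data and the contamination bound eps.upper = 0.05 are made up for
illustration):
library(ROptEst)             # also loads distrMod
set.seed(123)
y <- rpois(100, lambda = 2)
y[1:5] <- 50                 # a few gross frequency spikes
fit <- roptest(y, L2Fam = PoisFamily(lambda = 1), eps.upper = 0.05)
estimate(fit)                # robust estimate of lambda
confint(fit)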
Best,
Matthias
On 07.07.2015 at 14:06, Serge Sharoff wrote:
> Hello,
>
> My question is about applying robust location and scale estimates to
> word frequencies. Some words are prone to frequency spikes in a small
> number of documents; there was a paper showing that if the probability
> of seeing a word like 'gastric' in a document is p, then the
> probability of seeing its second occurrence in a document is close to
> p/2 rather than the expected p^2, so traditional stats overestimate
> the frequency of such words.
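> To make the size of the effect concrete (a worked example with an
> arbitrary p, not a figure from that paper): for p = 0.01,
>
> p <- 0.01
> p^2    # 1e-04: predicted by independent occurrences
> p / 2  # 5e-03: the observed rate, a factor 1/(2*p) = 50 larger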
>
> I want to experiment with robust statistics on word frequency lists,
> but I run into the problem that most words do not occur in most of the
> documents, so their medians and MADs are zero. In my sample dataset
> the only word with a non-zero median frequency is correct (as an
> adjective). Here are some examples:
>
>> load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
>> summary(vocab)
> correct_J correct_V gastric_J moon_N
> Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
> 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
> Median : 33.30 Median : 0.00 Median : 0.000 Median : 0.00
> Mean : 79.81 Mean : 33.94 Mean : 5.766 Mean : 27.75
> 3rd Qu.: 91.09 3rd Qu.: 27.96 3rd Qu.: 0.000 3rd Qu.: 18.55
> Max. :3105.59 Max. :3897.79 Max. :2957.449 Max. :4143.39
> moon_V thoroughly_R toothbrush_N
> Min. : 0.0000 Min. : 0.00 Min. : 0.000
> 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
> Median : 0.0000 Median : 0.00 Median : 0.000
> Mean : 0.3154 Mean : 28.64 Mean : 2.894
> 3rd Qu.: 0.0000 3rd Qu.: 30.02 3rd Qu.: 0.000
> Max. :79.3730 Max. :2028.40 Max. :1046.025
>
> The rows correspond to a measure of word frequencies in each document in
> a collection.
>
> This mailing list had a couple of suggestions on a similar topic:
> https://stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
> https://stat.ethz.ch/pipermail/r-sig-robust/2011/000318.html
> suggesting the use of huberM, but the huberM estimates for this dataset
> are still zero, and
>
> huberM(a, s = mean(abs(a - median(a))))
>
> doesn't help either: the medians are zero, so s reduces to mean(a).
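> A small demonstration of both failure modes (hypothetical data with the
> same mostly-zero structure):
>
> library(robustbase)
> a <- c(rep(0, 95), 100, 200, 300, 400, 500)   # non-zero in only 5% of documents
> huberM(a)$mu                                  # 0: mu starts at median(a) = 0 and s = mad(a) = 0
> huberM(a, s = mean(abs(a - median(a))))$mu    # here s = mean(a); the estimate stays close to 0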
>
> Trimmed means definitely help, but we need a principled way to choose
> the amount of trimming: gastric occurs in just 65 documents out of
> 4054, and moon as a verb in only 32, so a very low trimming threshold
> is needed to avoid ignoring such words altogether.
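> For instance (illustrative trim levels, assuming vocab is loaded as
> above; mean(x, trim) removes the given fraction from each tail):
>
> mean(vocab$gastric_J, trim = 0.01)  # 40 values cut per tail: 25 of the 65 non-zero values survive
> mean(vocab$gastric_J, trim = 0.05)  # ~202 cut per tail: all non-zero values removed, result is 0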
>
> Does robustbase offer a more principled way to estimate location and
> its confidence intervals in such cases?
>
> Best,
> Serge
--
Prof. Dr. Matthias Kohl
www.stamats.de