[RsR] Robust estimation of word frequencies
Prof. Dr. Matthias Kohl
Matthias.Kohl at stamats.de
Thu Jul 9 20:02:11 CEST 2015
Dear Serge,
You could try my package RobLox; e.g.,
library(RobLox)
x <- rnorm(20)   # toy example: 20 standard normal observations
est <- roblox(x) # optimally robust estimates of mean and sd
est
confint(est)     # confidence intervals for both estimates
But the idea behind these robust methods is a normal location and scale
model surrounded by a neighborhood (Tukey's gross error model). That is,
your data should stem from a distribution
Q = (1 - eps) * N(mean, sd) + eps * H
where H is an arbitrary (unknown) probability measure (possibly a Dirac
measure at some point). However, you can only get reasonable estimates
of mean and sd if eps < 0.5, i.e. if less than half of the data are
contaminated.
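For illustration, a small simulation from this model (the contamination
rate eps = 0.1 and the Dirac point 10 are arbitrary choices, not from
your data):
library(RobLox)
set.seed(123)
eps <- 0.1                                # 10% contamination
ind <- rbinom(100, size = 1, prob = eps)  # 1 = drawn from H
x <- ifelse(ind == 1, 10, rnorm(100))     # H = Dirac measure at 10
roblox(x, eps = eps)                      # stays close to mean 0, sd 1
c(mean(x), sd(x))                         # classical estimates, pulled up by the outliers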
Therefore, I would propose that you try a different (robust) model. In
particular, I would say that the 0 values in your case are not really
outliers in the above sense and should be modelled as well.
For other robust parametric models (e.g. Binomial or Poisson), you
could try our packages ROptEst (see http://www.stamats.de/RRlong.pdf,
http://arxiv.org/abs/0901.3531) and distrMod (see
http://www.jstatsoft.org/v35/i10/).
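As a minimal sketch of how this could look for a Poisson model (the
data and the contamination bound eps.upper = 0.05 are made up for
illustration):
library(ROptEst)             # also loads distrMod
set.seed(123)
y <- rpois(100, lambda = 2)
y[1:5] <- 50                 # a few gross frequency spikes
fit <- roptest(y, L2Fam = PoisFamily(lambda = 1), eps.upper = 0.05)
estimate(fit)                # robust estimate of lambda
confint(fit)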
Best,
Matthias
On 07.07.2015 at 14:06, Serge Sharoff wrote:
> Hello,
>
> My question is about applying robust location and scale estimates to
> word frequencies. Some words are prone to frequency spikes in a small
> number of documents; there was a paper showing that if the probability
> of seeing a word like 'gastric' in a document is p, then the
> probability of seeing its second occurrence in a document is close to
> p/2 rather than the expected p^2, so traditional stats overestimate
> the frequency of such words.
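> To make the size of the effect concrete (a worked example with an
> arbitrary p, not a figure from that paper): for p = 0.01,
>
> p <- 0.01
> p^2    # 1e-04: predicted by independent occurrences
> p / 2  # 5e-03: the observed rate, a factor 1/(2*p) = 50 larger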
>
> I want to experiment with robust statistics on word frequency lists,
> but I run into the problem that most words do not occur in most of the
> documents, so their medians and MADs are zero. In my sample dataset
> the only word with a non-zero median frequency is correct (as an
> adjective). Here are some examples:
>
>> load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
>> summary(vocab)
> correct_J correct_V gastric_J moon_N
> Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
> 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00
> Median : 33.30 Median : 0.00 Median : 0.000 Median : 0.00
> Mean : 79.81 Mean : 33.94 Mean : 5.766 Mean : 27.75
> 3rd Qu.: 91.09 3rd Qu.: 27.96 3rd Qu.: 0.000 3rd Qu.: 18.55
> Max. :3105.59 Max. :3897.79 Max. :2957.449 Max. :4143.39
> moon_V thoroughly_R toothbrush_N
> Min. : 0.0000 Min. : 0.00 Min. : 0.000
> 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
> Median : 0.0000 Median : 0.00 Median : 0.000
> Mean : 0.3154 Mean : 28.64 Mean : 2.894
> 3rd Qu.: 0.0000 3rd Qu.: 30.02 3rd Qu.: 0.000
> Max. :79.3730 Max. :2028.40 Max. :1046.025
>
> The rows correspond to a measure of word frequencies in each document in
> a collection.
>
> This mailing list had a couple of suggestions on a similar topic:
> https://stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
> https://stat.ethz.ch/pipermail/r-sig-robust/2011/000318.html
> suggesting the use of huberM, but the huberM estimates for this dataset
> are still zero, and
>
> huberM(a, s = mean(abs(a - median(a))))
>
> doesn't help either: the medians are zero, so s reduces to mean(a).
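> A small demonstration of both failure modes (hypothetical data with the
> same mostly-zero structure):
>
> library(robustbase)
> a <- c(rep(0, 95), 100, 200, 300, 400, 500)   # non-zero in only 5% of documents
> huberM(a)$mu                                  # 0: mu starts at median(a) = 0 and s = mad(a) = 0
> huberM(a, s = mean(abs(a - median(a))))$mu    # here s = mean(a); the estimate stays close to 0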
>
> Trimmed means definitely help, but we need a principled way to choose
> the amount of trimming: gastric occurs in just 65 documents out of
> 4054, and moon as a verb in only 32, so a very low trimming threshold
> is needed to avoid ignoring such words altogether.
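> For instance (illustrative trim levels, assuming vocab is loaded as
> above; mean(x, trim) removes the given fraction from each tail):
>
> mean(vocab$gastric_J, trim = 0.01)  # 40 values cut per tail: 25 of the 65 non-zero values survive
> mean(vocab$gastric_J, trim = 0.05)  # ~202 cut per tail: all non-zero values removed, result is 0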
>
> Does robustbase offer a more principled way to estimate location and
> its confidence intervals in such cases?
>
> Best,
> Serge
--
Prof. Dr. Matthias Kohl
www.stamats.de