[RsR] Robust estimation of word frequencies
Serge Sharoff
@@@h@ro|| @end|ng |rom |eed@@@c@uk
Fri Jul 10 13:01:01 CEST 2015
Thanks Matthias.
So far I have had no time to study the ideas behind your packages; I just
applied them as a black box. RobLox predictably complains about mad = 0 and
reports confidence intervals for the mean and sd as:
> confint(roblox(vocab$gastric_J))
A[n] asymptotic (LAN-based) confidence interval:
             2.5 %        97.5 %
mean -6.796098e-07  6.796510e-07
sd    1.503699e-05  1.604802e-05
> confint(roblox(vocab$toothbrush_N))
A[n] asymptotic (LAN-based) confidence interval:
             2.5 %        97.5 %
mean -9.553116e-07  9.554188e-07
sd    2.113769e-05  2.255891e-05
Neither 'gastric' nor 'toothbrush' deserves a place in English ;-)
Probably I wasn't clear in my first message. The zeros are not the outliers.
However, from the viewpoint of every robust method I have tried, the word
frequencies are dominated by zeros, so the robust estimate of frequency
becomes zero (except for the top 2,000-3,000 most common words).
I can formulate my problem as follows, using 'gastric' and 'toothbrush' as
examples, but this applies to all other words. Most words don't occur in most
documents. A word like 'gastric' occurs in only 65 texts, 'toothbrush' in 123
(the collection in question is the BNC, a representative sample of British
English), i.e., I have thousands of data points suggesting their probability
is zero.
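Incidentally, the occurrence probability itself can be pinned down quite
precisely from these counts alone; a minimal sketch with base R's binom.test,
using the 65-out-of-4054 figures above:

```r
## Exact binomial confidence interval for the per-document occurrence
## probability of 'gastric': 65 occurrences out of 4054 BNC documents
binom.test(65, 4054)$conf.int
```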
However, of the 65 texts in which 'gastric' does appear, 55 show a "normal"
frequency:
> summary(v.g[v.g<100])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  5.243  20.640  25.010  35.550  47.960  98.040
as well as 10 texts with clearly outlying frequency values:
> summary(v.g[v.g>100])
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  131.3   263.5  1021.0  1274.0  2123.0  2957.0
(the choice of 100 here is little more than a guess, based on looking at the
histogram and on testing v.g with huberM).
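A minimal sketch of that step, loading the frq-example data from my earlier
message and applying huberM from robustbase to the non-zero part only:

```r
## Robust location of the non-zero 'gastric' frequencies via huberM
## (robustbase assumed installed; data URL from my first message)
library(robustbase)
load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
v.g <- vocab$gastric_J
huberM(v.g[v.g > 0], k = 1.5)   # location and scale of the 65 non-zero values
```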
'toothbrush' has its outliers as well, but proportionally fewer. My intuition
is that 'gastric' is much more prone to frequency bursts, and we need to
exclude a larger number of outlying observations for 'gastric', so that the
estimate of its frequency count becomes lower than the one for 'toothbrush' in
this collection. The outlying observations should also have no effect on the
confidence intervals for the probabilities.
However, this analysis is based on discarding the thousands of zeros and
making inferences from a small non-zero subset. I'm not sure about the
statistical implications of this approach. What if the non-zero section
contains only outliers, e.g., just ten texts that repeat a spam word
thousands of times?
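One way to make the role of the zeros explicit is a hurdle-style estimate,
where the overall frequency is the occurrence probability times a robust mean
of the non-zero part; a sketch with the same data and caveats as above:

```r
## Hurdle-style sketch: expected frequency = P(occurs) * robust location
## of the non-zero frequencies (robustbase assumed installed)
library(robustbase)
load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
v.g   <- vocab$gastric_J
p.occ <- mean(v.g > 0)               # proportion of documents with the word
loc   <- huberM(v.g[v.g > 0])$mu     # robust location of the non-zero part
p.occ * loc                          # combined estimate
```

This doesn't solve the spam-word case by itself, but at least the zeros enter
the estimate explicitly rather than being discarded.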
The question about the underlying distribution is also very interesting. The
binomial (or Poisson) distribution is not appropriate for word frequencies,
because seeing a word once *increases* the chances of seeing it a second time
(a word is always indicative of either a topic or a genre). The probability
of seeing a word twice is closer to p/2 than to the p^2 expected under the
binomial model. Given that the probabilities for words like 'toothbrush' are
in the range of 1e-5, this is a very large difference (in my examples above
they were multiplied by 1e6 for presentational reasons).
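The magnitude of the discrepancy is easy to see numerically:

```r
## For p around 1e-5, the burst model's p/2 is several orders of
## magnitude larger than the independence model's p^2
p <- 1e-5
c(burst = p / 2, independent = p^2)   # 5e-06 versus 1e-10
```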
I'm not entirely sure about the best approximation. One possibility is to use
the negative binomial distribution to model the probability of the *first*
appearance of a word in a text. Once we know it's there, we can use another
model to approximate the probability of seeing it there N times. However,
even using the negative binomial for the first appearance is likely to be far
from a good fit because of the co-occurrence statistics, e.g., seeing words
like 'ulcer' or 'duodenal' increases the probability of seeing 'gastric'. I
will experiment with the distrMod package.
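As a classical baseline before trying the robust versions, a plain negative
binomial fit can be sketched with MASS::fitdistr (the data below are
simulated stand-ins, since fitdistr expects raw integer counts rather than
the scaled rates in vocab):

```r
## Classical (non-robust) negative binomial fit as a baseline
## (MASS ships with R; the counts here are simulated, not real BNC data)
library(MASS)
set.seed(1)
cnt <- rnbinom(4054, size = 0.05, mu = 0.5)   # hypothetical per-document counts
fitdistr(cnt, "negative binomial")            # estimates size and mu
```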
Serge
On 09/07/15 19:02, Prof. Dr. Matthias Kohl wrote:
> Dear Serge,
>
> you could try my package RobLox; e.g.
> library(RobLox)
> x <- rnorm(20)
> est <- roblox(x)
> est
> confint(est)
>
> But, the idea behind these robust methods is a normal location and scale
> model surrounded by a neighborhood (Tukey's gross error model). That is,
> your data should stem from some distribution
>
> Q = (1-eps) N(mean, sd) + eps H
>
> where H is an arbitrary (unknown) probability measure (possibly a Dirac
> measure at some point). However, you only can get reasonable estimates
> of mean and sd, if eps < 0.5.
> Therefore, I would propose that you try a different (robust) model. In
> particular, I would say that the 0 values in your case are not really
> outliers in the above sense and should be modelled as well.
>
> For some other robust parametric model (e.g. Binomial or Poisson) you
> could try our packages ROptEst (see http://www.stamats.de/RRlong.pdf,
> http://arxiv.org/abs/0901.3531) and distrMod (see
> http://www.jstatsoft.org/v35/i10/).
>
> Best,
> Matthias
>
>
> On 07.07.2015 at 14:06, Serge Sharoff wrote:
>> Hello,
>>
>> My question is about applying robust location and scale estimates to
>> word frequencies. Some words are prone to frequency spikes in a small
>> number of documents; there was a paper showing that if the probability
>> of seeing a word like 'gastric' in a document is p, then the
>> probability of seeing its second occurrence in a document is close to
>> p/2 rather than the expected p^2, so traditional stats overestimate
>> the frequency of such words.
>>
>> I want to experiment with robust statistics on word frequency lists, but
>> here I come across a problem that most words do not occur in most of the
>> documents, so that their medians and MADs are zero. In my sample
>> dataset the only word with a non-zero median frequency is the word
>> 'correct' (as an adjective). Here are some examples:
>>
>>> load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
>>> summary(vocab)
>>    correct_J          correct_V          gastric_J           moon_N
>> Min.   :   0.00   Min.   :   0.00   Min.   :   0.000   Min.   :   0.00
>> 1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   0.000   1st Qu.:   0.00
>> Median :  33.30   Median :   0.00   Median :   0.000   Median :   0.00
>> Mean   :  79.81   Mean   :  33.94   Mean   :   5.766   Mean   :  27.75
>> 3rd Qu.:  91.09   3rd Qu.:  27.96   3rd Qu.:   0.000   3rd Qu.:  18.55
>> Max.   :3105.59   Max.   :3897.79   Max.   :2957.449   Max.   :4143.39
>>      moon_V         thoroughly_R      toothbrush_N
>> Min.   : 0.0000   Min.   :   0.00   Min.   :   0.000
>> 1st Qu.: 0.0000   1st Qu.:   0.00   1st Qu.:   0.000
>> Median : 0.0000   Median :   0.00   Median :   0.000
>> Mean   : 0.3154   Mean   :  28.64   Mean   :   2.894
>> 3rd Qu.: 0.0000   3rd Qu.:  30.02   3rd Qu.:   0.000
>> Max.   :79.3730   Max.   :2028.40   Max.   :1046.025
>>
>> The rows correspond to a measure of word frequencies in each document in
>> a collection.
>>
>> This mailing list had a couple of suggestions on a similar topic:
>> https://stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
>> https://stat.ethz.ch/pipermail/r-sig-robust/2011/000318.html
>> suggesting the use of huberM, but the huberM estimates for this dataset
>> are still zeros, while the use of
>>
>> huberM(a,s=mean(abs(a - median(a))))
>>
>> doesn't help since the medians are zero, so s is reduced to mean(a).
>>
>> Trimmed means definitely help, but we need a principled way to estimate
>> the amount of trimming, since 'gastric' occurs in just 65 documents out
>> of 4054, and 'moon' as a verb in only 32, necessitating a very low
>> trimming threshold to avoid ignoring such words altogether.
>>
>> Does robustbase offer any more principled way for estimation of location
>> and its confidence intervals in such cases?
>>
>> Best,
>> Serge
>>
>>
>>
>> _______________________________________________
>> R-SIG-Robust using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>>
>