[RsR] Robust estimation of word frequencies
Serge Sharoff
@@@h@ro|| @end|ng |rom |eed@@@c@uk
Fri Jul 10 13:01:01 CEST 2015
Thanks Matthias.
So far I have had no time to study the ideas behind your packages; I just
applied them as a black box. RobLox predictably complains about mad = 0 and
reports confidence intervals for the mean and sd as:
> confint(roblox(vocab$gastric_J))
A[n] asymptotic (LAN-based) confidence interval:
             2.5 %        97.5 %
mean -6.796098e-07  6.796510e-07
sd    1.503699e-05  1.604802e-05
> confint(roblox(vocab$toothbrush_N))
A[n] asymptotic (LAN-based) confidence interval:
             2.5 %        97.5 %
mean -9.553116e-07  9.554188e-07
sd    2.113769e-05  2.255891e-05
Neither 'gastric' nor 'toothbrush' deserves a place in English ;-)
Probably I wasn't clear in my first message. The zeros are not the outliers.
However, from the viewpoint of every robust method I have tried, the word
frequencies are dominated by zeros, so the robust estimate of frequency
becomes zero (except for the top 2,000-3,000 most common words).
I can formulate my problem as follows, using 'gastric' and 'toothbrush' as
examples, but this applies to all other words. Most words don't occur in most
documents. A word like 'gastric' occurs in only 65 texts, 'toothbrush' in 123
(the collection in question is the BNC, a representative sample of British
English), i.e., I have thousands of data points suggesting their probability
is zero.
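Incidentally, the occurrence probability itself can be pinned down quite
precisely from these counts alone; a minimal sketch with base R's binom.test,
using the 65-out-of-4054 figures above:

```r
## Exact binomial confidence interval for the per-document occurrence
## probability of 'gastric': 65 occurrences out of 4054 BNC documents
binom.test(65, 4054)$conf.int
```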
However, of the 65 texts in which 'gastric' does appear, 55 show a "normal"
frequency:
> summary(v.g[v.g<100])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  5.243  20.640  25.010  35.550  47.960  98.040
as well as 10 texts with clearly outlying frequency values:
> summary(v.g[v.g>100])
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  131.3   263.5  1021.0  1274.0  2123.0  2957.0
(the choice of 100 here is little more than a guess, based on looking at the
histogram and on testing v.g with huberM).
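A minimal sketch of that step, loading the frq-example data from my earlier
message and applying huberM from robustbase to the non-zero part only:

```r
## Robust location of the non-zero 'gastric' frequencies via huberM
## (robustbase assumed installed; data URL from my first message)
library(robustbase)
load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
v.g <- vocab$gastric_J
huberM(v.g[v.g > 0], k = 1.5)   # location and scale of the 65 non-zero values
```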
'toothbrush' has its outliers as well, but proportionally fewer. My intuition
is that 'gastric' is much more prone to frequency bursts, and we need to
exclude a larger number of outlying observations for 'gastric', so that the
estimate of its frequency count becomes lower than the one for 'toothbrush' in
this collection. The outlying observations should also have no effect on the
confidence intervals for the probabilities.
However, this analysis is based on discarding the thousands of zeros and
making inferences from a small non-zero subset. I'm not sure about the
statistical implications of this approach. What if the non-zero section
contains only outliers, e.g., just ten texts that repeat a spam word
thousands of times?
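One way to make the role of the zeros explicit is a hurdle-style estimate,
where the overall frequency is the occurrence probability times a robust mean
of the non-zero part; a sketch with the same data and caveats as above:

```r
## Hurdle-style sketch: expected frequency = P(occurs) * robust location
## of the non-zero frequencies (robustbase assumed installed)
library(robustbase)
load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
v.g   <- vocab$gastric_J
p.occ <- mean(v.g > 0)               # proportion of documents with the word
loc   <- huberM(v.g[v.g > 0])$mu     # robust location of the non-zero part
p.occ * loc                          # combined estimate
```

This doesn't solve the spam-word case by itself, but at least the zeros enter
the estimate explicitly rather than being discarded.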
The question about the underlying distribution is also very interesting. The
binomial (or Poisson) distribution is not appropriate for word frequencies,
because seeing a word once *increases* the chances of seeing it a second time
(a word is always indicative of either a topic or a genre). The probability
of seeing a word twice is closer to p/2 than to the p^2 expected under the
binomial model. Given that the probabilities for words like 'toothbrush' are
in the range of 1e-5, this is a very large difference (in my examples above
they were multiplied by 1e6 for presentational reasons).
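The magnitude of the discrepancy is easy to see numerically:

```r
## For p around 1e-5, the burst model's p/2 is several orders of
## magnitude larger than the independence model's p^2
p <- 1e-5
c(burst = p / 2, independent = p^2)   # 5e-06 versus 1e-10
```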
I'm not entirely sure about the best approximation. One possibility is to use
the negative binomial distribution to model the probability of the *first*
appearance of a word in a text. Once we know it's there, we can use another
model to approximate the probability of seeing it there N times. However,
even using the negative binomial for the first appearance is likely to be far
from a good fit because of the co-occurrence statistics, e.g., seeing words
like 'ulcer' or 'duodenal' increases the probability of seeing 'gastric'. I
will experiment with the distrMod package.
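As a classical baseline before trying the robust versions, a plain negative
binomial fit can be sketched with MASS::fitdistr (the data below are
simulated stand-ins, since fitdistr expects raw integer counts rather than
the scaled rates in vocab):

```r
## Classical (non-robust) negative binomial fit as a baseline
## (MASS ships with R; the counts here are simulated, not real BNC data)
library(MASS)
set.seed(1)
cnt <- rnbinom(4054, size = 0.05, mu = 0.5)   # hypothetical per-document counts
fitdistr(cnt, "negative binomial")            # estimates size and mu
```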
Serge
On 09/07/15 19:02, Prof. Dr. Matthias Kohl wrote:
> Dear Serge,
>
> you could try my package RobLox; e.g.
> library(RobLox)
> x <- rnorm(20)
> est <- roblox(x)
> est
> confint(est)
>
> But, the idea behind these robust methods is a normal location and scale
> model surrounded by a neighborhood (Tukey's gross error model). That is,
> your data should stem from some distribution
>
> Q = (1-eps) N(mean, sd) + eps H
>
> where H is an arbitrary (unknown) probability measure (possibly a Dirac
> measure at some point). However, you only can get reasonable estimates
> of mean and sd, if eps < 0.5.
> Therefore, I would propose that you try a different (robust) model. In
> particular, I would say that the 0 values in your case are not really
> outliers in the above sense and should be modelled as well.
>
> For some other robust parametric model (e.g. Binomial or Poisson) you
> could try our packages ROptEst (see http://www.stamats.de/RRlong.pdf,
> http://arxiv.org/abs/0901.3531) and distrMod (see
> http://www.jstatsoft.org/v35/i10/).
>
> Best,
> Matthias
>
>
> On 07.07.2015 at 14:06, Serge Sharoff wrote:
>> Hello,
>>
>> My question is about applying robust location and scale estimates to
>> word frequencies. Some words are prone to frequency spikes in a small
>> number of documents; there was a paper showing that if the probability
>> of seeing a word like 'gastric' in a document is p, then the
>> probability of seeing its second occurrence in a document is close to
>> p/2 rather than the expected p^2, so traditional stats overestimate
>> the frequency of such words.
>>
>> I want to experiment with robust statistics on word frequency lists, but
>> here I come across a problem that most words do not occur in most of the
>> documents, so that their medians and MADs are zero. In my sample
>> dataset the only word with a non-zero median frequency is the word
>> 'correct' (as an adjective). Here are some examples:
>>
>>> load(url("http://corpus.leeds.ac.uk/serge/frq-example.Rdata"))
>>> summary(vocab)
>>    correct_J          correct_V          gastric_J           moon_N
>> Min.   :   0.00   Min.   :   0.00   Min.   :   0.000   Min.   :   0.00
>> 1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   0.000   1st Qu.:   0.00
>> Median :  33.30   Median :   0.00   Median :   0.000   Median :   0.00
>> Mean   :  79.81   Mean   :  33.94   Mean   :   5.766   Mean   :  27.75
>> 3rd Qu.:  91.09   3rd Qu.:  27.96   3rd Qu.:   0.000   3rd Qu.:  18.55
>> Max.   :3105.59   Max.   :3897.79   Max.   :2957.449   Max.   :4143.39
>>      moon_V         thoroughly_R      toothbrush_N
>> Min.   : 0.0000   Min.   :   0.00   Min.   :   0.000
>> 1st Qu.: 0.0000   1st Qu.:   0.00   1st Qu.:   0.000
>> Median : 0.0000   Median :   0.00   Median :   0.000
>> Mean   : 0.3154   Mean   :  28.64   Mean   :   2.894
>> 3rd Qu.: 0.0000   3rd Qu.:  30.02   3rd Qu.:   0.000
>> Max.   :79.3730   Max.   :2028.40   Max.   :1046.025
>>
>> The rows correspond to a measure of word frequencies in each document in
>> a collection.
>>
>> This mailing list had a couple of suggestions on a similar topic:
>> https://stat.ethz.ch/pipermail/r-sig-robust/2009/000284.html
>> https://stat.ethz.ch/pipermail/r-sig-robust/2011/000318.html
>> suggesting the use of huberM, but the huberM estimates for this dataset
>> are still zeros, while the use of
>>
>> huberM(a,s=mean(abs(a - median(a))))
>>
>> doesn't help since the medians are zero, so s is reduced to mean(a).
>>
>> Trimmed means definitely help, but we need a principled way to estimate
>> the amount of trimming, since 'gastric' occurs in just 65 documents out
>> of 4054, and 'moon' as a verb in only 32, necessitating a very low
>> trimming threshold to avoid ignoring such words altogether.
>>
>> Does robustbase offer any more principled way for estimation of location
>> and its confidence intervals in such cases?
>>
>> Best,
>> Serge
>>
>>
>>
>> _______________________________________________
>> R-SIG-Robust using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>>
>