[R] strange behavior of lchoose in combinatorics problem

Stefan Evert stefanML at collocations.de
Wed Jun 29 08:43:08 CEST 2016


Dear Gonçalo,

thanks for the additional information – I think I now understand what you're trying to do.

> On 27 Jun 2016, at 06:35, Gonçalo Ferraz <gferraz29 at gmail.com> wrote:
> 
> probabilities in lpvec should be <=1, but it is not. The sum is something on the order of 1.48e-13.”
> It is absolutely right that lpvec doesn’t contain probabilities, it contains log-probabilities that are negative numbers. Also, clearly, the value of 1.48e-13 is <=1!
> What I meant is that I wanted to add probabilities in the log-probability space

Does that make any sense?  Adding log probabilities corresponds to taking the product of the probabilities for k = 0 .. J-1, and I can't think of a reasonable interpretation of this product. 

Or do you perform a log-space addition, so you actually compute the log of the sum of probabilities?
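For reference, log-space addition is usually done with the "logsumexp" trick, which factors out the largest log-probability so that exp() never underflows. A minimal sketch in R (the function name `log_sum_exp` is my own, not from your code):

```r
# Log-space addition: computes log(sum(exp(lp))) for a vector of
# log-probabilities lp, without underflowing to -Inf.
log_sum_exp <- function(lp) {
  m <- max(lp)
  m + log(sum(exp(lp - m)))
}

# Example: adding the probabilities 0.2 and 0.3 in log space
log_sum_exp(log(c(0.2, 0.3)))   # equals log(0.5)
```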

> and that I expected a negative value as a result. Instead, I am getting a positive value.

If you're performing log-space addition, then that's no surprise at all.  The cumulative probability for k <= J that you're computing is equal to 1 up to 43 decimal places – a difference much smaller than the rounding errors you're accumulating with your formula (which unnecessarily uses 5 lchoose terms instead of the usual 3, making things even worse).

If you're directly adding the log probabilities, then I can't reproduce your result at all:

> sum(lpvec)
[1] -2835.86

In this case, I suspect that there's a mistake in the parts of the code you didn't show us.

> You are also right that if I wanted to test for a positive association I should add the probabilities of k>=J to obtain a p-value. This is clear to me, but I want a metric of the strength of association between animals that takes higher values when the association is stronger. A final result in the log-probability space is useful because I want to use the strength of association to build networks of individuals.

What we usually do in computational linguistics is to take -log(p-value) as a measure of association because it has the desired properties: large values indicate stronger association, and it's measured in log-probability space (again, http://www.collocations.de/AM/section3.html explains this approach near the top of the page).

This measure and Dunning's related log-likelihood ratio are both used quite successfully for co-occurrences of words in my field.

> Finally, your idea of using Fisher’s exact test is very appealing for the simplicity and availability of a working function in R, but I am having trouble interpreting the result, or the result is different from that of my formula. Here’s a simple example:
> 
> Say N=2, N1=1, N2=1, and J=1. In this case, if the individuals are completely independent of each other, I expect to see them together with a probability of 0.5. That is the output from my function, in probability space. If I run Fisher’s exact test on the corresponding contingency table, however, I get a p-value of 1.

There are two possible outcomes of this experiment, J=0 and J=1, with equal probability 0.5.  Fisher's test computes the tail probability of the observed outcome + all equally or less probable outcomes, i.e. P(J=1) + P(J=0) = 1.

If you carry out a one-sided test, you'll get p=0.5.
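You can check this directly in R. For your example, the 2x2 contingency table has cells J, N1-J, N2-J and N-N1-N2+J:

```r
# Contingency table for N=2, N1=1, N2=1, J=1:
# rows = animal 1 present/absent, cols = animal 2 present/absent
ct <- matrix(c(1, 0,
               0, 1), nrow = 2, byrow = TRUE)

fisher.test(ct)$p.value                           # two-sided: 1
fisher.test(ct, alternative = "greater")$p.value  # one-sided: 0.5
```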

> Why should this be, when the Fisher’s exact test is giving me the probability of obtaining the observed arrangement under the null hypothesis of no association between the animals.

A p-value isn't the probability of the observed arrangement, but the cumulative probability of the observed table + all more "extreme" tables that could have been observed (i.e. those which deviate further from the null hypothesis).
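This definition can be verified numerically with dhyper() by enumerating all tables with the given margins. Using made-up values N=20, N1=8, N2=10, J=6 (not from your actual data):

```r
# Hypothetical example: N = 20 samples, animal 1 seen N1 = 8 times,
# animal 2 seen N2 = 10 times, observed together J = 6 times.
probs <- dhyper(0:8, 8, 12, 10)   # P(J = k) for all possible k
obs   <- dhyper(6, 8, 12, 10)     # probability of the observed table

# two-sided p-value: observed table + all tables at most as probable
sum(probs[probs <= obs])
```

This matches (up to a small numerical tolerance) what fisher.test() reports as the two-sided p-value for the corresponding table.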


If you need your p-values in log-space, you can't use fisher.test() but need to compute the hypergeometric distribution directly with dhyper() / phyper().  This is easy for a one-sided test (for positive association, i.e. alternative="greater"):

	fisher.test(ct, alternative="greater")$p.value

is the same as

	phyper(J-1, N1, N-N1, N2, lower.tail=FALSE)

You can then instruct phyper() to return log probabilities and use

	-phyper(J-1, N1, N-N1, N2, lower.tail=FALSE, log.p=TRUE)

as a measure of association strength.
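As a sanity check of this equivalence, with made-up values N=20, N1=8, N2=10, J=6 (again, not from your actual data):

```r
# Hypothetical example values (not from the original problem)
N <- 20; N1 <- 8; N2 <- 10; J <- 6
ct <- matrix(c(J, N1 - J, N2 - J, N - N1 - N2 + J), nrow = 2, byrow = TRUE)

p.fisher <- fisher.test(ct, alternative = "greater")$p.value
p.hyper  <- phyper(J - 1, N1, N - N1, N2, lower.tail = FALSE)
all.equal(p.fisher, p.hyper)   # TRUE

# association score: larger values = stronger positive association
-phyper(J - 1, N1, N - N1, N2, lower.tail = FALSE, log.p = TRUE)
```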

Best,
Stefan


More information about the R-help mailing list