[R] How to do poisson distribution test like this?

Wed Jul 29 05:21:00 CEST 2009

I have had a look at the paper
(http://www.plantphysiol.org/cgi/content/full/148/3/1189) to try to work out
what they're actually doing, in the hope that I might be able to figure out
their procedure so we can give a more complete answer to the question.
Unfortunately, I'm more confused now than before.

The numbers the OP refers to are in table I:
http://www.plantphysiol.org/cgi/content/full/148/3/1189/TBL1

The first four rows are (abbreviating the column titles):

Chromosome  Length(Mb)  Obs_No._Genes  Exp_No._Genes  Distribution_Test
LG I          32.16          36             30             0.137
LG II         23.44          25             22             0.235
LG III        17.45          13             17             0.235
LG IV         15.08          10             14             0.159

They are (somehow) working out the probability above or below the observed
count, depending on whether the observed count happened to be above or below
the expected count! That is, they pick the tail after seeing the data, which
means their p-values are on average about half what they should be.
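
To make the "about half" point concrete, here is a minimal sketch (not the paper's procedure) that computes the exact rejection rate of a test that picks whichever one-sided Poisson tail is smaller and compares it with alpha. The expected count of 30 and alpha = 0.05 are illustrative assumptions, and the expected count is treated as known exactly.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def size_of_pick_a_tail_test(lam, alpha, k_max=500):
    """Exact probability, under the null, that the smaller of the two
    one-sided tail probabilities P(X <= k) and P(X >= k) is <= alpha."""
    size = 0.0
    lower = 0.0                  # running value of P(X <= k - 1)
    for k in range(k_max + 1):
        pk = poisson_pmf(k, lam)
        upper = 1.0 - lower      # P(X >= k)
        lower += pk              # now P(X <= k)
        if min(lower, upper) <= alpha:
            size += pk           # k falls in the rejection region
    return size

LAM = 30.0    # illustrative expected count (assumed known, not estimated)
ALPHA = 0.05  # nominal significance level

size = size_of_pick_a_tail_test(LAM, ALPHA)
print(f"nominal level {ALPHA}, actual rejection rate {size:.4f}")
```

Because the count is discrete, the actual rate falls somewhere between alpha and 2*alpha rather than at exactly 2*alpha, but it is above the nominal level, which is the problem with choosing the tail after the fact.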

However, they don't seem to be using Poisson probabilities (as they claim):
I can't reproduce the results with a Poisson distribution. Nor can I
reproduce them with a normal approximation, nor with a normal approximation
with continuity correction. It's not at all clear how they've got their
numbers, but they definitely don't seem to have done what they claim to have
done (see below).
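
For anyone who wants to repeat the attempt, the following is a rough sketch of the three candidate calculations (Poisson tail, normal approximation, normal approximation with continuity correction) for the four rows quoted above. The function names are made up for illustration, the tail is taken on the side where the observed count lies, and the expected counts are used as printed in the table, so they may be rounded versions of what the authors actually used.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k); returns 0 for k < 0."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def candidate_p_values(obs, exp):
    """One-sided 'at least as extreme' probabilities, tail chosen by the data."""
    if obs >= exp:
        pois = 1.0 - poisson_cdf(obs - 1, exp)   # P(X >= obs)
    else:
        pois = poisson_cdf(obs, exp)             # P(X <= obs)
    dist = abs(obs - exp)
    sd = math.sqrt(exp)                          # Poisson: variance = mean
    return {
        "poisson": pois,
        "normal": normal_upper_tail(dist / sd),
        "normal_cc": normal_upper_tail(max(dist - 0.5, 0.0) / sd),
    }

# (chromosome, observed, expected, reported value), copied from Table I
ROWS = [
    ("LG I", 36, 30, 0.137),
    ("LG II", 25, 22, 0.235),
    ("LG III", 13, 17, 0.235),
    ("LG IV", 10, 14, 0.159),
]

for name, obs, exp, reported in ROWS:
    p = candidate_p_values(obs, exp)
    print(f"{name:7s} reported={reported:.3f}  poisson={p['poisson']:.3f}  "
          f"normal={p['normal']:.3f}  normal_cc={p['normal_cc']:.3f}")
```

Comparing each printed column with the reported column across all four rows is the check described above; a method only counts as a match if it agrees on every row.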

The note under Table 1 says (using some sort of "pseudo" LaTeX to indicate
the Greek letters):
"for distribution test $P(m(i,j) < \lambda(i,j)) <= \alpha$ or  $P(m(i,j) >
\lambda(i,j)) < \alpha$, a Poisson distribution was used to determine the
significance of the F-box gene distribution in the Populus genome"

That is, if they did what this says, they're working out the probability of a
result MORE extreme than the observed one, not "at least as extreme", which
is a second error.
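
The gap between the two definitions is exactly the probability of the observed count itself, which is not negligible for counts of this size. A small sketch, using the LG I row (observed 36, expected 30) purely as an illustration and a hypothetical helper `upper_tail`:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def upper_tail(obs, lam, inclusive):
    """P(X >= obs) if inclusive, else P(X > obs)."""
    first_excluded = obs if inclusive else obs + 1
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(first_excluded))

OBS, LAM = 36, 30   # LG I row of Table I; expected count as printed

at_least_as_extreme = upper_tail(OBS, LAM, inclusive=True)    # P(X >= 36)
more_extreme = upper_tail(OBS, LAM, inclusive=False)          # P(X > 36)

print(f"P(X >= {OBS}) = {at_least_as_extreme:.4f}")
print(f"P(X >  {OBS}) = {more_extreme:.4f}")
print(f"difference   = {at_least_as_extreme - more_extreme:.4f} "
      f"(= P(X = {OBS}) = {poisson_pmf(OBS, LAM):.4f})")
```

The strict version is always the smaller of the two, so using it makes results look more significant than they are.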

[Note that (it's stated later) the expected numbers are estimated. They don't
seem to take this into account (though the effect may be small). They're
also doing a whole bunch of tests in this paper - 19 "p-values" in table 1,
another 273 in table 3, 18 more in table 4, 28 more in table 5, another 24
in table 7 ... and so on.]
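
To give a feel for the scale of the multiple-testing issue, here is a back-of-the-envelope sketch using only the table counts listed above (so the total is a lower bound). It assumes, for illustration only, that every null hypothesis is true and that tests are run at alpha = 0.05; the Bonferroni threshold is shown as one familiar correction, not as something the paper applies.

```python
# Number of "p-values" per table, as counted above (other tables omitted).
TESTS_PER_TABLE = {
    "Table 1": 19,
    "Table 3": 273,
    "Table 4": 18,
    "Table 5": 28,
    "Table 7": 24,
}
ALPHA = 0.05  # illustrative per-test significance level

n_tests = sum(TESTS_PER_TABLE.values())

# If every null were true, this many tests would come out "significant"
# at level ALPHA by chance alone, on average.
expected_false_positives = n_tests * ALPHA

# Per-test threshold that keeps the family-wise error rate at ALPHA.
bonferroni_threshold = ALPHA / n_tests

print(f"tests counted:             {n_tests}")
print(f"expected chance 'hits':    {expected_false_positives:.1f}")
print(f"Bonferroni per-test level: {bonferroni_threshold:.6f}")
```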

Under the "Materials and Methods" section, subsection "Localization of F-Box
Genes in the Genome", they claim that the "probabilities $P(m(i) <
\lambda(i))$ and $P(m(i) > \lambda(i))$ were evaluated under the cumulative
Poisson distribution at \alpha <= 0.05 and  \alpha <= 0.01 significance
levels."

Which is weird, because the actual p-values they seem to regard as worth
mentioning in the paper are the ones at or below 0.0001.

[Is this sort of thing pretty typical for papers in biology? Don't referees
even do a basic check of one or two numbers? Apparently I put too much
effort into refereeing.]

Can anyone see what they're actually doing?