[BioC] Statistics for next-generation sequencing transcriptomics

Fri Jul 24 17:25:18 CEST 2009

Hi Michael

No, you're not missing anything; I wrote down my example incorrectly. I wrote down the elements of the contingency table rather than the totals, so it should have been:

> mat <- matrix(c(22000,260000,48000,507000), nrow=2)
> mat
       [,1]   [,2]
[1,]  22000  48000
[2,] 260000 507000
> fisher.test(mat)

        Fisher's Exact Test for Count Data

data:  mat 
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1 
95 percent confidence interval:
 0.8789286 0.9087655 
sample estimates:
odds ratio 
 0.8937356 

Sorry about that!

This is a case where I suspect there is a real difference, as the relative frequency rises from 0.084 to 0.094. However, as I mentioned, this result is masked by all the other "significant" results. As Naomi says, this is because, as the sample size gets larger, we gain the power to detect tiny changes as significant.
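
To illustrate the sample-size point, here is a minimal sketch (not a proposed solution). It reruns the test on the table above and on a hypothetical copy with every count divided by 1000, so the proportions are unchanged; the name mat_small and the factor of 1000 are assumptions made purely for illustration.

```r
# Corrected table from this message: columns are the two samples.
mat <- matrix(c(22000, 260000, 48000, 507000), nrow = 2)

# Hypothetical shallow-sequencing version of the same experiment:
# identical proportions, every count divided by 1000.
mat_small <- round(mat / 1000)

full  <- fisher.test(mat)
small <- fisher.test(mat_small)

# The estimated odds ratios are close to each other ...
c(full = unname(full$estimate), small = unname(small$estimate))

# ... but only the large table gives a tiny p-value.
c(full = full$p.value, small = small$p.value)
```

The effect size is essentially the same in both tables; only the number of reads behind it differs, which is why the p-value alone says little about whether a change is large enough to matter.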

So what is the solution?

John Herbert has suggested something, and I will try that.

Thanks
Michael

-----Original Message-----
From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] 
Sent: 24 July 2009 15:00
To: michael watson (IAH-C)
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Statistics for next-generation sequencing transcriptomics

Hi Michael,
I am having a very similar problem using 454 seq data, so I am very  
much interested in this discussion. However, I do not quite  
understand how to form the contingency table and achieve such a  
small p-value here. My  
naive approach would be to count hits to GeneA and to count hits to  
the rest of the genome (all - #hits to gene A), giving a pretty much  
unbalanced 2x2 table like this:
 > mat
          Sample.1 Sample.2
Gene.A      22000    43000
The.rest   238000   464000
but then I do not see the point here, because there is a large  
p-value, as I would expect:

 > fisher.test(mat)

	Fisher's Exact Test for Count Data

data:  mat
p-value = 0.7717
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.9805937 1.0145920
sample estimates:
odds ratio
  0.9974594
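
For reference, the table above can be rebuilt from the hit counts and the per-sample totals roughly as follows. This is only a sketch: make_table is a hypothetical helper written for this example, not a function from any package, and it assumes that 260000 and 507000 are the total numbers of mapped reads per sample.

```r
# Hypothetical helper: build the 2x2 table from per-sample hit counts
# and per-sample totals ("the rest" = total - hits).
make_table <- function(hits, totals, gene = "Gene.A") {
  stopifnot(length(hits) == 2, length(totals) == 2, all(hits <= totals))
  matrix(c(hits, totals - hits), nrow = 2, byrow = TRUE,
         dimnames = list(c(gene, "The.rest"),
                         paste0("Sample.", 1:2)))
}

tab <- make_table(hits = c(22000, 43000), totals = c(260000, 507000))
tab

# Same test as above, run on the rebuilt table.
fisher.test(tab)$p.value
```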

Am I missing something?

Best
Michael

Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C):

> Hi
>
> I'd like to have a discussion about statistics for transcriptomics  
> using next-generation sequencing (if there hasn't already been one -  
> if there has, then please someone point me to it!)
>
> What we're seeing in the literature, and here at IAH, are datasets  
> where someone has sequenced the transcriptome of two samples using  
> something like Illumina.  These have been mapped to known sequences  
> and counts produced.
>
> So what we have is something like this:
>
> geneA: 22000 sequences from 260000 match in sample 1, 43000  
> sequences from 507000 in sample 2.
>
> It's been suggested that one possible approach would be to construct  
> 2x2 contingency tables and perform Fisher's exact test or the  
> Chi-squared test, as has been applied to SAGE data.
> However, I've found that when I do that, the p-values for this type  
> of data are incredibly, incredibly small, such that over 90% of my  
> data points are significant, even after adjusting for multiple  
> testing.  I assume/hope that this is because these tests were not  
> designed to cope with this type of data.
>
> For instance, applying Fisher's test to the example above yields a  
> p-value of 3.798644e-23.
>
> As I see it there are three possibilities:
> 1) I'm doing something wrong
> 2) These tests are totally inappropriate for this type of data
> 3) All of my data points are highly significantly different
>
> I'm thinking that 2 is probably true, though I wouldn't rule out 1.
>
> Any thoughts and comments are very welcome,
>
> Mick