[BioC] Statistics for next-generation sequencing transcriptomics
Michael Dondrup
Michael.Dondrup at bccs.uib.no
Fri Jul 24 16:00:03 CEST 2009
Hi Michael,
I am having a very similar problem using 454 seq data, so I am very
much interested in this discussion. However, I do not quite
understand how to form the contingency table and how to arrive at
such a small p-value here. My
naive approach would be to count hits to GeneA and to count hits to
the rest of the genome (all - #hits to gene A), giving a pretty much
unbalanced 2x2 table like this:
> mat
         Sample.1 Sample.2
Gene.A      22000    43000
The.rest   238000   464000
but then I do not see the point, because this gives a large p-value,
as I would expect:
> fisher.test(mat)

	Fisher's Exact Test for Count Data

data:  mat
p-value = 0.7717
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.9805937 1.0145920
sample estimates:
odds ratio
 0.9974594
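
For anyone who wants to reproduce this, here is a minimal sketch of how the table above could be built from the counts quoted in the original post. The variable names gene_hits, total_reads and res are only illustrative, and the "The.rest" row is assumed to be the total reads minus the Gene A hits, as described above.

```r
# Counts taken from the thread: reads matching Gene A and total reads per sample.
gene_hits   <- c(Sample.1 = 22000,  Sample.2 = 43000)
total_reads <- c(Sample.1 = 260000, Sample.2 = 507000)

# Row 1: hits to Gene A; row 2: all remaining reads (total minus Gene A hits).
mat <- rbind(Gene.A = gene_hits, The.rest = total_reads - gene_hits)
print(mat)

# Two-sided Fisher's exact test on the 2x2 table.
res <- fisher.test(mat)
print(res$p.value)   # p-value of the test
print(res$estimate)  # conditional maximum-likelihood estimate of the odds ratio
```

With the table laid out this way, the proportions of Gene A reads are about 8.5% in both samples, which is consistent with the large p-value shown above.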
Am I missing something?
Best
Michael
On 24.07.2009 at 13:22, michael watson (IAH-C) wrote:
> Hi
>
> I'd like to have a discussion about statistics for transcriptomics
> using next-generation sequencing (if there hasn't already been one -
> if there has, then please someone point me to it!)
>
> What we're seeing in the literature, and here at IAH, are datasets
> where someone has sequenced the transcriptome of two samples using
> something like Illumina. These have been mapped to known sequences
> and counts produced.
>
> So what we have is something like this:
>
> geneA: 22000 sequences from 260000 match in sample 1, 43000
> sequences from 507000 in sample 2.
>
> It's been suggested that one possible approach would be to construct
> 2x2 contingency tables and perform Fisher's exact test or the Chi-
> squared test, as has been applied to SAGE data.
> However, I've found that when I do that, the p-values for this type
> of data are incredibly, incredibly small, such that over 90% of my
> data points are significant, even after adjusting for multiple
> testing. I assume/hope that this is because these tests were not
> designed to cope with this type of data.
>
> For instance, applying Fisher's test to the example above yields a p-
> value of 3.798644e-23.
>
> As I see it there are three possibilities:
> 1) I'm doing something wrong
> 2) These tests are totally inappropriate for this type of data
> 3) All of my data points are highly significantly different
>
> I'm thinking that 2 is probably true, though I wouldn't rule out 1.
>
> Any thoughts and comments are very welcome,
>
> Mick
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor