[BioC] Statistics for next-generation sequencing transcriptomics

Zeljko Debeljak zeljko.debeljak at gmail.com
Sat Jul 25 10:11:06 CEST 2009


Has anyone considered the possibility to use some resampling
statistics concepts for these purposes? I have found some interesting
alternatives to Chi2 and similar tests based on the resampling
approach in the book "Data Analysis by Resampling: Concepts and
Applications" by C.E. Lunneborg. It seems that such approaches are
"resilient" to the type of the underlying distribution. I must say I
didn't use these methods for this particular type of problem but I do
have some positive experience with other applications of resampling in
the statistical hypothesis testing field.

Zeljko Debeljak, PhD
Medical Biochemistry Specialist
Osijek Clinical Hospital
CROATIA

2009/7/25, Margaret Taub <mtaub at stat.berkeley.edu>:
>
> I've recently spent some time investigating this topic and on the data
> set I have examined, where a Poisson model seems to hold very well
> across technical replicate lanes (based on chi-squared goodness-of-fit
> statistics), Fisher's exact test, the likelihood ratio test, and the
> binomial test all perform equivalently in terms of identifying
> differentially expressed genes and in terms of producing very low p-
> values.  I also compared against a recently developed method for SAGE
> analysis, the edgeR package of Robinson and Smyth, but perhaps due to
> the almost exact Poisson variability in my data, their method did not
> perform better than the others. (The method is designed to work
> particularly well on overdispersed Poisson data.)  And using just
> mapped reads for the lane totals (suggested in this thread by Michael
> Dondrup) did not have much effect - the counts were still very high.
>
> I think overall that Naomi is right - in this Poisson context,
> frequentist methods will produce very small p-values.  I do think that
> part of the problem is that most RNASeq data sets produced today are
> not capturing biological variability, so we are really just doing a
> one-versus-one comparison even when we have "replicate" lanes (which
> are almost exclusively technical replicates, or at best replicates on
> cell lines) and hopefully once this is taken into account more
> reasonable test statistics will be seen.
>
> I also agree with John Herbert's point about transcript length having
> an effect here, and I'm not sure that a good solution has been
> developed for this yet either, see Oshlack and Wakefield's recent
> paper for a discussion:
> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2678084&tool=pmcentrez
> .  They point out that for Poisson data, dividing by transcript length
> does not fix the length-bias problem.
>
> Cheers,
> Margaret
>
>
> Margaret Taub
> PhD Candidate
> UC Berkeley Dept of Statistics
>
>
>
>>> From: Zeljko Debeljak <zeljko.debeljak at gmail.com>
>>> Date: July 24, 2009 9:21:40 AM PDT
>>> To: Michael Dondrup <Michael.Dondrup at bccs.uib.no>
>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Subject: Re: [BioC] Statistics for next-generation sequencing
>>> transcriptomics
>>>
>>> Dear all,
>>>
>>> according to the described problem maybe binomial test could be
>>> designed for this purpose? Just a thought.
>>>
>>> Zeljko Debeljak, PhD
>>> Medical Biochemistry Specialist
>>> Osijek Clinical Hospital
>>> CROATIA
>>>
>>> 2009/7/24, Michael Dondrup <Michael.Dondrup at bccs.uib.no>:
>>>> Hi again,
>>>>
>>>> I see the point now :)
>>>> Just another idea, did you take the number of all reads sequenced as
>>>> a total, or the number of all reads successfully mapped to the
>>>> genome/
>>>> reference sequence? What happens, when only the mapped reads are
>>>> counted?
>>>> I suspect that the choice of this could influence whatever method is
>>>> applied, because large fractions of unmapped reads (given there are
>>>> such) could -artificially?- increase the confidence in the result.
>>>>
>>>> Michael
>>>>
>>>>
>>>> Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C):
>>>>
>>>>> Hi Michael
>>>>>
>>>>> No, you're not missing anything, I wrote down my example
>>>>> incorrectly.  I wrote down the elements of the contingency table
>>>>> rather than the totals, so it should have been:
>>>>>
>>>>>> mat <- matrix(c(22000,260000,48000,507000), nrow=2)
>>>>>> mat
>>>>>     [,1]   [,2]
>>>>> [1,]  22000  48000
>>>>> [2,] 260000 507000
>>>>>> fisher.test(mat)
>>>>>
>>>>>      Fisher's Exact Test for Count Data
>>>>>
>>>>> data:  mat
>>>>> p-value < 2.2e-16
>>>>> alternative hypothesis: true odds ratio is not equal to 1
>>>>> 95 percent confidence interval:
>>>>> 0.8789286 0.9087655
>>>>> sample estimates:
>>>>> odds ratio
>>>>> 0.8937356
>>>>>
>>>>> Sorry about that!
>>>>>
>>>>> This is a case where I suspect there is a real difference, as the
>>>>> relative frequency rises from 0.084 to 0.094.  However, as I
>>>>> mentioned, this result is masked by all the other "significant"
>>>>> results.  As Naomi says, it is because as the sample size gets
>>>>> larger, we have the power to detect tiny changes as significant.
>>>>>
>>>>> So what is the solution?
>>>>>
>>>>> John Herbert has suggested something, and I will try that.
>>>>>
>>>>> Thanks
>>>>> Michael
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no]
>>>>> Sent: 24 July 2009 15:00
>>>>> To: michael watson (IAH-C)
>>>>> Cc: bioconductor at stat.math.ethz.ch
>>>>> Subject: Re: [BioC] Statistics for next-generation sequencing
>>>>> transcriptomics
>>>>>
>>>>> Hi Michael,
>>>>> I am having a very similar problem using 454 seq data, so I am very
>>>>> much interested in this discussion. However,  I do not quite
>>>>> understand how to
>>>>> for the contigency table and to achieve such small p-value here. My
>>>>> naive approach would be to count hits to GeneA and to count hits to
>>>>> the rest of the genome (all - #hits to gene A), giving a pretty
>>>>> much
>>>>> unbalanced 2x2 table like this:
>>>>>> mat
>>>>>        Sample.1 Sample.2
>>>>> Gene.A      22000    43000
>>>>> The.rest   238000   464000
>>>>> but then I do not see the point here, because there is a large p
>>>>> value, as I would expect:
>>>>>
>>>>>> fisher.test(mat)
>>>>>
>>>>> 	Fisher's Exact Test for Count Data
>>>>>
>>>>> data:  mat
>>>>> p-value = 0.7717
>>>>> alternative hypothesis: true odds ratio is not equal to 1
>>>>> 95 percent confidence interval:
>>>>> 0.9805937 1.0145920
>>>>> sample estimates:
>>>>> odds ratio
>>>>> 0.9974594
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> Best
>>>>> Michael
>>>>>
>>>>>
>>>>> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C):
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I'd like to have a discussion about statistics for transcriptomics
>>>>>> using next-generation sequencing (if there hasn't already been
>>>>>> one -
>>>>>> if there has, then please someone point me to it!)
>>>>>>
>>>>>> What we're seeing in the literature, and here at IAH, are datasets
>>>>>> where someone has sequenced the transcriptome of two samples using
>>>>>> something like Illumina.  These have been mapped to known
>>>>>> sequences
>>>>>> and counts produced.
>>>>>>
>>>>>> So what we have is something like this:
>>>>>>
>>>>>> geneA: 22000 sequences from 260000 match in sample 1, 43000
>>>>>> sequences from 507000 in sample 2.
>>>>>>
>>>>>> It's been suggested that one possible approach would be to
>>>>>> construct
>>>>>> 2x2 contingency tables and perform Fisher's exact test or the Chi-
>>>>>> squared test, as has been applied to SAGE data.
>>>>>> However, I've found that when I do that, the p-values for this
>>>>>> type
>>>>>> of data are incredibly, incredibly small, such that over 90% of my
>>>>>> data points are significant, even after adjusting for multiple
>>>>>> testing.  I assume/hope that this is because these tests were not
>>>>>> designed to cope with this type of data.
>>>>>>
>>>>>> For instance, applying Fisher's test to the example above yields
>>>>>> a p-
>>>>>> value of 3.798644e-23.
>>>>>>
>>>>>> As I see it there are three possibilities:
>>>>>> 1) I'm doing something wrong
>>>>>> 2) These tests are totally inappropriate for this type of data
>>>>>> 3) All of my data points are highly significantly different
>>>>>>
>>>>>> I'm thinking that 2 is probably true, though I wouldn't rule out
>>>>>> 1.
>>>>>>
>>>>>> Any thoughts and comments are very welcome,
>>>>>>
>>>>>> Mick
>>>>>>
>>>>>>
>>>>>> 	[[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>>
>>>>
>>>> Michael Dondrup, Ph.D.
>>>> Bergen Center for Computational Science
>>>> Computational Biology Unit
>>>> Unifob AS - Thormøhlensgate 55, N-5008 Bergen, Norway
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list