[BioC] DEXSeq - identical p-values for all exons in a gene

Fri Jan 17 04:03:52 CET 2014

Hi Dario,

  Thanks for your reply.  Your explanation is correct and makes sense, however my problem is not that I get one p-value per exon. The problem is that the exact same p-value is reported for ALL exons in the gene (or 'NA'), which is not supposed to happen (you're supposed to get different p-values for each exon).  See below:

geneID  exonID  dispersion      pvalue  padjust meanBase        log2fold(treated/control)
1/2-SBSRNA4     E001    0.11143032705478        0.910345433347734       1       30.9067618720574        0.015363670607322
1/2-SBSRNA4     E002    0.583064733453923       0.910345433347734       1       6.62219699771489        -0.131125949462748
1/2-SBSRNA4     E003    0.767713143685937       0.910345433347734       1       2.21699055323382        0.0731263380599324
A1BG    E001    0.123758213037027       0.229711870771802       1       117.754224941408        0.249562974820941
A1BG    E002    0.110317995296594       0.229711870771802       1       152.932234776212        -0.0091268347351396
A1BG    E003    0.275216588214621       0.229711870771802       1       7.30724196291172        -0.4763859471178
A1BG    E004    0.625237938618851       0.229711870771802       1       10.2174627774744        -0.623376548608748
A1BG    E005    1e+08         NA      NA      0       0.0978995620155234
A1BG    E006    0.435477490574276       0.229711870771802       1       4.18241118313392        -0.0329868528499689

I just tried running models in the old format (as per DEXSeq paper script: ) and kind of seemed to fix the problem... (Yay!??)
Old format models:
formuladispersion <- count ~ sample + (condition + type) * exon
formula0 <- count ~ sample + type * exon + condition
formula1 <- count ~ sample + type * exon + condition * I(exon == exonID)

But I'm hesitant to change to this method permanently because a) DEXSeq has been updated since the paper was published so it makes more sense to use the current recommendations (as per vignette) than old ones, and b) the old models don't even seem to be very similar to the ones in the vignette and I don't understand how/whether the differences will affect the analysis.  Also, the vignette now uses the full model for estimateDispersions, rather than a separate formula and I don't know why or what the effect of not doing this would be, either.

For the record, the current vignette models are:
formula1 <- ~ sample + exon + type:exon + condition:exon
formula0 <- ~ sample + exon + type:exon

Cheers, Kat

________________________________________

There is supposed to be one p-value per exon, because you are testing each exon separately to see if it has a different count to what you'd expect if there was no splicing. In other words, the exons with a low p-value are the ones that are included in the transcript more or less in one condition that the other, accounting for differential expression.

Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

________________________________________
From: mailinglist.honeypot at gmail.com [mailinglist.honeypot at gmail.com] On Behalf Of Steve Lianoglou [lianoglou.steve at gene.com]
Sent: Friday, 17 January 2014 9:33 AM
To: KP [guest]
Cc: bioconductor at r-project.org list; Pillman, Katherine (Health)
Subject: Re: [BioC] DEXSeq - identical p-values for all exons in a gene

Hi Katherine,

You mention running into "the same problem" no matter how the formulas
are encoded, but I do not see anywhere in your email what this problem
actually is.

Can you please explain and show explicitly what the problem actually is?

Is an error being thrown somewhere, or?

Thanks,
-steve

On Thu, Jan 16, 2014 at 2:48 PM, KP [guest] <guest at bioconductor.org> wrote:
>
> Hi All,
>
> I have been running DEXSeq successfully on my dataset using the default, in-built linear models, but have been repeatedly running into problems when attempting to add an extra 'blocking' term to the model.
>
> I have checked the vignette and I thought my usage was correct:
>
> formulaFullModel <- ~ sample + exon + libType:exon + condition:exon
> formulaReducedModel <- ~ sample + exon + libType:exon
>
> ecs <- read.HTSeqCounts(countfiles=countFiles, design=sampleTable, flattenedfile=annotationfile)
> ecs <- estimateSizeFactors(ecs)
> ecs <- estimateDispersions(ecs, formula=formulaFullModel, nCores=cores, quiet=FALSE, file='dexseq_progress_report.txt')
> ecs<- fitDispersionFunction(ecs)
>
> ecs <- testForDEU(ecs, formula0=formulaReducedModel, formula1=formulaFullModel, nCores=cores, quiet=FALSE, file='testforDEU_progress_report.txt')
>
> To check whether my problem was related to the extra term or just manually specifying the model, I ran DEXSeq using only a basic model (no additional terms: formulaFullModel <- ~ sample + exon + condition:exon
> formulaReducedModel <- ~ sample + exon ), and had the same problem. This suggests it must be something to do with how I am specifying the model or providing it to the functions...?
>
> This is the sample table setup for the basic model:
>
>       X     countFile condition
> 1    H1   H.r1.counts   control
> 2    M1   M.r1.counts   treated
> 3  H2ac H.r2ac.counts   control
> 4  M2ac M.r2ac.counts   treated
> 5 Hshap   H.NA.counts   control
> 6 Mshap   M.NA.counts   treated
>
>
> I assume the cause is something silly on my part but I am stumped.  Any help would be greatly appreciated!
>
> Thanks!
>
> P.S. I also tried running the pasilla dataset in the same way but that time it worked, normal p-values.
>
>  -- output of sessionInfo():
>
> Sorry but this session info might not be quite right - the original run was done on a server and the workspace was saved and opened in RStudio. This is the session info from RStudio.
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BiocInstaller_1.12.0 DEXSeq_1.8.0         Biobase_2.22.0       BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
>  [1] biomaRt_2.18.0       Biostrings_2.30.1    bitops_1.0-6         GenomicRanges_1.14.4 hwriter_1.3
>  [6] IRanges_1.20.6       RCurl_1.95-4.1       Rsamtools_1.14.2     statmod_1.4.18       stats4_3.0.2
> [11] stringr_0.6.2        tools_3.0.2          XML_3.98-1.1         XVector_0.2.0        zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Steve Lianoglou
Computational Biologist
Genentech