[BioC] DEXSeq questions on power and counting bins
Wolfgang Huber
whuber at embl.de
Tue Jul 9 22:04:21 CEST 2013
Dear Gu
Thanks for the clear and insightful questions, and sorry for the delay in replying (travel time).
On Jul 2, 2013, at 8:48 pm, Gu [guest] <guest at bioconductor.org> wrote:
> I am using DEXSeq for testing differential exon usage (DEU) between two conditions, each having 3 biological replicates. The design matrix contains a covariate called "subject", i.e. the same subject had both the control and treatment. I have three questions:
>
> (1) In the vignette of DEXSeq, in Figure 3 the author compared the adjusted p-values from two tests: without the batch effect (x-axis) and with the batch effect (y-axis). Clearly from the plot, the p-values are smaller when the batch effect is accounted for. However, I don't know if we can conclude from such a plot that "with the type-aware analysis, detection power for DEU due to condition is improved"?
If we believe that type I error (as measured e.g. by FDR or FPR) is approximately the same in both cases, then a longer list of rejections means more power. You can convince yourself of the premise of this conclusion e.g. by applying the same method to data with known negative controls, to data that are actually biological replicates (so where you know that all genes are not d.e.), or to simulated data. The DEXSeq paper in Genome Research (PMID 22722343) and its supplement contain some examples of that exercise.
> It is from a real data analysis, so how do we know the significant genes are really true positives?
> BTW, in my analysis, after accounting for the subject effect, the number of genes with DEU increases from 14 to 24, and the plot for comparing the p-values are similar to Figure 3. Will properly accounting for covariates in DEXSeq always lead to such a conclusion (i.e. increased detection power)?
Yes, although this is then not more than a circular definition of "properly". "Improperly" accounting for irrelevant covariates will loose degrees of freedom and thus loose power.
> (2) I got the HTML outputs using DEXSeqHTML() in R. For each gene with DEU, I can see different plot options: counts, expression, splicing and transcripts. By only looking at the plot with differentially expressed exon(s) in color, it seems that the conclusion is only based on "splicing" as I can see the "distance" between the two conditions, but such distance is very small in other plot options. In the vignette, the author also recommend specifying "splicing=TRUE". Can I know what are the differences among those options, and which one is preferred to use (making more biological sense)?
I think you are pointing to a documentation deficit. We'll fix that. The reports should provide tooltips or otherwise explanations what these different views mean.
- counts: the raw data, for each sample
- expression: the fitted coefficients per compared condition (e.g.: treated, untreated)
- splicing: as 'expression', but after removing overall gene-level differential expression: this is the view most relevant for the interpretation of DEXSeq results, which are about changes in relative exon usage (i.e.: relative to overall gene expression)
> (3) DEXSeq's inference is based on "counting bins", i.e. not real exons but exonic regions redefined from GTF file. My question is, once I obtain a gene with DEU (ENSG00000056558), how can I know which *real exon* are deferentially used? From Ensembl website, this gene has 3 transcripts (splice variants), but in the DEXSeqHTML output, it has 11 exonic regions, and E010 shows DEU -- how can I tell E010 in this case correponds to which real exon?
>
This depends on how you created your ExonCountSet. In the case e.g. of pasillaExons, the boundaries of each bin are annotated in the featureData slot:
library(pasilla)
data(pasillaExons)
head(fData(pasillaExons)[,-(3:7)])
geneID exonID padjust chr start end strand
FBgn0000256:E001 FBgn0000256 E001 NA chr2L 3872658 3872947 -
FBgn0000256:E002 FBgn0000256 E002 NA chr2L 3873019 3873322 -
FBgn0000256:E003 FBgn0000256 E003 NA chr2L 3873385 3874395 -
FBgn0000256:E004 FBgn0000256 E004 NA chr2L 3874450 3875302 -
FBgn0000256:E005 FBgn0000256 E005 NA chr2L 3878895 3879067 -
FBgn0000256:E006 FBgn0000256 E006 NA chr2L 3879652 3880038 -
The same should be true of any procedure that creates these objects.
Best wishes
Wolfgang
> Thank you so much for your suggestions!
>
> -- output of sessionInfo():
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
> [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] DEXSeq_1.6.0 Biobase_2.20.0 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5
> [4] GenomicRanges_1.12.4 hwriter_1.3 IRanges_1.18.1
> [7] RCurl_1.95-4.1 Rsamtools_1.12.3 statmod_1.4.17
> [10] stats4_3.0.1 stringr_0.6.2 tools_3.0.1
> [13] XML_3.96-1.1 zlibbioc_1.6.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list