[BioC] DEXSeq many exons marked as not testable
Alejandro Reyes
alejandro.reyes at embl.de
Wed Sep 19 18:20:50 CEST 2012
Dear Stefan Dentro,
Thanks for your email. It actually looks strange, could you send the
output of the function "counts", and subset for this specific gene?
Could you also do it for the "fData" function??
Alejandro
> Hello,
>
> I'm using DEXSeq to obtain differentially expressed exons between my 15
> samples and 8 controls. But for some reason most exons are not testable and
> are not assigned a p-value. Do you know what I'm doing wrong?
>
> First I've created a conservative list of exons per gene in the human
> genome through the canonical transcripts table in UCSC (194511 in total).
> Next I've obtained read counts for each exon in each sample which are read
> into one big data.frame. Then the following list of commands are executed:
>
> metadata = data.frame(
> row.names=colnames(dat)[7:29],
> condition=c(rep("control", 8), rep("data", 15)),
> replicate=c(c(1:8), c(1:15)),
> libType=rep("paired-end", 23))
>
> ## Add strand information that is not available in the read counts
> data.frame
> dat = cbind(dat, strand=rep(NA, nrow(dat)))
>
> cds = newExonCountSet(countData = dat[,c(7:29)], design = metadata,
> geneIDs = dat[,5], exonIDs = dat[,6],
> exonIntervals=dat[,c(2,3,4, 30)])
>
> cds = estimateSizeFactors(cds)
> cds = estimateDispersions(cds, minCount=2, maxExon=90)
> cds = fitDispersionFunction(cds)
> cds = testForDEU(cds)
> cds = estimatelog2FoldChanges(cds)
>
> Now lets see how many exons are expressed significantly different between
> data and controls. Surprisingly only 1850 exons are described here, not the
> full 194511:
>
> res1 = DEUresultTable(cds)
> table(res1$padjust < 0.05)
>
> FALSE TRUE
> 1424 426
>
> So I've zoomed in to an example gene. Below it can be seen that indeed all
> exons are marked as not testable. But looking at the individual read counts
> per sample (see further below) it does not make sense that they are not
> testable. Some exons indeed have a low read count and are excluded
> rightfully, but others have enough reads to base dispersion on I would
> think.
>
> If I have understood correctly testable can be false in either of these
> cases:
> - Gene has only one exon, dispersion cannot be calculated.
> - Readcounts for an exon are too low.
> - A combination of the above.
>
> But all do not apply here. Any ideas?
>
>
> fData for the gene:
> geneID exonID testable dispBeforeSharing dispFitted
> dispersion pvalue padjust chr start end strand transcripts
> 136687 ENSG00000009780 1 FALSE NA 0.2026585
> 0.2026585 NA NA 1 28052568 28052672 <NA> <NA>
> 108973 ENSG00000009780 2 FALSE NA 0.9549670
> 0.9549670 NA NA 1 28053983 28054047 <NA> <NA>
> 1100 ENSG00000009780 3 FALSE NA 1.9560912
> 1.9560912 NA NA 1 28056743 28056844 <NA> <NA>
> 1 ENSG00000009780 4 FALSE NA 0.4722995
> 0.4722995 NA NA 1 28059114 28059168 <NA> <NA>
> 22096 ENSG00000009780 5 FALSE NA 0.1146813
> 0.1146813 NA NA 1 28060542 28060694 <NA> <NA>
> 13670 ENSG00000009780 6 FALSE NA 0.1127563
> 0.1127563 NA NA 1 28071165 28071322 <NA> <NA>
> 13666 ENSG00000009780 7 FALSE NA 0.1450013
> 0.1450013 NA NA 1 28075579 28075665 <NA> <NA>
> 22095 ENSG00000009780 8 FALSE NA 0.1160550
> 0.1160550 NA NA 1 28081706 28081841 <NA> <NA>
> 13665 ENSG00000009780 9 FALSE NA 0.1129432
> 0.1129432 NA NA 1 28086037 28086138 <NA> <NA>
> 138629 ENSG00000009780 10 FALSE NA 0.1112855
> 0.1112855 NA NA 1 28087006 28087092 <NA> <NA>
>
> Read counts for the gene per sample:
> exon_id chr start end gene_id exon_nr
> sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
> 1 ENSE00000327880 1 28059114 28059168 ENSG00000009780 4
> 11 8 4 7 3 1 8 3
> 1100 ENSE00000571221 1 28056743 28056844 ENSG00000009780 3
> 1 0 1 1 0 0 0 1
> 13665 ENSE00000761751 1 28086037 28086138 ENSG00000009780 9 91
> 145 86 47 169 36 231 136
> 13666 ENSE00000761753 1 28075579 28075665 ENSG00000009780 7 19
> 62 32 18 44 23 61 38
> 13670 ENSE00000761780 1 28071165 28071322 ENSG00000009780 6 77
> 139 73 66 78 30 168 88
> 22095 ENSE00000866714 1 28081706 28081841 ENSG00000009780 8 78
> 94 52 47 40 41 121 60
> 22096 ENSE00000866716 1 28060542 28060694 ENSG00000009780 5 36
> 61 50 63 67 10 129 69
> 108973 ENSE00001625313 1 28053983 28054047 ENSG00000009780 2
> 3 5 3 0 1 0 4 2
> 136687 ENSE00001909353 1 28052568 28052672 ENSG00000009780 1 2
> 12 5 0 19 0 34 10
> 138629 ENSE00001955917 1 28087006 28087092 ENSG00000009780 10
> 42 96 50 24 98 23 117 79
> sample9 sample10 sample11 sample12 sample13 sample14 sample15
> sample16 sample17 sample18 sample19 sample20 sample21 sample22
> 1 1 1 0 3 4 1 1
> 0 3 7 0 0 1 0
> 1100 2 0 0 0 0 1 1
> 0 0 3 1 0 0 0
> 13665 78 64 130 55 66 27 75
> 53 49 126 36 37 48 53
> 13666 7 20 51 6 43 8 15
> 19 11 26 7 15 11 5
> 13670 75 85 186 79 86 58 70
> 86 47 149 82 49 66 38
> 22095 64 64 149 51 91 37 67
> 54 58 148 54 41 64 43
> 22096 90 79 287 66 97 29 75
> 92 58 126 67 55 71 51
> 108973 1 0 0 1 1 0 1
> 1 0 4 0 0 0 0
> 136687 5 28 31 6 27 5 4
> 12 5 32 6 7 9 16
> 138629 119 111 233 100 129 59 141
> 103 70 219 92 60 74 67
> sample23 strand
> 1 0 NA
> 1100 0 NA
> 13665 24 NA
> 13666 7 NA
> 13670 27 NA
> 22095 25 NA
> 22096 21 NA
> 108973 1 NA
> 136687 1 NA
> 138629 54 NA
>
> Design of the experiment:
> condition replicate libType
> sample1 control 1 paired-end
> sample2 control 2 paired-end
> sample3 control 3 paired-end
> sample4 control 4 paired-end
> sample5 control 5 paired-end
> sample6 control 6 paired-end
> sample7 control 7 paired-end
> sample8 control 8 paired-end
> sample9 data 1 paired-end
> sample10 data 2 paired-end
> sample11 data 3 paired-end
> sample12 data 4 paired-end
> sample13 data 5 paired-end
> sample14 data 6 paired-end
> sample15 data 7 paired-end
> sample16 data 8 paired-end
> sample17 data 9 paired-end
> sample18 data 10 paired-end
> sample19 data 11 paired-end
> sample20 data 12 paired-end
> sample21 data 13 paired-end
> sample22 data 14 paired-end
> sample23 data 15 paired-end
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
> LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
> [6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C
> LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] plyr_1.7.1 DEXSeq_1.2.1 BiocInstaller_1.4.3
> DESeq_1.8.3 locfit_1.5-7 Biobase_2.16.0
> [7] BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.34.0 AnnotationDbi_1.18.0 biomaRt_2.12.0
> DBI_0.2-5 genefilter_1.38.0 geneplotter_1.34.0
> [7] grid_2.15.0 hwriter_1.3 IRanges_1.14.2
> lattice_0.20-6 RColorBrewer_1.0-5 RCurl_1.91-1
> [13] RSQLite_0.11.1 splines_2.15.0 statmod_1.4.15
> stats4_2.15.0 stringr_0.6 survival_2.36-12
> [19] tools_2.15.0 XML_3.9-4 xtable_1.7-0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list