[BioC] edgeR: common vs tagwise dispersion
Mark Robinson
mrobinson at wehi.EDU.AU
Wed Apr 21 12:32:09 CEST 2010
Hi Ann.
See comments below.
On 2010-04-21, at 12:18 AM, Ann Hess wrote:
> I am using edgeR to look for differentially abundant “segments”
> between two groups (data generated using high throughput sequencing).
> I have 3 (pooled) biological reps per group and a total of 18760
> segments (83 rows with zero count are removed by edger).
>
> As a first approach, I used the common dispersion method and found the
> estimated common dispersion to be 0.135.
The common dispersion value shows substantial biological variation between your replicates, with a CV of nearly 40% in expression levels between samples. This is typical of what we have observed when the biological replicates are separate individuals or animals.
> After looking at the top 10
> segments, I find that there tends to be a single large value
> (different for each segment) that is bringing up the logFC.
We've seen this in other datasets also. It seems to be a type of biological variation that RNA-seq makes very evident.
> I tried using moderated tagwise dispersion (using prior.n=50 and 25)
> and found that the results are largely the same as common dispersion
> approach (not shown).
The prior.n values you're choosing are basically too large. We generally recommend around 30-50 prior df. Since you have 4 residual df per segment, this translates to prior.n=10. You could even go a little lower, but not prior.n=0.
> When I look at the tagwise dispersion values
> for the top 10 hits, I find that the estimated tagwise dispersion
> values are greater that the estimated common dispersion (not shown).
>
> To look into things further, I ran the same analysis but now with
> prior.n=0 (no moderation/squeezing). The top 10 hits are now
> completely different and the estimated tagwise dispersion values for
> the top 10 are very small. (Looking at the top 10 seems to suggest
> that I could use a Poisson distribution.)
We don't recommend no moderation. There are just too few df to estimate the dispersion reliably for individual transcripts. The top segments in such a list will naturally tend to have small dispersions, because you're essentially selecting for this. It isn't necessarily evidence that the Poisson distribution is a good fit.
> Questions:
> 1. Should I be concerned that the results are so different depending
> on whether common dispersion (almost equivalent to moderated tagwise
> dispersion) or no-moderation tagwise dispersion is used? Based on
> FDR<0.05, there is only about 10% overlap between the two approaches.
No, this is to be expected. You are comparing the two extremes of dispersion estimation. The reason we developed moderated methods is that no-moderation is just not reliable.
> 2. I’m not sure how to interpret the tagwise dispersion values for the
> top hits: common dispersion method picks up segments with large
> tagwise dispersion, no moderation method picks up segments with small
> tagwise dispersion.
>
> I am using edgeR_1.4.7 with R version 2.10.1.
You'd probably be better off choosing a smaller prior.n which gives you a more even handed compromise between common and tagwise dispersion.
Hope that helps.
Cheers,
Mark
>> #COMMON DISPERSION APPROACH
>> library(edgeR)
>> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description)
>> df$samples
> group lib.size
> C1 0 4488940
> C2 0 2437107
> C3 0 2600316
> T1 1 3935852
> T2 1 3806079
> T3 1 3913694
>> df <- estimateCommonDisp(df)
>> df$common.dispersion
> [1] 0.1346658
>> df.com<-exactTest(df)
> Comparison of groups: 1 – 0
>> CDtop10<-topTags(df.com)$table
>> CDtop10[,-1]
> logConc logFC PValue FDR
> 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33
> 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28
> 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26
> 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24
> 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20
> 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18
> 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15
> 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15
> 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15
> 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14
>> CDtopIDs<-as.numeric(row.names(CDtop10))
>> df$counts[CDtopIDs,]
> C1 C2 C3 T1 T2 T3
> [1,] 31 36 28 45 57 22440
> [2,] 77 45 61 22745 47 55
> [3,] 49 68 85 76 210 21738
> [4,] 35 3729 34 37 22 32
> [5,] 28 69 3636 25 25 89
> [6,] 4 2 3 7 25 668
> [7,] 2 3 188 1 0 2
> [8,] 51 8 23 301 296 1619
> [9,] 12 10 652 14 13 11
> [10,] 4 5 3 540 6 10
>
>
>> #TAGWISE DISPERSION APPROACH
>> fprior <- estimateSmoothing(df)
>> fprior
> [1] 6329.643
>
> #I also tried prior.n=25 and prior.n=50, but results not shown.
>> df<-estimateTagwiseDisp(df, prior.n = 0)
>> quantile(df$tagwise.dispersion)
> 0% 25% 50% 75% 100%
> 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02
>> df.tgw<-exactTest(df,common.disp=FALSE)
>> TGWtop10<-topTags(df.tgw)$table
>> TGWtop10[,-1]
> logConc logFC PValue FDR
> 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21
> 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18
> 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11
> 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11
> 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11
> 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11
> 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09
> 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09
> 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09
> 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08
>> TGWtopIDs<-as.numeric(row.names(TGWtop10))
>> df$counts[TGWtopIDs,]
> C1 C2 C3 T1 T2 T3
> [1,] 685 346 337 340 340 313
> [2,] 382 199 256 121 103 142
> [3,] 97 59 55 29 36 35
> [4,] 171 91 111 87 73 72
> [5,] 54 24 35 12 8 14
> [6,] 629 295 345 354 352 347
> [7,] 138 58 84 41 47 38
> [8,] 200 89 96 93 75 87
> [9,] 231 138 157 149 115 128
> [10,] 145 87 81 74 75 53
>
>> df$tagwise.dispersion[TGWtopIDs]
> [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
> [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
>> df$tagwise.dispersion[CDtopIDs]
> [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105
> [7] 3.2246046 0.6054731 2.0582920 2.1059294
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}
More information about the Bioconductor
mailing list