[BioC] edgeR: common vs tagwise dispersion
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Apr 22 01:19:51 CEST 2010
Dear Ann,
On 2010-04-21, at 12:18 AM, Ann Hess wrote:
> I am using edgeR to look for differentially abundant segments between
> two groups (data generated using high throughput sequencing). I have 3
> (pooled) biological reps per group and a total of 18760 segments (83
> rows with zero count are removed by edger).
>
> As a first approach, I used the common dispersion method and found the
> estimated common dispersion to be 0.135.
Your common dispersion value shows substantial biological variation
between your replicates, with a CV of nearly 40% in expression levels
between samples. This is typical of what we have observed when the
biological replicates are separate individuals or animals.
> After looking at the top 10 segments, I find that there tends to be a
> single large value (different for each segment) that is bringing up the
> logFC.
We've seen this in other datasets also. It seems to be a type of
biological variation that RNA-seq makes very evident.
> I tried using moderated tagwise dispersion (using prior.n=50 and 25) and
> found that the results are largely the same as common dispersion
> approach (not shown).
The prior.n values you're choosing are basically too large. We generally
recommend around 30-50 prior df. Since you have 4 residual df per
segment, this translates to prior.n=10. You could even go a little lower,
but not prior.n=0.
> When I look at the tagwise dispersion values for the top 10 hits, I find
> that the estimated tagwise dispersion values are greater that the
> estimated common dispersion (not shown).
>
> To look into things further, I ran the same analysis but now with
> prior.n=0 (no moderation/squeezing). The top 10 hits are now completely
> different and the estimated tagwise dispersion values for the top 10 are
> very small. (Looking at the top 10 seems to suggest that I could use a
> Poisson distribution.)
We don't recommend no moderation. There are just too few df to estimate
the dispersion reliable for individual transcripts. The top segments in
such a list will naturally tend to have small dispersions, because you're
essentially selecting for this. It isn't good evidence that the Poisson
distribution is a good fit.
> Questions:
> 1. Should I be concerned that the results are so different depending
> on whether common dispersion (almost equivalent to moderated tagwise
> dispersion) or no-moderation tagwise dispersion is used? Based on
> FDR<0.05, there is only about 10% overlap between the two approaches.
No, this is to be expected. The reason we developed moderated methods is
that no-moderation is just not reliable.
> 2. Im not sure how to interpret the tagwise dispersion values for the
> top hits: common dispersion method picks up segments with large tagwise
> dispersion, no moderation method picks up segments with small tagwise
> dispersion.
In a genome wide experiment, some segments will have large and some will
have very small sample variances just by chance. Furthermore, some
features are genuinely hypervariable between individuals. The best
strategy is likely to be to choose a smaller prior.n, which will give you
a more even handed compromise between common and tagwise dispersion.
This will downweight the segments with very large variation between
individuals in the toplist, while still giving stable estimates of tagwise
dispersion.
Best wishes
Gordon
-----------------------------------------------
Associate Professor Gordon K Smyth,
NHMRC Senior Research Fellow,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
> I am using edgeR_1.4.7 with R version 2.10.1.
>
>> #COMMON DISPERSION APPROACH
>> library(edgeR)
>> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description)
>> df$samples
> group lib.size
> C1 0 4488940
> C2 0 2437107
> C3 0 2600316
> T1 1 3935852
> T2 1 3806079
> T3 1 3913694
>> df <- estimateCommonDisp(df)
>> df$common.dispersion
> [1] 0.1346658
>> df.com<-exactTest(df)
> Comparison of groups: 1 ? 0
>> CDtop10<-topTags(df.com)$table
>> CDtop10[,-1]
> logConc logFC PValue FDR
> 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33
> 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28
> 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26
> 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24
> 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20
> 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18
> 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15
> 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15
> 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15
> 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14
>> CDtopIDs<-as.numeric(row.names(CDtop10))
>> df$counts[CDtopIDs,]
> C1 C2 C3 T1 T2 T3
> [1,] 31 36 28 45 57 22440
> [2,] 77 45 61 22745 47 55
> [3,] 49 68 85 76 210 21738
> [4,] 35 3729 34 37 22 32
> [5,] 28 69 3636 25 25 89
> [6,] 4 2 3 7 25 668
> [7,] 2 3 188 1 0 2
> [8,] 51 8 23 301 296 1619
> [9,] 12 10 652 14 13 11
> [10,] 4 5 3 540 6 10
>
>
>> #TAGWISE DISPERSION APPROACH
>> fprior <- estimateSmoothing(df)
>> fprior
> [1] 6329.643
>
> #I also tried prior.n=25 and prior.n=50, but results not shown.
>> df<-estimateTagwiseDisp(df, prior.n = 0)
>> quantile(df$tagwise.dispersion)
> 0% 25% 50% 75% 100%
> 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02
>> df.tgw<-exactTest(df,common.disp=FALSE)
>> TGWtop10<-topTags(df.tgw)$table
>> TGWtop10[,-1]
> logConc logFC PValue FDR
> 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21
> 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18
> 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11
> 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11
> 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11
> 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11
> 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09
> 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09
> 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09
> 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08
>> TGWtopIDs<-as.numeric(row.names(TGWtop10))
>> df$counts[TGWtopIDs,]
> C1 C2 C3 T1 T2 T3
> [1,] 685 346 337 340 340 313
> [2,] 382 199 256 121 103 142
> [3,] 97 59 55 29 36 35
> [4,] 171 91 111 87 73 72
> [5,] 54 24 35 12 8 14
> [6,] 629 295 345 354 352 347
> [7,] 138 58 84 41 47 38
> [8,] 200 89 96 93 75 87
> [9,] 231 138 157 149 115 128
> [10,] 145 87 81 74 75 53
>
>> df$tagwise.dispersion[TGWtopIDs]
> [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
> [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
>> df$tagwise.dispersion[CDtopIDs]
> [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105
> [7] 3.2246046 0.6054731 2.0582920 2.1059294
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list