[BioC] questions on edgeR package

Thu Feb 6 23:15:36 CET 2014

Ming,

Given RSEM counts, I would personally use the limma-voom for preference 
because it doesn't require you to round the counts.  But you could use 
either in your case.  I said the same to you yesterday:

https://www.stat.math.ethz.ch/pipermail/bioconductor/2014-February/057538.html

Gordon

On Thu, 6 Feb 2014, Yi, Ming (NIH/NCI) [C] wrote:

>
> Dear Dr. Smyth:
>
> Thanks so much for your help and advice. Indeed, I did try to use edger 
> on rounded RSEM processed raw counts, result is quite different, here 
> are the comparisons FYI:
>
> Not rounded:
>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
> Disp = 3.99994 , BCV = 2
> After rounded the RSEM raw counts to integers:
>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
> Disp = 0.32356 , BCV = 0.5688
>
>
> Not rounded:
>> show(summary(de <- decideTestsDGE(lrt)));
>   [,1]
> -1 6179
> 0  4832
> 1  7204
> After rounded the RSEM raw counts to integers:
>> show(summary(de <- decideTestsDGE(lrt)));
>   [,1]
> -1 4948
> 0  7778
> 1  5490
>
> So observation is: after rounded, the dispersion is much smaller, and 
> DEGs also has about 3k less genes, which appears to be promising.
>
> Then compared with limma-voom result of the same data:
> show(summary(de <- decideTests(fit)));
> ...
>   PatientTCGA_91_6836 TypeTumor
> -1                  13      4917
> 0                18186      7276
> 1                   16      6022
>
>
> The number of EDGs seems rather close to that when using edgeR on 
> rounded RSEM raw counts. But what I am trying to ask your opinion here: 
> in these situations, you would like more on the limma-voom result or the 
> edgeR result on rounded RSEM raw counts? Since I used limma a lot on 
> array data, but not much except this one on RNA-seq data.
>
> Thanks again for all your help and advice!
>
> Best,
>
> Ming
>
>
>
>
>
> -----Original Message-----
> From: Gordon K Smyth [mailto:smyth at wehi.edu.au]
> Sent: Thursday, February 06, 2014 12:40 AM
> To: Ming Yi
> Cc: Bioconductor mailing list
> Subject: Re: [BioC] questions on edgeR package
>
> Dear Ming,
>
> I have said this in a separate thread, but thought I should summarize it
> for the original thread.
>
> After some investigation, it is clear that edgeR's problem with the TCGA
> RSEM "raw counts" is that they are not integers.  edgeR is a likelihood-
> based package and it evaluates the negative binomial likelihood, which is
> always exactly zero at non-integer values.  Hence edgeR cannot estimate
> the dispersion sensibly if any of the input counts are fractional.
>
> If you round the RSEM counts to integers, then edgeR will run ok.
>
> I have made some changes to the edgeR GLM code on Bioc-devel so that it
> does this rounding automatically.  (The edgeR classic code was doing this
> anyway.)  Note that this work-around is only sensible for input values
> that are estimates of counts.
>
> Best wishes
> Gordon
>
> On Wed, 29 Jan 2014, Gordon K Smyth wrote:
>
>> Dear Ming,
>>
>> Something is seriously wrong -- you shouldn't get these warnings, you
>> shouldn't get such a large dispersion estimate and, if you do, you shouldn't
>> get such small p-values.
>>
>> I suspect that the culprit is RSEM.  edgeR is designed to accept raw read
>> counts rather than estimated abundance levels as might be output from RSEM or
>> Cufflinks.
>>
>> There are a number of tools that produce genewise read counts from RNA-seq
>> data. My lab routinely uses subread and featureCounts:
>>
>> http://www.ncbi.nlm.nih.gov/pubmed/24227677
>>
>> which are available from the Bioconductor package Rsubread.
>>
>> Best wishes
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Tue, 28 Jan 2014, Ming Yi wrote:
>>
>>>
>>> Hi,
>>>
>>> I have a few questions on edgeR package. I used it a while ago. Recently, I
>>> started to use it again for an ongoing project and realized that it has
>>> been updated a lot with many new feaures.
>>>
>>> The data I am working on has 29 pairs of tumor and matched normal tissue
>>> samples from the same patients that have been run through RNAseq, the data
>>> has been processed by the popular RSEM method to be raw counts for each
>>> sample (tumor or macthed normal tissue of each patients). So I have a
>>> matrix of raw counts for 58 columns of data for all genes. we are looking
>>> for differential expressed genes between tumors and normals. Here are some
>>> main commands I used, eliminating some basic commands for simplicity:
>>>
>>>> show(head(targets));
>>>  SampleName      Subject   Type
>>> 1  T4626_01A 38_4626  Tumor
>>> 2  N4626_11A 38_4626 Normal
>>> 3  T2668_01A 44_2668  Tumor
>>> 4  N2668_11A 44_2668 Normal
>>> ......
>>>> Patient<-factor(targets$Subject);
>>>> Type<-factor(targets$Type);
>>> ......
>>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>>>> show(dim(y));
>>> [1] 20531    58
>>>
>>>> #filter low expressed genes
>>>> y<-y[rowSums(y$counts)>=50,]
>>>> #Recompute the library sizes:
>>>> y$samples$lib.size <- colSums(y$counts)
>>>> #Use Entrez Gene IDs as row names:
>>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>>>> ##Apply TMM normalization
>>>> y <- calcNormFactors(y,method=c("TMM"));
>>>> show(y$samples);
>>>          group lib.size norm.factors
>>> T4626_01A     1 97435132    0.9834587
>>> N4626_11A     1 62583672    0.9154837
>>> T2668_01A     1 77034746    0.9687860
>>> N2668_11A     1 58631311    0.8644352
>>> ......
>>>> design<-model.matrix(~Patient+Type);
>>>> rownames(design) <- colnames(y);
>>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
>>> Disp = 3.99994 , BCV = 2
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> show(warnings());
>>> Warning messages:
>>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>>  non-integer x = 4.520000
>>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>>  non-integer x = 5.380000
>>> ....
>>>> y <- estimateGLMTrendedDisp(y, design);
>>> Loading required package: splines
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> y <- estimateGLMTagwiseDisp(y, design)
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> fit <- glmFit(y, design);
>>>> lrt <- glmLRT(fit);
>>>> show(topTags(lrt));
>>> Coefficient:  TypeTumor
>>>        symbols entrezID        gene_id     logFC    logCPM       LR PValue
>>> FDR
>>> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0
>>> 0
>>> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0
>>> 0
>>> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0
>>> 0
>>> .......
>>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>>>> cpmRslt<-cpm(y)
>>>> cpmRslt<-cpmRslt[rownames(Rslt),];
>>>> show(cpmRslt[1:10,1:6]);
>>>          T4626_01A    N4626_11A    T2668_01A    N2668_11A   T6145_01A
>>> 6532      2.0558647 2.140002e+02   0.02679881 1.568574e+01   0.2147611
>>> 1301      1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
>>> 5047     13.4413898 1.221761e-01   7.62426085 3.946099e-02  60.7774032
>>> .....
>>>> show(Rslt[1:10,]);
>>> Coefficient:  TypeTumor
>>>        symbols entrezID        gene_id     logFC    logCPM       LR PValue
>>> FDR
>>> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0
>>> 0
>>> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0
>>> 0
>>> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0
>>> 0
>>> .....
>>>> show(summary(de <- decideTestsDGE(lrt)));
>>>   [,1]
>>> -1 6179
>>> 0  4832
>>> 1  7204
>>>
>>> My questions are below:
>>>
>>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower
>>> expressed using 50 as cutoff here,kind of arbitrary.  is there any better
>>> way to assess the cutoff value? can any one suggest a good cutoff based on
>>> his or her experience in such setting?
>>>
>>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives
>>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why
>>> we got warning here (see warning message as above inside my commands), any
>>> issue on the warnings?
>>>
>>> 3. The top gene lists shown in both cpmRslt and  Rslt are quite consistent,
>>> for example, gene 6532 always have tumor samples consistently much lower in
>>> cpm counts compared to that of matched normal samples, which is consistent
>>> with the logFC result in Rslt. So from the top genes, the result looks very
>>> promising. However, when I look into the full table of Rslt,  there are so
>>> many genes appear to be differential expressed (see result of summary(de <-
>>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in
>>> the dataset) in this table with FDR <0.05, which make me a bit nervous
>>> considering the issues as I asked in last two questions. Is this what was
>>> usually seen in tumor vs normal comparison. This is a paired test of
>>> course, which may increase the power and may see more differntial genes.
>>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs
>>> normal sample pairs? which one shall be better for RNA-seq data, paired
>>> test vs pooled test?
>>>
>>> 4. For pooled test, I may use exactTest(), also based on the guide, there
>>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm
>>> model. anything like that worthy of trying? Any advice?
>>>
>>> 5. I did MDS plot, which looks nice in that the tumor and normal are
>>> separated very well in one side vs other. In the plotSmear plot, I did see
>>> some red spots inside the black spots zones in the center zone (supposed to
>>> be lower logFC), why they were called as significant? May this explain why
>>> we get so many DEGs >13k of them? Or large variations across samples?
>>>
>>> Thanks so much in advance for your help!
>>>
>>> Mike
>>> NIH
>>> Maryland, USA.
>>>
>>>> show(sessionInfo());
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C                 LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> attached base packages:
>>> [1] splines   stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>> other attached packages:
>>> [1] edgeR_3.4.2  limma_3.16.8

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}