[BioC] questions on edgeR package
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Feb 6 23:15:36 CET 2014
Ming,
Given RSEM counts, I would personally use the limma-voom for preference
because it doesn't require you to round the counts. But you could use
either in your case. I said the same to you yesterday:
https://www.stat.math.ethz.ch/pipermail/bioconductor/2014-February/057538.html
Gordon
On Thu, 6 Feb 2014, Yi, Ming (NIH/NCI) [C] wrote:
>
> Dear Dr. Smyth:
>
> Thanks so much for your help and advice. Indeed, I did try to use edger
> on rounded RSEM processed raw counts, result is quite different, here
> are the comparisons FYI:
>
> Not rounded:
>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
> Disp = 3.99994 , BCV = 2
> After rounded the RSEM raw counts to integers:
>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
> Disp = 0.32356 , BCV = 0.5688
>
>
> Not rounded:
>> show(summary(de <- decideTestsDGE(lrt)));
> [,1]
> -1 6179
> 0 4832
> 1 7204
> After rounded the RSEM raw counts to integers:
>> show(summary(de <- decideTestsDGE(lrt)));
> [,1]
> -1 4948
> 0 7778
> 1 5490
>
> So observation is: after rounded, the dispersion is much smaller, and
> DEGs also has about 3k less genes, which appears to be promising.
>
> Then compared with limma-voom result of the same data:
> show(summary(de <- decideTests(fit)));
> ...
> PatientTCGA_91_6836 TypeTumor
> -1 13 4917
> 0 18186 7276
> 1 16 6022
>
>
> The number of EDGs seems rather close to that when using edgeR on
> rounded RSEM raw counts. But what I am trying to ask your opinion here:
> in these situations, you would like more on the limma-voom result or the
> edgeR result on rounded RSEM raw counts? Since I used limma a lot on
> array data, but not much except this one on RNA-seq data.
>
> Thanks again for all your help and advice!
>
> Best,
>
> Ming
>
>
>
>
>
> -----Original Message-----
> From: Gordon K Smyth [mailto:smyth at wehi.edu.au]
> Sent: Thursday, February 06, 2014 12:40 AM
> To: Ming Yi
> Cc: Bioconductor mailing list
> Subject: Re: [BioC] questions on edgeR package
>
> Dear Ming,
>
> I have said this in a separate thread, but thought I should summarize it
> for the original thread.
>
> After some investigation, it is clear that edgeR's problem with the TCGA
> RSEM "raw counts" is that they are not integers. edgeR is a likelihood-
> based package and it evaluates the negative binomial likelihood, which is
> always exactly zero at non-integer values. Hence edgeR cannot estimate
> the dispersion sensibly if any of the input counts are fractional.
>
> If you round the RSEM counts to integers, then edgeR will run ok.
>
> I have made some changes to the edgeR GLM code on Bioc-devel so that it
> does this rounding automatically. (The edgeR classic code was doing this
> anyway.) Note that this work-around is only sensible for input values
> that are estimates of counts.
>
> Best wishes
> Gordon
>
> On Wed, 29 Jan 2014, Gordon K Smyth wrote:
>
>> Dear Ming,
>>
>> Something is seriously wrong -- you shouldn't get these warnings, you
>> shouldn't get such a large dispersion estimate and, if you do, you shouldn't
>> get such small p-values.
>>
>> I suspect that the culprit is RSEM. edgeR is designed to accept raw read
>> counts rather than estimated abundance levels as might be output from RSEM or
>> Cufflinks.
>>
>> There are a number of tools that produce genewise read counts from RNA-seq
>> data. My lab routinely uses subread and featureCounts:
>>
>> http://www.ncbi.nlm.nih.gov/pubmed/24227677
>>
>> which are available from the Bioconductor package Rsubread.
>>
>> Best wishes
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Tue, 28 Jan 2014, Ming Yi wrote:
>>
>>>
>>> Hi,
>>>
>>> I have a few questions on edgeR package. I used it a while ago. Recently, I
>>> started to use it again for an ongoing project and realized that it has
>>> been updated a lot with many new feaures.
>>>
>>> The data I am working on has 29 pairs of tumor and matched normal tissue
>>> samples from the same patients that have been run through RNAseq, the data
>>> has been processed by the popular RSEM method to be raw counts for each
>>> sample (tumor or macthed normal tissue of each patients). So I have a
>>> matrix of raw counts for 58 columns of data for all genes. we are looking
>>> for differential expressed genes between tumors and normals. Here are some
>>> main commands I used, eliminating some basic commands for simplicity:
>>>
>>>> show(head(targets));
>>> SampleName Subject Type
>>> 1 T4626_01A 38_4626 Tumor
>>> 2 N4626_11A 38_4626 Normal
>>> 3 T2668_01A 44_2668 Tumor
>>> 4 N2668_11A 44_2668 Normal
>>> ......
>>>> Patient<-factor(targets$Subject);
>>>> Type<-factor(targets$Type);
>>> ......
>>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>>>> show(dim(y));
>>> [1] 20531 58
>>>
>>>> #filter low expressed genes
>>>> y<-y[rowSums(y$counts)>=50,]
>>>> #Recompute the library sizes:
>>>> y$samples$lib.size <- colSums(y$counts)
>>>> #Use Entrez Gene IDs as row names:
>>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>>>> ##Apply TMM normalization
>>>> y <- calcNormFactors(y,method=c("TMM"));
>>>> show(y$samples);
>>> group lib.size norm.factors
>>> T4626_01A 1 97435132 0.9834587
>>> N4626_11A 1 62583672 0.9154837
>>> T2668_01A 1 77034746 0.9687860
>>> N2668_11A 1 58631311 0.8644352
>>> ......
>>>> design<-model.matrix(~Patient+Type);
>>>> rownames(design) <- colnames(y);
>>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
>>> Disp = 3.99994 , BCV = 2
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> show(warnings());
>>> Warning messages:
>>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>> non-integer x = 4.520000
>>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>> non-integer x = 5.380000
>>> ....
>>>> y <- estimateGLMTrendedDisp(y, design);
>>> Loading required package: splines
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> y <- estimateGLMTagwiseDisp(y, design)
>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>> fit <- glmFit(y, design);
>>>> lrt <- glmLRT(fit);
>>>> show(topTags(lrt));
>>> Coefficient: TypeTumor
>>> symbols entrezID gene_id logFC logCPM LR PValue
>>> FDR
>>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0
>>> 0
>>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0
>>> 0
>>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0
>>> 0
>>> .......
>>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>>>> cpmRslt<-cpm(y)
>>>> cpmRslt<-cpmRslt[rownames(Rslt),];
>>>> show(cpmRslt[1:10,1:6]);
>>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A
>>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611
>>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
>>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032
>>> .....
>>>> show(Rslt[1:10,]);
>>> Coefficient: TypeTumor
>>> symbols entrezID gene_id logFC logCPM LR PValue
>>> FDR
>>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0
>>> 0
>>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0
>>> 0
>>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0
>>> 0
>>> .....
>>>> show(summary(de <- decideTestsDGE(lrt)));
>>> [,1]
>>> -1 6179
>>> 0 4832
>>> 1 7204
>>>
>>> My questions are below:
>>>
>>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower
>>> expressed using 50 as cutoff here,kind of arbitrary. is there any better
>>> way to assess the cutoff value? can any one suggest a good cutoff based on
>>> his or her experience in such setting?
>>>
>>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives
>>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why
>>> we got warning here (see warning message as above inside my commands), any
>>> issue on the warnings?
>>>
>>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent,
>>> for example, gene 6532 always have tumor samples consistently much lower in
>>> cpm counts compared to that of matched normal samples, which is consistent
>>> with the logFC result in Rslt. So from the top genes, the result looks very
>>> promising. However, when I look into the full table of Rslt, there are so
>>> many genes appear to be differential expressed (see result of summary(de <-
>>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in
>>> the dataset) in this table with FDR <0.05, which make me a bit nervous
>>> considering the issues as I asked in last two questions. Is this what was
>>> usually seen in tumor vs normal comparison. This is a paired test of
>>> course, which may increase the power and may see more differntial genes.
>>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs
>>> normal sample pairs? which one shall be better for RNA-seq data, paired
>>> test vs pooled test?
>>>
>>> 4. For pooled test, I may use exactTest(), also based on the guide, there
>>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm
>>> model. anything like that worthy of trying? Any advice?
>>>
>>> 5. I did MDS plot, which looks nice in that the tumor and normal are
>>> separated very well in one side vs other. In the plotSmear plot, I did see
>>> some red spots inside the black spots zones in the center zone (supposed to
>>> be lower logFC), why they were called as significant? May this explain why
>>> we get so many DEGs >13k of them? Or large variations across samples?
>>>
>>> Thanks so much in advance for your help!
>>>
>>> Mike
>>> NIH
>>> Maryland, USA.
>>>
>>>> show(sessionInfo());
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> attached base packages:
>>> [1] splines stats graphics grDevices utils datasets methods
>>> [8] base
>>> other attached packages:
>>> [1] edgeR_3.4.2 limma_3.16.8
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list