[BioC] questions on edgeR package
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Feb 6 06:39:50 CET 2014
Dear Ming,
I have said this in a separate thread, but thought I should summarize it
for the original thread.
After some investigation, it is clear that edgeR's problem with the TCGA
RSEM "raw counts" is that they are not integers. edgeR is a likelihood-
based package and it evaluates the negative binomial likelihood, which is
always exactly zero at non-integer values. Hence edgeR cannot estimate
the dispersion sensibly if any of the input counts are fractional.
If you round the RSEM counts to integers, then edgeR will run ok.
I have made some changes to the edgeR GLM code on Bioc-devel so that it
does this rounding automatically. (The edgeR classic code was doing this
anyway.) Note that this work-around is only sensible for input values
that are estimates of counts.
Best wishes
Gordon
On Wed, 29 Jan 2014, Gordon K Smyth wrote:
> Dear Ming,
>
> Something is seriously wrong -- you shouldn't get these warnings, you
> shouldn't get such a large dispersion estimate and, if you do, you shouldn't
> get such small p-values.
>
> I suspect that the culprit is RSEM. edgeR is designed to accept raw read
> counts rather than estimated abundance levels as might be output from RSEM or
> Cufflinks.
>
> There are a number of tools that produce genewise read counts from RNA-seq
> data. My lab routinely uses subread and featureCounts:
>
> http://www.ncbi.nlm.nih.gov/pubmed/24227677
>
> which are available from the Bioconductor package Rsubread.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.statsci.org/smyth
>
> On Tue, 28 Jan 2014, Ming Yi wrote:
>
>>
>> Hi,
>>
>> I have a few questions on edgeR package. I used it a while ago. Recently, I
>> started to use it again for an ongoing project and realized that it has
>> been updated a lot with many new feaures.
>>
>> The data I am working on has 29 pairs of tumor and matched normal tissue
>> samples from the same patients that have been run through RNAseq, the data
>> has been processed by the popular RSEM method to be raw counts for each
>> sample (tumor or macthed normal tissue of each patients). So I have a
>> matrix of raw counts for 58 columns of data for all genes. we are looking
>> for differential expressed genes between tumors and normals. Here are some
>> main commands I used, eliminating some basic commands for simplicity:
>>
>>> show(head(targets));
>> SampleName Subject Type
>> 1 T4626_01A 38_4626 Tumor
>> 2 N4626_11A 38_4626 Normal
>> 3 T2668_01A 44_2668 Tumor
>> 4 N2668_11A 44_2668 Normal
>> ......
>>> Patient<-factor(targets$Subject);
>>> Type<-factor(targets$Type);
>> ......
>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>>> show(dim(y));
>> [1] 20531 58
>>
>>> #filter low expressed genes
>>> y<-y[rowSums(y$counts)>=50,]
>>> #Recompute the library sizes:
>>> y$samples$lib.size <- colSums(y$counts)
>>> #Use Entrez Gene IDs as row names:
>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>>> ##Apply TMM normalization
>>> y <- calcNormFactors(y,method=c("TMM"));
>>> show(y$samples);
>> group lib.size norm.factors
>> T4626_01A 1 97435132 0.9834587
>> N4626_11A 1 62583672 0.9154837
>> T2668_01A 1 77034746 0.9687860
>> N2668_11A 1 58631311 0.8644352
>> ......
>>> design<-model.matrix(~Patient+Type);
>>> rownames(design) <- colnames(y);
>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
>> Disp = 3.99994 , BCV = 2
>> There were 50 or more warnings (use warnings() to see the first 50)
>>> show(warnings());
>> Warning messages:
>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>> non-integer x = 4.520000
>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>> non-integer x = 5.380000
>> ....
>>> y <- estimateGLMTrendedDisp(y, design);
>> Loading required package: splines
>> There were 50 or more warnings (use warnings() to see the first 50)
>>> y <- estimateGLMTagwiseDisp(y, design)
>> There were 50 or more warnings (use warnings() to see the first 50)
>>> fit <- glmFit(y, design);
>>> lrt <- glmLRT(fit);
>>> show(topTags(lrt));
>> Coefficient: TypeTumor
>> symbols entrezID gene_id logFC logCPM LR PValue
>> FDR
>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0
>> 0
>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0
>> 0
>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0
>> 0
>> .......
>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>>> cpmRslt<-cpm(y)
>>> cpmRslt<-cpmRslt[rownames(Rslt),];
>>> show(cpmRslt[1:10,1:6]);
>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A
>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611
>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032
>> .....
>>> show(Rslt[1:10,]);
>> Coefficient: TypeTumor
>> symbols entrezID gene_id logFC logCPM LR PValue
>> FDR
>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0
>> 0
>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0
>> 0
>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0
>> 0
>> .....
>>> show(summary(de <- decideTestsDGE(lrt)));
>> [,1]
>> -1 6179
>> 0 4832
>> 1 7204
>>
>> My questions are below:
>>
>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower
>> expressed using 50 as cutoff here,kind of arbitrary. is there any better
>> way to assess the cutoff value? can any one suggest a good cutoff based on
>> his or her experience in such setting?
>>
>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives
>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why
>> we got warning here (see warning message as above inside my commands), any
>> issue on the warnings?
>>
>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent,
>> for example, gene 6532 always have tumor samples consistently much lower in
>> cpm counts compared to that of matched normal samples, which is consistent
>> with the logFC result in Rslt. So from the top genes, the result looks very
>> promising. However, when I look into the full table of Rslt, there are so
>> many genes appear to be differential expressed (see result of summary(de <-
>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in
>> the dataset) in this table with FDR <0.05, which make me a bit nervous
>> considering the issues as I asked in last two questions. Is this what was
>> usually seen in tumor vs normal comparison. This is a paired test of
>> course, which may increase the power and may see more differntial genes.
>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs
>> normal sample pairs? which one shall be better for RNA-seq data, paired
>> test vs pooled test?
>>
>> 4. For pooled test, I may use exactTest(), also based on the guide, there
>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm
>> model. anything like that worthy of trying? Any advice?
>>
>> 5. I did MDS plot, which looks nice in that the tumor and normal are
>> separated very well in one side vs other. In the plotSmear plot, I did see
>> some red spots inside the black spots zones in the center zone (supposed to
>> be lower logFC), why they were called as significant? May this explain why
>> we get so many DEGs >13k of them? Or large variations across samples?
>>
>> Thanks so much in advance for your help!
>>
>> Mike
>> NIH
>> Maryland, USA.
>>
>>> show(sessionInfo());
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> attached base packages:
>> [1] splines stats graphics grDevices utils datasets methods
>> [8] base
>> other attached packages:
>> [1] edgeR_3.4.2 limma_3.16.8
>>
>>
>>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list