[BioC] questions on edgeR package

Wed Jan 29 01:05:48 CET 2014

Dear Ming,

Something is seriously wrong -- you shouldn't get these warnings, you 
shouldn't get such a large dispersion estimate and, if you do, you 
shouldn't get such small p-values.

I suspect that the culprit is RSEM.  edgeR is designed to accept raw read 
counts rather than estimated abundance levels as might be output from RSEM 
or Cufflinks.

There are a number of tools that produce genewise read counts from RNA-seq 
data. My lab routinely uses subread and featureCounts:

  http://www.ncbi.nlm.nih.gov/pubmed/24227677

which are available from the Bioconductor package Rsubread.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Tue, 28 Jan 2014, Ming Yi wrote:

>
> Hi,
>
> I have a few questions on edgeR package. I used it a while ago. 
> Recently, I started to use it again for an ongoing project and realized 
> that it has been updated a lot with many new feaures.
>
> The data I am working on has 29 pairs of tumor and matched normal tissue 
> samples from the same patients that have been run through RNAseq, the 
> data has been processed by the popular RSEM method to be raw counts for 
> each sample (tumor or macthed normal tissue of each patients). So I have 
> a matrix of raw counts for 58 columns of data for all genes. we are 
> looking for differential expressed genes between tumors and normals. 
> Here are some main commands I used, eliminating some basic commands for 
> simplicity:
>
>> show(head(targets));
>  SampleName      Subject   Type
> 1  T4626_01A 38_4626  Tumor
> 2  N4626_11A 38_4626 Normal
> 3  T2668_01A 44_2668  Tumor
> 4  N2668_11A 44_2668 Normal
> ......
>> Patient<-factor(targets$Subject);
>> Type<-factor(targets$Type);
> ......
> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>> show(dim(y));
> [1] 20531    58
>
>> #filter low expressed genes
>> y<-y[rowSums(y$counts)>=50,]
>> #Recompute the library sizes:
>> y$samples$lib.size <- colSums(y$counts)
>> #Use Entrez Gene IDs as row names:
>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>> ##Apply TMM normalization
>> y <- calcNormFactors(y,method=c("TMM"));
>> show(y$samples);
>          group lib.size norm.factors
> T4626_01A     1 97435132    0.9834587
> N4626_11A     1 62583672    0.9154837
> T2668_01A     1 77034746    0.9687860
> N2668_11A     1 58631311    0.8644352
> ......
>> design<-model.matrix(~Patient+Type);
>> rownames(design) <- colnames(y);
>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
> Disp = 3.99994 , BCV = 2
> There were 50 or more warnings (use warnings() to see the first 50)
>> show(warnings());
> Warning messages:
> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>  non-integer x = 4.520000
> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>  non-integer x = 5.380000
> ....
>> y <- estimateGLMTrendedDisp(y, design);
> Loading required package: splines
> There were 50 or more warnings (use warnings() to see the first 50)
>> y <- estimateGLMTagwiseDisp(y, design)
> There were 50 or more warnings (use warnings() to see the first 50)
>> fit <- glmFit(y, design);
>> lrt <- glmLRT(fit);
>> show(topTags(lrt));
> Coefficient:  TypeTumor
>        symbols entrezID        gene_id     logFC    logCPM       LR PValue FDR
> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0   0
> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0   0
> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0   0
> .......
>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>> cpmRslt<-cpm(y)
>> cpmRslt<-cpmRslt[rownames(Rslt),];
>> show(cpmRslt[1:10,1:6]);
>          T4626_01A    N4626_11A    T2668_01A    N2668_11A   T6145_01A
> 6532      2.0558647 2.140002e+02   0.02679881 1.568574e+01   0.2147611
> 1301      1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
> 5047     13.4413898 1.221761e-01   7.62426085 3.946099e-02  60.7774032
> .....
>> show(Rslt[1:10,]);
> Coefficient:  TypeTumor
>        symbols entrezID        gene_id     logFC    logCPM       LR PValue FDR
> 6532     SLC6A4     6532    SLC6A4|6532 -7.941097  6.329882 2946.202      0   0
> 1301    COL11A1     1301   COL11A1|1301  7.535040  6.150749 2105.109      0   0
> 5047       PAEP     5047      PAEP|5047  7.083690  5.655034 1498.046      0   0
> .....
>> show(summary(de <- decideTestsDGE(lrt)));
>   [,1]
> -1 6179
> 0  4832
> 1  7204
>
> My questions are below:
>
> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower expressed using 50 as cutoff here,kind of arbitrary.  is there any better way to assess the cutoff value? can any one suggest a good cutoff based on his or her experience in such setting?
>
> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why we got warning here (see warning message as above inside my commands), any issue on the warnings?
>
> 3. The top gene lists shown in both cpmRslt and  Rslt are quite consistent, for example, gene 6532 always have tumor samples consistently much lower in cpm counts compared to that of matched normal samples, which is consistent with the logFC result in Rslt. So from the top genes, the result looks very promising. However, when I look into the full table of Rslt,  there are so many genes appear to be differential expressed (see result of summary(de <- decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in the dataset) in this table with FDR <0.05, which make me a bit nervous considering the issues as I asked in last two questions. Is this what was usually seen in tumor vs normal comparison. This is a paired test of course, which may increase the power and may see more differntial genes. what if I try pooled tumors vs pooled normals disregard of matched tumor vs normal sample pairs? which one shall be better for RNA-seq data, paired test vs pooled test?
>
> 4. For pooled test, I may use exactTest(), also based on the guide, there is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm model. anything like that worthy of trying? Any advice?
>
> 5. I did MDS plot, which looks nice in that the tumor and normal are 
> separated very well in one side vs other. In the plotSmear plot, I did 
> see some red spots inside the black spots zones in the center zone 
> (supposed to be lower logFC), why they were called as significant? May 
> this explain why we get so many DEGs >13k of them? Or large variations 
> across samples?
>
> Thanks so much in advance for your help!
>
> Mike
> NIH
> Maryland, USA.
>
>> show(sessionInfo());
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C                 LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
> other attached packages:
> [1] edgeR_3.4.2  limma_3.16.8
>
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}