[BioC] questions on edgeR package
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Jan 29 01:05:48 CET 2014
Dear Ming,
Something is seriously wrong -- you shouldn't get these warnings, you
shouldn't get such a large dispersion estimate and, if you do, you
shouldn't get such small p-values.
I suspect that the culprit is RSEM. edgeR is designed to accept raw read
counts rather than estimated abundance levels as might be output from RSEM
or Cufflinks.
There are a number of tools that produce genewise read counts from RNA-seq
data. My lab routinely uses subread and featureCounts:
http://www.ncbi.nlm.nih.gov/pubmed/24227677
which are available from the Bioconductor package Rsubread.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Tue, 28 Jan 2014, Ming Yi wrote:
>
> Hi,
>
> I have a few questions on edgeR package. I used it a while ago.
> Recently, I started to use it again for an ongoing project and realized
> that it has been updated a lot with many new feaures.
>
> The data I am working on has 29 pairs of tumor and matched normal tissue
> samples from the same patients that have been run through RNAseq, the
> data has been processed by the popular RSEM method to be raw counts for
> each sample (tumor or macthed normal tissue of each patients). So I have
> a matrix of raw counts for 58 columns of data for all genes. we are
> looking for differential expressed genes between tumors and normals.
> Here are some main commands I used, eliminating some basic commands for
> simplicity:
>
>> show(head(targets));
> SampleName Subject Type
> 1 T4626_01A 38_4626 Tumor
> 2 N4626_11A 38_4626 Normal
> 3 T2668_01A 44_2668 Tumor
> 4 N2668_11A 44_2668 Normal
> ......
>> Patient<-factor(targets$Subject);
>> Type<-factor(targets$Type);
> ......
> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>> show(dim(y));
> [1] 20531 58
>
>> #filter low expressed genes
>> y<-y[rowSums(y$counts)>=50,]
>> #Recompute the library sizes:
>> y$samples$lib.size <- colSums(y$counts)
>> #Use Entrez Gene IDs as row names:
>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>> ##Apply TMM normalization
>> y <- calcNormFactors(y,method=c("TMM"));
>> show(y$samples);
> group lib.size norm.factors
> T4626_01A 1 97435132 0.9834587
> N4626_11A 1 62583672 0.9154837
> T2668_01A 1 77034746 0.9687860
> N2668_11A 1 58631311 0.8644352
> ......
>> design<-model.matrix(~Patient+Type);
>> rownames(design) <- colnames(y);
>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
> Disp = 3.99994 , BCV = 2
> There were 50 or more warnings (use warnings() to see the first 50)
>> show(warnings());
> Warning messages:
> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
> non-integer x = 4.520000
> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
> non-integer x = 5.380000
> ....
>> y <- estimateGLMTrendedDisp(y, design);
> Loading required package: splines
> There were 50 or more warnings (use warnings() to see the first 50)
>> y <- estimateGLMTagwiseDisp(y, design)
> There were 50 or more warnings (use warnings() to see the first 50)
>> fit <- glmFit(y, design);
>> lrt <- glmLRT(fit);
>> show(topTags(lrt));
> Coefficient: TypeTumor
> symbols entrezID gene_id logFC logCPM LR PValue FDR
> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 0
> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 0
> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 0
> .......
>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>> cpmRslt<-cpm(y)
>> cpmRslt<-cpmRslt[rownames(Rslt),];
>> show(cpmRslt[1:10,1:6]);
> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A
> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611
> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032
> .....
>> show(Rslt[1:10,]);
> Coefficient: TypeTumor
> symbols entrezID gene_id logFC logCPM LR PValue FDR
> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 0
> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 0
> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 0
> .....
>> show(summary(de <- decideTestsDGE(lrt)));
> [,1]
> -1 6179
> 0 4832
> 1 7204
>
> My questions are below:
>
> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower expressed using 50 as cutoff here,kind of arbitrary. is there any better way to assess the cutoff value? can any one suggest a good cutoff based on his or her experience in such setting?
>
> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why we got warning here (see warning message as above inside my commands), any issue on the warnings?
>
> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, for example, gene 6532 always have tumor samples consistently much lower in cpm counts compared to that of matched normal samples, which is consistent with the logFC result in Rslt. So from the top genes, the result looks very promising. However, when I look into the full table of Rslt, there are so many genes appear to be differential expressed (see result of summary(de <- decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in the dataset) in this table with FDR <0.05, which make me a bit nervous considering the issues as I asked in last two questions. Is this what was usually seen in tumor vs normal comparison. This is a paired test of course, which may increase the power and may see more differntial genes. what if I try pooled tumors vs pooled normals disregard of matched tumor vs normal sample pairs? which one shall be better for RNA-seq data, paired test vs pooled test?
>
> 4. For pooled test, I may use exactTest(), also based on the guide, there is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm model. anything like that worthy of trying? Any advice?
>
> 5. I did MDS plot, which looks nice in that the tumor and normal are
> separated very well in one side vs other. In the plotSmear plot, I did
> see some red spots inside the black spots zones in the center zone
> (supposed to be lower logFC), why they were called as significant? May
> this explain why we get so many DEGs >13k of them? Or large variations
> across samples?
>
> Thanks so much in advance for your help!
>
> Mike
> NIH
> Maryland, USA.
>
>> show(sessionInfo());
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> [8] base
> other attached packages:
> [1] edgeR_3.4.2 limma_3.16.8
>
>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list