[BioC] questions on edgeR package
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Jan 30 01:52:48 CET 2014
Dear Ming,
If the data truly are expected counts, then limma-voom will analyse them
very well.
If the data are estimated expression levels instead of expected counts, or
have been normalized by gc content or gene length then the situation is
more difficult again. In that case a good statistical analysis is
difficult and requires knowledge of the gene lengths and gc contents.
You could try limma-voom on such data, but it will not operate at full
power without modification.
I advised you in my previous email that no package based on the negative
binomial distribution will work satisfactorily with these estimated
quantities. I see that you have written separately to Mike Love and that
he has given you the same message.
I have checked with my lab, and I was incorrect to say that we have
downloaded SRA files for TCGA -- it was actually downloaded RSEM counts
that we downloaded. I think it would be a good idea for people like you
to put pressure on TCGA to make unprocessed data available.
Best wishes
Gordon
On Wed, 29 Jan 2014, Ming Yi wrote:
> Dear Dr Smyth:
>
> Thanks so much for the advice and help. We actually found files with
> raw counts from TCGA RNAseq v1 dataset (we were discussing about V2
> RNAseq data). Unfortunately, the V1 dataset only has a portion of
> samples in V2 dataset, and V2 only has those RSEM estimated counts. we
> currently do not have the controlled access permission for fastq and bam
> files. I also searched SRA and could not find the TCGA RNAseq data (many
> whole genome or exome-seq data) there. We kind of stuck there.
>
> Thanks again for your advice and help!
>
> best
> Ming
>
>
>
>
>> Date: Wed, 29 Jan 2014 15:11:54 +1100
>> From: smyth at wehi.EDU.AU
>> To: yi02 at hotmail.com
>> CC: bioconductor at r-project.org
>> Subject: RE: questions on edgeR package
>>
>> Dear Ming,
>>
>> Thanks for the further information.
>>
>> It is obvious that numbers like 4.67 are not raw integer counts. I
>> suspect that they are posterior expected counts from RSEM. The column
>> heading "raw_count" does seem rather misleading.
>>
>> If you want to analyse these expected counts from RSEM, then limma-voom
>> would be a far preferable pipeline than edgeR (or DESeq or any other
>> negative binomial based package).
>>
>> I understand that the RSEM authors did claim that their expected counts
>> could be input to edgeR, but edgeR is designed to accept integers, and
>> your results appear to confirm some of the problems that arise. The
>> problems cannot be solved by changing the filtering.
>>
>> The raw FastQ files for the TCGA samples should be publicly available from
>> GEO and SRA. My lab has downloaded similar TCGA data as SRA files and
>> converted them to FastQ.
>>
>> Best wishes
>> Gordon
>>
>>
>> On Wed, 29 Jan 2014, Ming Yi wrote:
>>
>>> Dear Dr Smyth:
>>>
>>> Thanks for the quick response. The data is in fact TCGA RNAseq data, one of the files looks like below:
>>> file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.genes.results"
>>>
>>> its content looks below:
>>> gene_id raw_count scaled_estimate transcript_id
>>> 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1
>>> 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1
>>> 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2
>>> 4 ?|10357 218.79 1.933490e-05 uc010zzl.1
>>> 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1
>>> 6 ?|136542 0.00 0.000000e+00 uc011krn.1
>>> ......
>>> through contacting with TCGA staff, we know the column "raw_count" is the raw counts of reads after running through RSEM and the normalized data in a different file (I used the raw counts not normalized data for edgeR). The basic description of the data as below (at URL: https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/luad/cgcc/unc.edu/illuminahiseq_rnaseqv2/rnaseqv2/unc.edu_LUAD.IlluminaHiSeq_RNASeqV2.mage-tab.1.13.0/DESCRIPTION.txt)
>>> step 4 of: V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1
>>> ...The reads aligned to the reference genome sequences are translated to transcriptome coordinates
>>> prior to transcript level quantification. This step is performed using the UNC Bioinformatics Utilities (UBU)
>>> software (https://github.com/mozack/ubu). RSEM is used to estimate gene and transcript abundances
>>> (http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html). ....
>>>
>>> Unfortunately, we do not have access to the raw RNA-seq data nor bam files etc. The authors of these data files also suggest to use the raw counts as mentioned above for analysis using edgeR or DESeq. Unless there are something wrong with their procedure, we have no control for that.
>>>
>>> Assuming the data is OK, any possibility other steps in edgeR may cause this? Such as filtering step as I asked:
>>> #filter low expressed genes
>>>> y<-y[rowSums(y$counts)>=50,]
>>> this step using arbitraay cutoff 50 has filtered out quite a few genes (rows) with sum of rows in counts of reads less than 50.
>>>
>>> or other steps?
>>>
>>> Any idea?
>>>
>>> Thanks again!
>>>
>>> Best
>>>
>>> Ming
>>>
>>>
>>>> Date: Wed, 29 Jan 2014 11:05:48 +1100
>>>> From: smyth at wehi.EDU.AU
>>>> To: yi02 at hotmail.com
>>>> CC: bioconductor at r-project.org
>>>> Subject: Re: questions on edgeR package
>>>>
>>>> Dear Ming,
>>>>
>>>> Something is seriously wrong -- you shouldn't get these warnings, you
>>>> shouldn't get such a large dispersion estimate and, if you do, you
>>>> shouldn't get such small p-values.
>>>>
>>>> I suspect that the culprit is RSEM. edgeR is designed to accept raw read
>>>> counts rather than estimated abundance levels as might be output from RSEM
>>>> or Cufflinks.
>>>>
>>>> There are a number of tools that produce genewise read counts from RNA-seq
>>>> data. My lab routinely uses subread and featureCounts:
>>>>
>>>> http://www.ncbi.nlm.nih.gov/pubmed/24227677
>>>>
>>>> which are available from the Bioconductor package Rsubread.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>> ---------------------------------------------
>>>> Professor Gordon K Smyth,
>>>> Bioinformatics Division,
>>>> Walter and Eliza Hall Institute of Medical Research,
>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>> http://www.statsci.org/smyth
>>>>
>>>> On Tue, 28 Jan 2014, Ming Yi wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a few questions on edgeR package. I used it a while ago.
>>>>> Recently, I started to use it again for an ongoing project and realized
>>>>> that it has been updated a lot with many new feaures.
>>>>>
>>>>> The data I am working on has 29 pairs of tumor and matched normal tissue
>>>>> samples from the same patients that have been run through RNAseq, the
>>>>> data has been processed by the popular RSEM method to be raw counts for
>>>>> each sample (tumor or macthed normal tissue of each patients). So I have
>>>>> a matrix of raw counts for 58 columns of data for all genes. we are
>>>>> looking for differential expressed genes between tumors and normals.
>>>>> Here are some main commands I used, eliminating some basic commands for
>>>>> simplicity:
>>>>>
>>>>>> show(head(targets));
>>>>> SampleName Subject Type
>>>>> 1 T4626_01A 38_4626 Tumor
>>>>> 2 N4626_11A 38_4626 Normal
>>>>> 3 T2668_01A 44_2668 Tumor
>>>>> 4 N2668_11A 44_2668 Normal
>>>>> ......
>>>>>> Patient<-factor(targets$Subject);
>>>>>> Type<-factor(targets$Type);
>>>>> ......
>>>>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]);
>>>>>> show(dim(y));
>>>>> [1] 20531 58
>>>>>
>>>>>> #filter low expressed genes
>>>>>> y<-y[rowSums(y$counts)>=50,]
>>>>>> #Recompute the library sizes:
>>>>>> y$samples$lib.size <- colSums(y$counts)
>>>>>> #Use Entrez Gene IDs as row names:
>>>>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID
>>>>>> ##Apply TMM normalization
>>>>>> y <- calcNormFactors(y,method=c("TMM"));
>>>>>> show(y$samples);
>>>>> group lib.size norm.factors
>>>>> T4626_01A 1 97435132 0.9834587
>>>>> N4626_11A 1 62583672 0.9154837
>>>>> T2668_01A 1 77034746 0.9687860
>>>>> N2668_11A 1 58631311 0.8644352
>>>>> ......
>>>>>> design<-model.matrix(~Patient+Type);
>>>>>> rownames(design) <- colnames(y);
>>>>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE);
>>>>> Disp = 3.99994 , BCV = 2
>>>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>>>> show(warnings());
>>>>> Warning messages:
>>>>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>>>> non-integer x = 4.520000
>>>>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>>>>> non-integer x = 5.380000
>>>>> ....
>>>>>> y <- estimateGLMTrendedDisp(y, design);
>>>>> Loading required package: splines
>>>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>>>> y <- estimateGLMTagwiseDisp(y, design)
>>>>> There were 50 or more warnings (use warnings() to see the first 50)
>>>>>> fit <- glmFit(y, design);
>>>>>> lrt <- glmLRT(fit);
>>>>>> show(topTags(lrt));
>>>>> Coefficient: TypeTumor
>>>>> symbols entrezID gene_id logFC logCPM LR PValue FDR
>>>>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 0
>>>>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 0
>>>>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 0
>>>>> .......
>>>>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue");
>>>>>> cpmRslt<-cpm(y)
>>>>>> cpmRslt<-cpmRslt[rownames(Rslt),];
>>>>>> show(cpmRslt[1:10,1:6]);
>>>>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A
>>>>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611
>>>>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056
>>>>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032
>>>>> .....
>>>>>> show(Rslt[1:10,]);
>>>>> Coefficient: TypeTumor
>>>>> symbols entrezID gene_id logFC logCPM LR PValue FDR
>>>>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 0
>>>>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 0
>>>>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 0
>>>>> .....
>>>>>> show(summary(de <- decideTestsDGE(lrt)));
>>>>> [,1]
>>>>> -1 6179
>>>>> 0 4832
>>>>> 1 7204
>>>>>
>>>>> My questions are below:
>>>>>
>>>>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower expressed using 50 as cutoff here,kind of arbitrary. is there any better way to assess the cutoff value? can any one suggest a good cutoff based on his or her experience in such setting?
>>>>>
>>>>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why we got warning here (see warning message as above inside my commands), any issue on the warnings?
>>>>>
>>>>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, for example, gene 6532 always have tumor samples consistently much lower in cpm counts compared to that of matched normal samples, which is consistent with the logFC result in Rslt. So from the top genes, the result looks very promising. However, when I look into the full table of Rslt, there are so many genes appear to be differential expressed (see result of summary(de <- decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in the dataset) in this table with FDR <0.05, which make me a bit nervous considering the issues as I asked in last two questions. Is this what was usually seen in tumor vs normal comparison. This is a paired test of course, which may increase the power and may see more differntial genes. what if I try pooled tumors vs pooled normals disregard of matched tumor vs normal sample pairs? which one shall be better for RNA-seq data, paired test vs pooled test?
>>>>>
>>>>> 4. For pooled test, I may use exactTest(), also based on the guide, there is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm model. anything like that worthy of trying? Any advice?
>>>>>
>>>>> 5. I did MDS plot, which looks nice in that the tumor and normal are
>>>>> separated very well in one side vs other. In the plotSmear plot, I did
>>>>> see some red spots inside the black spots zones in the center zone
>>>>> (supposed to be lower logFC), why they were called as significant? May
>>>>> this explain why we get so many DEGs >13k of them? Or large variations
>>>>> across samples?
>>>>>
>>>>> Thanks so much in advance for your help!
>>>>>
>>>>> Mike
>>>>> NIH
>>>>> Maryland, USA.
>>>>>
>>>>>> show(sessionInfo());
>>>>> R version 3.0.1 (2013-05-16)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>> attached base packages:
>>>>> [1] splines stats graphics grDevices utils datasets methods
>>>>> [8] base
>>>>> other attached packages:
>>>>> [1] edgeR_3.4.2 limma_3.16.8
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the addressee.
>> You must not disclose, forward, print or use it without the permission of the sender.
>> ______________________________________________________________________
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list