[BioC] DESeq analysis

Thu Jul 5 09:31:58 CEST 2012

Dear Fatemehsadat Seyednasrollah

this type of differences in the results may reflect the different 
approaches to estimating dispersion of the two packages and is not 
completely uncommon. I'd recommend to study the genes where the results 
differ in more detail, i.e. look at their data and possibly their 
biological roles.

If your samples are 1000 Genomes cell lines, I would expect large 
variability between them, and indeed not that much signal in a 4 against 
4 comparison.

	Best wishes
	Wolfgang

Jul/3/12 1:56 PM, Fatemehsadat Seyednasrollah scripsit::
> Hi,
> First thanks a lot for your answer.
> Actually I have used a subset of a public data from Bowtie(the Montgomery)(http://bowtie-bio.sourceforge.net/recount/)
>
>   and below are the reduced
> codes of my work both from edgeR and DEseq. I wanted to know have I done
> something wrong to obtain
> very different answers ( 85 from DESeq and 407 from edgeR) or it is natural
> to have this hude difference
> and it is related to the algorithms?
>
> edgeR:
>> g1 <- read.delim ("count1.txt", row.names = 1)
>> head(g1)
>                  NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F
> ENSG00000000003        0        0        0        0        1        0        0
> ENSG00000000005        0        0        0        0        0        0        0
> ENSG00000000419       10       24       19       20       19        8       14
> ENSG00000000457       17       15       13       18       21       18       21
> ENSG00000000460        2        3        5        2        4        6        8
> ENSG00000000938       20        4       35       16       10       17       19
>                  NA10847F
> ENSG00000000003        0
> ENSG00000000005        0
> ENSG00000000419        6
> ENSG00000000457       15
> ENSG00000000460        2
> ENSG00000000938        9
>> group <- factor(rep(c("Male", "Female"), each= 4))
>> dge <- DGEList(counts = g1 , group = group )
> Calculating library sizes from column totals.
>> dge <- calcNormFactors(dge)
>> dge <- estimateCommonDisp(dge)
>> sqrt (dge$common.dispersion)
> [1] 0.3858996
>> test <- exactTest(dge)
>> head(test$table)
>                       logFC    logCPM    PValue
> ENSG00000000003 -2.3441897 -3.042057 1.0000000
> ENSG00000000005  0.0000000      -Inf 1.0000000
> ENSG00000000419  0.5777309  3.850993 0.2791539
> ENSG00000000457 -0.3054489  4.080866 0.5592668
> ENSG00000000460 -0.7792622  1.966865 0.3274528
> ENSG00000000938  0.3909100  3.997866 0.4269672
>> sum (test$table$PValue <0.01)
> [1] 407
>
>
> DESeq:
>
>> g1 <- read.table("count1.txt", header = TRUE, row.names = 1)
>> conds <- factor(rep(c("Male", "Female"), each= 4))
>> dataPack <- data.frame(row.names = colnames(g1), condition =rep( c("Male", "Female"), each= 4))
>> dataPack
>           condition
> NA06994M      Male
> NA07051M      Male
> NA07347M      Male
> NA07357M      Male
> NA07000F    Female
> NA07037F    Female
> NA07346F    Female
> NA10847F    Female
>> cds <- newCountDataSet(g1, conds)
>> head(cds)
> CountDataSet (storageMode: environment)
> assayData: 1 features, 8 samples
>    element names: counts
> protocolData: none
> phenoData
>    sampleNames: NA06994M NA07051M ... NA10847F (8 total)
>    varLabels: sizeFactor condition
>    varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation:
>> head(counts(cds)
> + )
>                  NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F
> ENSG00000000003        0        0        0        0        1        0        0
> ENSG00000000005        0        0        0        0        0        0        0
> ENSG00000000419       10       24       19       20       19        8       14
> ENSG00000000457       17       15       13       18       21       18       21
> ENSG00000000460        2        3        5        2        4        6        8
> ENSG00000000938       20        4       35       16       10       17       19
>                  NA10847F
> ENSG00000000003        0
> ENSG00000000005        0
> ENSG00000000419        6
> ENSG00000000457       15
> ENSG00000000460        2
> ENSG00000000938        9
>> cds <- estimateSizeFactors(cds)
>> sizeFactors(cds)
>   NA06994M  NA07051M  NA07347M  NA07357M  NA07000F  NA07037F  NA07346F  NA10847F
> 0.8599841 1.1102643 1.0869086 1.1157556 1.1056726 1.0666049 0.9152017 0.9402086
>> head(counts(cds, normalized= TRUE))
>
>> cds <- estimateDispersions(cds)
>> result <- nbinomTest(cds, "Male", "Female")
>> nrow(subset(result, result$pval <0.01))
> [1] 85
>
> Again thank you so much
> With Best Regards,
> Narges_
>
> ####################################################################
>
> Dear Narges
>
> thank you for the feedback. Your second question is easy: use the idiom
>       res1 <- subset(res, padj<0.1)
> instead, this will avoid the creation of rows full of NA whenever
> res$padj is NA. Alternatively
>       res[order(res$padj)[1:n], ]
> with 'n' your favourite lucky number might be useful. Have a look at the
> R-intro manual for more on subsetting of arrays and dataframes in R.
>
> Your first question: can you show us the data for the genes where you
> know that they are differentially expressed? Perhaps then it might
> become more apparent why DESeq / nbinomtest did not agree. Also, what
> does the dispersion plot for cds look like? (This is the plot produced
> by plotDispEsts in the vignette).
>
> 	Best wishes
> 	Wolfgang
> #####################################################################
>
> narges [guest] scripsit 06/26/2012 06:17 PM:
>>
>> Hi all
>>
>> I am doing some RNA seq analysis with DESeq. I have applied the nbinomTest to my dataset which I know have many differentially expressed genes but the first problem is that the result values for "padj"column is almost NA and sometimes 1. and when I want to have a splice from my fata frame the result is not meaningful for me.
>>
>>    -- output of sessionInfo():
>>
>> res <- nbinomTest(cds, "Male", "Female")
>>
>>> head(res)
>>                  id   baseMean baseMeanA  baseMeanB foldChange log2FoldChange
>> 1 ENSG00000000003  0.1130534  0.000000  0.2261067        Inf            Inf
>> 2 ENSG00000000005  0.0000000  0.000000  0.0000000        NaN            NaN
>> 3 ENSG00000000419 14.3767155 17.162610 11.5908205  0.6753530     -0.5662863
>> 4 ENSG00000000457 17.0174761 15.342800 18.6921526  1.2183013      0.2848710
>> 5 ENSG00000000460  3.9414822  2.855099  5.0278659  1.7610131      0.8164056
>> 6 ENSG00000000938 16.0894945 18.350117 13.8288718  0.7536122     -0.4081058
>>          pval padj
>> 1 0.9959638    1
>> 2        NA   NA
>> 3 0.3208560    1
>> 4 0.5942512    1
>> 5 0.4840607    1
>> 6 0.5409953    1
>>
>>
>>> res1 <- res[res$padj<0.1,]
>>> head(res1)
>>          id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
>> NA   <NA>       NA        NA        NA         NA             NA   NA   NA
>> NA.1 <NA>       NA        NA        NA         NA             NA   NA   NA
>> NA.2 <NA>       NA        NA        NA         NA             NA   NA   NA
>> NA.3 <NA>       NA        NA        NA         NA             NA   NA   NA
>> NA.4 <NA>       NA        NA        NA         NA             NA   NA   NA
>> NA.5 <NA>       NA        NA        NA         NA             NA   NA   NA
>>
>> my first question is that why although I know there are some differentially expressed genes in the my data, all the padj values are NA or 1 and the second question is this "NA.1" , "NA.2", ..... which are emerged as the first column of object "res1"instead of name of genes
>>
>> Thank you so much
>> Regards
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>

-- 
Best wishes
	Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber