[BioC] DESeq analysis
Wolfgang Huber
whuber at embl.de
Thu Jul 5 09:31:58 CEST 2012
Dear Fatemehsadat Seyednasrollah
this type of differences in the results may reflect the different
approaches to estimating dispersion of the two packages and is not
completely uncommon. I'd recommend to study the genes where the results
differ in more detail, i.e. look at their data and possibly their
biological roles.
If your samples are 1000 Genomes cell lines, I would expect large
variability between them, and indeed not that much signal in a 4 against
4 comparison.
Best wishes
Wolfgang
Jul/3/12 1:56 PM, Fatemehsadat Seyednasrollah scripsit::
> Hi,
> First thanks a lot for your answer.
> Actually I have used a subset of a public data from Bowtie(the Montgomery)(http://bowtie-bio.sourceforge.net/recount/)
>
> and below are the reduced
> codes of my work both from edgeR and DEseq. I wanted to know have I done
> something wrong to obtain
> very different answers ( 85 from DESeq and 407 from edgeR) or it is natural
> to have this hude difference
> and it is related to the algorithms?
>
> edgeR:
>> g1 <- read.delim ("count1.txt", row.names = 1)
>> head(g1)
> NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F
> ENSG00000000003 0 0 0 0 1 0 0
> ENSG00000000005 0 0 0 0 0 0 0
> ENSG00000000419 10 24 19 20 19 8 14
> ENSG00000000457 17 15 13 18 21 18 21
> ENSG00000000460 2 3 5 2 4 6 8
> ENSG00000000938 20 4 35 16 10 17 19
> NA10847F
> ENSG00000000003 0
> ENSG00000000005 0
> ENSG00000000419 6
> ENSG00000000457 15
> ENSG00000000460 2
> ENSG00000000938 9
>> group <- factor(rep(c("Male", "Female"), each= 4))
>> dge <- DGEList(counts = g1 , group = group )
> Calculating library sizes from column totals.
>> dge <- calcNormFactors(dge)
>> dge <- estimateCommonDisp(dge)
>> sqrt (dge$common.dispersion)
> [1] 0.3858996
>> test <- exactTest(dge)
>> head(test$table)
> logFC logCPM PValue
> ENSG00000000003 -2.3441897 -3.042057 1.0000000
> ENSG00000000005 0.0000000 -Inf 1.0000000
> ENSG00000000419 0.5777309 3.850993 0.2791539
> ENSG00000000457 -0.3054489 4.080866 0.5592668
> ENSG00000000460 -0.7792622 1.966865 0.3274528
> ENSG00000000938 0.3909100 3.997866 0.4269672
>> sum (test$table$PValue <0.01)
> [1] 407
>
>
> DESeq:
>
>> g1 <- read.table("count1.txt", header = TRUE, row.names = 1)
>> conds <- factor(rep(c("Male", "Female"), each= 4))
>> dataPack <- data.frame(row.names = colnames(g1), condition =rep( c("Male", "Female"), each= 4))
>> dataPack
> condition
> NA06994M Male
> NA07051M Male
> NA07347M Male
> NA07357M Male
> NA07000F Female
> NA07037F Female
> NA07346F Female
> NA10847F Female
>> cds <- newCountDataSet(g1, conds)
>> head(cds)
> CountDataSet (storageMode: environment)
> assayData: 1 features, 8 samples
> element names: counts
> protocolData: none
> phenoData
> sampleNames: NA06994M NA07051M ... NA10847F (8 total)
> varLabels: sizeFactor condition
> varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation:
>> head(counts(cds)
> + )
> NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F
> ENSG00000000003 0 0 0 0 1 0 0
> ENSG00000000005 0 0 0 0 0 0 0
> ENSG00000000419 10 24 19 20 19 8 14
> ENSG00000000457 17 15 13 18 21 18 21
> ENSG00000000460 2 3 5 2 4 6 8
> ENSG00000000938 20 4 35 16 10 17 19
> NA10847F
> ENSG00000000003 0
> ENSG00000000005 0
> ENSG00000000419 6
> ENSG00000000457 15
> ENSG00000000460 2
> ENSG00000000938 9
>> cds <- estimateSizeFactors(cds)
>> sizeFactors(cds)
> NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F NA10847F
> 0.8599841 1.1102643 1.0869086 1.1157556 1.1056726 1.0666049 0.9152017 0.9402086
>> head(counts(cds, normalized= TRUE))
>
>> cds <- estimateDispersions(cds)
>> result <- nbinomTest(cds, "Male", "Female")
>> nrow(subset(result, result$pval <0.01))
> [1] 85
>
> Again thank you so much
> With Best Regards,
> Narges_
>
> ####################################################################
>
> Dear Narges
>
> thank you for the feedback. Your second question is easy: use the idiom
> res1 <- subset(res, padj<0.1)
> instead, this will avoid the creation of rows full of NA whenever
> res$padj is NA. Alternatively
> res[order(res$padj)[1:n], ]
> with 'n' your favourite lucky number might be useful. Have a look at the
> R-intro manual for more on subsetting of arrays and dataframes in R.
>
> Your first question: can you show us the data for the genes where you
> know that they are differentially expressed? Perhaps then it might
> become more apparent why DESeq / nbinomtest did not agree. Also, what
> does the dispersion plot for cds look like? (This is the plot produced
> by plotDispEsts in the vignette).
>
> Best wishes
> Wolfgang
> #####################################################################
>
> narges [guest] scripsit 06/26/2012 06:17 PM:
>>
>> Hi all
>>
>> I am doing some RNA seq analysis with DESeq. I have applied the nbinomTest to my dataset which I know have many differentially expressed genes but the first problem is that the result values for "padj"column is almost NA and sometimes 1. and when I want to have a splice from my fata frame the result is not meaningful for me.
>>
>> -- output of sessionInfo():
>>
>> res <- nbinomTest(cds, "Male", "Female")
>>
>>> head(res)
>> id baseMean baseMeanA baseMeanB foldChange log2FoldChange
>> 1 ENSG00000000003 0.1130534 0.000000 0.2261067 Inf Inf
>> 2 ENSG00000000005 0.0000000 0.000000 0.0000000 NaN NaN
>> 3 ENSG00000000419 14.3767155 17.162610 11.5908205 0.6753530 -0.5662863
>> 4 ENSG00000000457 17.0174761 15.342800 18.6921526 1.2183013 0.2848710
>> 5 ENSG00000000460 3.9414822 2.855099 5.0278659 1.7610131 0.8164056
>> 6 ENSG00000000938 16.0894945 18.350117 13.8288718 0.7536122 -0.4081058
>> pval padj
>> 1 0.9959638 1
>> 2 NA NA
>> 3 0.3208560 1
>> 4 0.5942512 1
>> 5 0.4840607 1
>> 6 0.5409953 1
>>
>>
>>> res1 <- res[res$padj<0.1,]
>>> head(res1)
>> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
>> NA <NA> NA NA NA NA NA NA NA
>> NA.1 <NA> NA NA NA NA NA NA NA
>> NA.2 <NA> NA NA NA NA NA NA NA
>> NA.3 <NA> NA NA NA NA NA NA NA
>> NA.4 <NA> NA NA NA NA NA NA NA
>> NA.5 <NA> NA NA NA NA NA NA NA
>>
>> my first question is that why although I know there are some differentially expressed genes in the my data, all the padj values are NA or 1 and the second question is this "NA.1" , "NA.2", ..... which are emerged as the first column of object "res1"instead of name of genes
>>
>> Thank you so much
>> Regards
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list