[BioC] DESeq analysis
Fatemehsadat Seyednasrollah
fatsey at utu.fi
Tue Jul 3 13:56:09 CEST 2012
Hi,
First thanks a lot for your answer.
Actually I have used a subset of a public data from Bowtie(the Montgomery)(http://bowtie-bio.sourceforge.net/recount/)
and below are the reduced
codes of my work both from edgeR and DEseq. I wanted to know have I done
something wrong to obtain
very different answers ( 85 from DESeq and 407 from edgeR) or it is natural
to have this hude difference
and it is related to the algorithms?
edgeR:
> g1 <- read.delim ("count1.txt", row.names = 1)
> head(g1)
NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F
ENSG00000000003 0 0 0 0 1 0 0
ENSG00000000005 0 0 0 0 0 0 0
ENSG00000000419 10 24 19 20 19 8 14
ENSG00000000457 17 15 13 18 21 18 21
ENSG00000000460 2 3 5 2 4 6 8
ENSG00000000938 20 4 35 16 10 17 19
NA10847F
ENSG00000000003 0
ENSG00000000005 0
ENSG00000000419 6
ENSG00000000457 15
ENSG00000000460 2
ENSG00000000938 9
> group <- factor(rep(c("Male", "Female"), each= 4))
> dge <- DGEList(counts = g1 , group = group )
Calculating library sizes from column totals.
> dge <- calcNormFactors(dge)
> dge <- estimateCommonDisp(dge)
> sqrt (dge$common.dispersion)
[1] 0.3858996
> test <- exactTest(dge)
> head(test$table)
logFC logCPM PValue
ENSG00000000003 -2.3441897 -3.042057 1.0000000
ENSG00000000005 0.0000000 -Inf 1.0000000
ENSG00000000419 0.5777309 3.850993 0.2791539
ENSG00000000457 -0.3054489 4.080866 0.5592668
ENSG00000000460 -0.7792622 1.966865 0.3274528
ENSG00000000938 0.3909100 3.997866 0.4269672
> sum (test$table$PValue <0.01)
[1] 407
DESeq:
> g1 <- read.table("count1.txt", header = TRUE, row.names = 1)
> conds <- factor(rep(c("Male", "Female"), each= 4))
> dataPack <- data.frame(row.names = colnames(g1), condition =rep( c("Male", "Female"), each= 4))
> dataPack
condition
NA06994M Male
NA07051M Male
NA07347M Male
NA07357M Male
NA07000F Female
NA07037F Female
NA07346F Female
NA10847F Female
> cds <- newCountDataSet(g1, conds)
> head(cds)
CountDataSet (storageMode: environment)
assayData: 1 features, 8 samples
element names: counts
protocolData: none
phenoData
sampleNames: NA06994M NA07051M ... NA10847F (8 total)
varLabels: sizeFactor condition
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
> head(counts(cds)
+ )
NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F
ENSG00000000003 0 0 0 0 1 0 0
ENSG00000000005 0 0 0 0 0 0 0
ENSG00000000419 10 24 19 20 19 8 14
ENSG00000000457 17 15 13 18 21 18 21
ENSG00000000460 2 3 5 2 4 6 8
ENSG00000000938 20 4 35 16 10 17 19
NA10847F
ENSG00000000003 0
ENSG00000000005 0
ENSG00000000419 6
ENSG00000000457 15
ENSG00000000460 2
ENSG00000000938 9
> cds <- estimateSizeFactors(cds)
> sizeFactors(cds)
NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F NA10847F
0.8599841 1.1102643 1.0869086 1.1157556 1.1056726 1.0666049 0.9152017 0.9402086
> head(counts(cds, normalized= TRUE))
> cds <- estimateDispersions(cds)
> result <- nbinomTest(cds, "Male", "Female")
> nrow(subset(result, result$pval <0.01))
[1] 85
Again thank you so much
With Best Regards,
Narges_
####################################################################
Dear Narges
thank you for the feedback. Your second question is easy: use the idiom
res1 <- subset(res, padj<0.1)
instead, this will avoid the creation of rows full of NA whenever
res$padj is NA. Alternatively
res[order(res$padj)[1:n], ]
with 'n' your favourite lucky number might be useful. Have a look at the
R-intro manual for more on subsetting of arrays and dataframes in R.
Your first question: can you show us the data for the genes where you
know that they are differentially expressed? Perhaps then it might
become more apparent why DESeq / nbinomtest did not agree. Also, what
does the dispersion plot for cds look like? (This is the plot produced
by plotDispEsts in the vignette).
Best wishes
Wolfgang
#####################################################################
narges [guest] scripsit 06/26/2012 06:17 PM:
>
> Hi all
>
> I am doing some RNA seq analysis with DESeq. I have applied the nbinomTest to my dataset which I know have many differentially expressed genes but the first problem is that the result values for "padj"column is almost NA and sometimes 1. and when I want to have a splice from my fata frame the result is not meaningful for me.
>
> -- output of sessionInfo():
>
> res <- nbinomTest(cds, "Male", "Female")
>
>> head(res)
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange
> 1 ENSG00000000003 0.1130534 0.000000 0.2261067 Inf Inf
> 2 ENSG00000000005 0.0000000 0.000000 0.0000000 NaN NaN
> 3 ENSG00000000419 14.3767155 17.162610 11.5908205 0.6753530 -0.5662863
> 4 ENSG00000000457 17.0174761 15.342800 18.6921526 1.2183013 0.2848710
> 5 ENSG00000000460 3.9414822 2.855099 5.0278659 1.7610131 0.8164056
> 6 ENSG00000000938 16.0894945 18.350117 13.8288718 0.7536122 -0.4081058
> pval padj
> 1 0.9959638 1
> 2 NA NA
> 3 0.3208560 1
> 4 0.5942512 1
> 5 0.4840607 1
> 6 0.5409953 1
>
>
>> res1 <- res[res$padj<0.1,]
>> head(res1)
> id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> NA <NA> NA NA NA NA NA NA NA
> NA.1 <NA> NA NA NA NA NA NA NA
> NA.2 <NA> NA NA NA NA NA NA NA
> NA.3 <NA> NA NA NA NA NA NA NA
> NA.4 <NA> NA NA NA NA NA NA NA
> NA.5 <NA> NA NA NA NA NA NA NA
>
> my first question is that why although I know there are some differentially expressed genes in the my data, all the padj values are NA or 1 and the second question is this "NA.1" , "NA.2", ..... which are emerged as the first column of object "res1"instead of name of genes
>
> Thank you so much
> Regards
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list