[BioC] edgeR: confusing BCV plot
James W. MacDonald
jmacdon at uw.edu
Wed Sep 12 15:29:08 CEST 2012
Hi Natasha,
On 9/12/2012 5:05 AM, Natasha Sahgal wrote:
> Dear List,
>
> Sorry for any cross-posting.
>
> I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design.
> The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), n=2 in each group.
> My aim: to detect DE genes based on the effect of stimulus on ko cells.
>
> However, I do not understand the output from the BCV plot after estimating dispersion.
> Thus would appreciate any help/advice/suggestions.
What about the output do you not understand?
>
> Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section).
You didn't filter in such a way to remove all genes with zero counts. In
fact you didn't say anything about zero counts - instead what you did
was to require at least four samples to have more than 1 or 2 or 3
counts per million. The other four samples were unconstrained, and could
easily have had zero counts.
Note that this is a reasonable thing to do. You are looking for genes
where one sample had more transcripts than the other sample. This
includes the situation where one of the samples appears not to
transcribe the gene at all.
Best,
Jim
>
> ------------------------------------
> Code:
> dim(gene.counts.2) # 33607 8
> ## Sample Descriptions
> group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2)))
>
> ## Creating dge list
> y = DGEList(counts=gene.counts.2, group=group)
>
> ## Filtering
> keep = rowSums(cpm(y)>1)>= 4
> table(keep)
> #FALSE TRUE
> #17678 15929
> keep2 = rowSums(cpm(y)>2)>= 4
> table(keep2)
> #FALSE TRUE
> #19300 14307
> keep3 = rowSums(cpm(y)>3)>= 4
> table(keep3)
> #FALSE TRUE
> #20229 13378
>
> y.filt = y[keep, ]
> dim(y.filt$counts) # 15929 8
>
> y.filt2 = y[keep2, ]
> dim(y.filt2$counts) # 14307 8
>
> y.filt3 = y[keep3, ]
> dim(y.filt3$counts) # 13378 8
>
> ## Recalculate lib.size
> y.filt$samples$lib.size = colSums(y.filt$counts) y.filt2$samples$lib.size = colSums(y.filt2$counts) y.filt3$samples$lib.size = colSums(y.filt3$counts)
>
> ## Normalisation
> y.filt = calcNormFactors(y.filt)
>
> range(y.filt$counts[,1]) # 0 159659
> range(y.filt$counts[,2]) # 0 155390
> range(y.filt$counts[,3]) # 0 122249
> range(y.filt$counts[,4]) # 0 137046
> range(y.filt$counts[,5]) # 0 206528
> range(y.filt$counts[,6]) # 0 222176
> range(y.filt$counts[,7]) # 0 192333
> range(y.filt$counts[,8]) # 0 229413
>
> y.filt2 = calcNormFactors(y.filt2)
>
> range(y.filt2$counts[,1]) # 0 159659
> range(y.filt2$counts[,2]) # 0 155390
> range(y.filt2$counts[,3]) # 0 122249
> range(y.filt2$counts[,4]) # 0 137046
> range(y.filt2$counts[,5]) # 0 206528
> range(y.filt2$counts[,6]) # 0 222176
> range(y.filt2$counts[,7]) # 0 192333
> range(y.filt2$counts[,8]) # 0 229413
>
> y.filt3 = calcNormFactors(y.filt3)
>
> range(y.filt3$counts[,1]) # 0 159659
> range(y.filt3$counts[,2]) # 0 155390
> range(y.filt3$counts[,3]) # 0 122249
> range(y.filt3$counts[,4]) # 0 137046
> range(y.filt3$counts[,5]) # 0 206528
> range(y.filt3$counts[,6]) # 0 222176
> range(y.filt3$counts[,7]) # 0 192333
> range(y.filt3$counts[,8]) # 0 229413
>
> ## MDS plots
> plotMDS(y.filt, main="cpm(y)>1")
> plotMDS(y.filt2, main="cpm(y)>2")
> plotMDS(y.filt3, main="cpm(y)>3")
>
> ## Design Matrix
> design = model.matrix(~0+group)
> colnames(design) = gsub("group","",colnames(design)) design # KO KO_stim WT WT_stim
> #1 1 0 0 0
> #2 1 0 0 0
> #3 0 1 0 0
> #4 0 1 0 0
> #5 0 0 1 0
> #6 0 0 1 0
> #7 0 0 0 1
> #8 0 0 0 1
>
> ## Estimating Dispersion
> y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 0.0276 , BCV = 0.1661
> y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 0.02711 , BCV = 0.1646
> y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 0.02665 , BCV = 0.1632
>
>
> y.filt = estimateGLMTrendedDisp(y.filt,design)
> y.filt2 = estimateGLMTrendedDisp(y.filt2,design)
> y.filt3 = estimateGLMTrendedDisp(y.filt3,design)
>
> y.filt = estimateGLMTagwiseDisp(y.filt,design)
> y.filt2 = estimateGLMTagwiseDisp(y.filt2,design)
> y.filt3 = estimateGLMTagwiseDisp(y.filt3,design)
>
> jpeg("BCVplots.jpg",height=500,width=900)
> par(mfrow=c(1,3))
> plotBCV(y.filt, main="cpm(y)>1")
> plotBCV(y.filt2, main="cpm(y)>2")
> plotBCV(y.filt3, main="cpm(y)>3")
> dev.off()
> (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg)
>
> ### NOT RUN this section
> ## Fit Model
> fit = glmFit(y.filt, design)
> cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design)
> lrt = glmLRT(y.filt, fit, contrast=as.vector(cont))
> ------------------------------------
>
> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7
> [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0
> [7] edgeR_2.6.10 limma_3.12.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5
> [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0
> [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5
> [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14
> [13] tools_2.15.0 xtable_1.7-0
> ------------------------------------
>
> Many Thanks,
> Natasha
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list