Hi Natasha,

I have been working on RNA-seq data from small sample numbers of small cell
numbers. It turned out the similar results you got for the BCV plot. In my
situation, there was an extra PCR step during processing the samples. I was
speculating if this step of PCR amplification might increase the variation
of highly expressed genes, thus gave me the strange BCV results, increasing
variation with higher CPM. Just a possible hint for this.


Best,

Wenhuo

On Wed, Sep 12, 2012 at 9:43 AM, Natasha Sahgal <nsahgal@well.ox.ac.uk>wrote:

> Dear Jim,
>
> Regarding the BCV plots, what I did not understand was the strange profile
> (at least strange to me), and the low coefficients of BV.
> Based on some figures from the user guide, it appeared to be very
> different - an increase towards the higher logCPM.
> 1) I'm not sure how to interpret these and if it is a good thing or not?
> (perhaps I have misunderstood the concept of the BCV)
> 2) How does this affect the differential expression, if at all?
>
> Re the filtering, for some reason I was under the impression increasing
> the counts per million would reduce (if not remove) zero counts in all
> samples. I agree with what you say about half the samples being
> unconstrained.
>
> I had 3 filters here, just to see what the difference would be. I still
> need to figure out the best or optimal one to use.
>
>
> Many Thanks and Best Wishes,
> Natasha
>
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon@uw.edu]
> Sent: 12 September 2012 14:29
> To: Natasha Sahgal
> Cc: 'bioconductor@r-project.org'
> Subject: Re: [BioC] edgeR: confusing BCV plot
>
> Hi Natasha,
>
> On 9/12/2012 5:05 AM, Natasha Sahgal wrote:
> > Dear List,
> >
> > Sorry for any cross-posting.
> >
> > I have an RNA-Seq expt for which I'd like to use edgeR, as it is
> multifactorial in design.
> > The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated),
> n=2 in each group.
> > My aim: to detect DE genes based on the effect of stimulus on ko cells.
> >
> > However, I do not understand the output from the BCV plot after
> estimating dispersion.
> > Thus would appreciate any help/advice/suggestions.
>
> What about the output do you not understand?
>
> >
> > Also, would appreciate comments on the filtering step! As, it appears to
> me that I still have some genes with 0 read counts (as seen under the
> normalisation section).
>
> You didn't filter in such a way to remove all genes with zero counts. In
> fact you didn't say anything about zero counts - instead what you did was
> to require at least four samples to have more than 1 or 2 or 3 counts per
> million. The other four samples were unconstrained, and could easily have
> had zero counts.
>
> Note that this is a reasonable thing to do. You are looking for genes
> where one sample had more transcripts than the other sample. This includes
> the situation where one of the samples appears not to transcribe the gene
> at all.
>
> Best,
>
> Jim
>
>
> >
> > ------------------------------------
> > Code:
> > dim(gene.counts.2) # 33607     8
> > ## Sample Descriptions
> > group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2)))
> >
> > ## Creating dge list
> > y = DGEList(counts=gene.counts.2, group=group)
> >
> > ## Filtering
> > keep = rowSums(cpm(y)>1)>= 4
> > table(keep)
> > #FALSE  TRUE
> > #17678 15929
> > keep2 = rowSums(cpm(y)>2)>= 4
> > table(keep2)
> > #FALSE  TRUE
> > #19300 14307
> > keep3 = rowSums(cpm(y)>3)>= 4
> > table(keep3)
> > #FALSE  TRUE
> > #20229 13378
> >
> > y.filt = y[keep, ]
> > dim(y.filt$counts)  # 15929     8
> >
> > y.filt2 = y[keep2, ]
> > dim(y.filt2$counts) # 14307     8
> >
> > y.filt3 = y[keep3, ]
> > dim(y.filt3$counts) # 13378     8
> >
> > ## Recalculate lib.size
> > y.filt$samples$lib.size = colSums(y.filt$counts)
> > y.filt2$samples$lib.size = colSums(y.filt2$counts)
> > y.filt3$samples$lib.size = colSums(y.filt3$counts)
> >
> > ## Normalisation
> > y.filt = calcNormFactors(y.filt)
> >
> > range(y.filt$counts[,1]) #  0 159659
> > range(y.filt$counts[,2]) #  0 155390
> > range(y.filt$counts[,3]) #  0 122249
> > range(y.filt$counts[,4]) #  0 137046
> > range(y.filt$counts[,5]) #  0 206528
> > range(y.filt$counts[,6]) #  0 222176
> > range(y.filt$counts[,7]) #  0 192333
> > range(y.filt$counts[,8]) #  0 229413
> >
> > y.filt2 = calcNormFactors(y.filt2)
> >
> > range(y.filt2$counts[,1]) #  0 159659
> > range(y.filt2$counts[,2]) #  0 155390
> > range(y.filt2$counts[,3]) #  0 122249
> > range(y.filt2$counts[,4]) #  0 137046
> > range(y.filt2$counts[,5]) #  0 206528
> > range(y.filt2$counts[,6]) #  0 222176
> > range(y.filt2$counts[,7]) #  0 192333
> > range(y.filt2$counts[,8]) #  0 229413
> >
> > y.filt3 = calcNormFactors(y.filt3)
> >
> > range(y.filt3$counts[,1]) #  0 159659
> > range(y.filt3$counts[,2]) #  0 155390
> > range(y.filt3$counts[,3]) #  0 122249
> > range(y.filt3$counts[,4]) #  0 137046
> > range(y.filt3$counts[,5]) #  0 206528
> > range(y.filt3$counts[,6]) #  0 222176
> > range(y.filt3$counts[,7]) #  0 192333
> > range(y.filt3$counts[,8]) #  0 229413
> >
> > ## MDS plots
> > plotMDS(y.filt, main="cpm(y)>1")
> > plotMDS(y.filt2, main="cpm(y)>2")
> > plotMDS(y.filt3, main="cpm(y)>3")
> >
> > ## Design Matrix
> > design = model.matrix(~0+group)
> > colnames(design) = gsub("group","",colnames(design)) design #  KO
> KO_stim WT WT_stim
> > #1  1       0  0       0
> > #2  1       0  0       0
> > #3  0       1  0       0
> > #4  0       1  0       0
> > #5  0       0  1       0
> > #6  0       0  1       0
> > #7  0       0  0       1
> > #8  0       0  0       1
> >
> > ## Estimating Dispersion
> > y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp =
> > 0.0276 , BCV = 0.1661
> > y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp =
> > 0.02711 , BCV = 0.1646
> > y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp =
> > 0.02665 , BCV = 0.1632
> >
> >
> > y.filt = estimateGLMTrendedDisp(y.filt,design)
> > y.filt2 = estimateGLMTrendedDisp(y.filt2,design)
> > y.filt3 = estimateGLMTrendedDisp(y.filt3,design)
> >
> > y.filt = estimateGLMTagwiseDisp(y.filt,design)
> > y.filt2 = estimateGLMTagwiseDisp(y.filt2,design)
> > y.filt3 = estimateGLMTagwiseDisp(y.filt3,design)
> >
> > jpeg("BCVplots.jpg",height=500,width=900)
> > par(mfrow=c(1,3))
> > plotBCV(y.filt, main="cpm(y)>1")
> > plotBCV(y.filt2, main="cpm(y)>2")
> > plotBCV(y.filt3, main="cpm(y)>3")
> > dev.off()
> > (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg)
> >
> > ### NOT RUN this section
> > ## Fit Model
> > fit = glmFit(y.filt, design)
> > cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design)
> > lrt = glmLRT(y.filt, fit, contrast=as.vector(cont))
> > ------------------------------------
> >
> > sessionInfo()
> > R version 2.15.0 (2012-03-30)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> >
> > locale:
> >   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
> >   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
> >   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
> >   [7] LC_PAPER=C                 LC_NAME=C
> >   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] splines   stats     graphics  grDevices utils     datasets  methods
> > [8] base
> >
> > other attached packages:
> > [1] scatterplot3d_0.3-33 DESeq_1.8.3          locfit_1.5-7
> > [4] Biobase_2.16.0       BiocGenerics_0.2.0   WriteXLS_2.1.0
> > [7] edgeR_2.6.10         limma_3.12.0
> >
> > loaded via a namespace (and not attached):
> >   [1] annotate_1.34.0      AnnotationDbi_1.18.0 DBI_0.2-5
> >   [4] genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.0
> >   [7] IRanges_1.14.2       lattice_0.20-6       RColorBrewer_1.0-5
> > [10] RSQLite_0.11.1       stats4_2.15.0        survival_2.36-14
> > [13] tools_2.15.0         xtable_1.7-0
> > ------------------------------------
> >
> > Many Thanks,
> > Natasha
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 


Wenhuo Hu
Park lab

Memorial Sloan Kettering Cancer Center
Zuckerman Research Building
408 East 69th Street
Room ZRC-527
New York, NY 10065
Phone 646-888-3220
huw@mskcc.org

	[[alternative HTML version deleted]]