[BioC] DESeq - plotPCA
James W. MacDonald
jmacdon at uw.edu
Wed Mar 6 15:52:08 CET 2013
On 3/6/2013 2:43 AM, Zaki Fadlullah [guest] wrote:
> Hi mailing list,
> I have a question regarding the plotPCA function in DESeq.
> Looking into the plotPCA code I realised that the PCA function takes into account the 500 genes (ntop = 500 ,500 is just for an example, as this number can be adjusted). Am I correct in understanding that this 500 genes are the most variable genes??
> plotPCA = function(x, intgroup, ntop=500)
> rv = rowVars(exprs(x))
> select = order(rv, decreasing=TRUE)[seq_len(ntop)]
> pca = prcomp(t(exprs(x)[select,]))
> fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : "))
> colours = brewer.pal(nlevels(fac), "Paired")
> pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2,
> aspect = "iso", col=colours,
> main = draw.key(key = list(
> rect = list(col = colours),
> text = list(levels(fac)),
> rep = FALSE)))
> Specifically what is actually meant by most variable genes?? and why would one use variable genes it in PCA plot??
It's the genes that show the highest variability across samples. The
relevant part of the code is
rv = rowVars(exprs(x))
select = order(rv, decreasing=TRUE)[seq_len(ntop)]
pca = prcomp(t(exprs(x)[select,]))
where rowVars computes row-wise variances.
> Would a conclusion be is - If the 500 most variable gene cluster together (as seen from PCA plot [figure 17] in the DESeq vignttes), it means our expression data is good?? ... because even the most variable genes do group together??
> More generally (not DESeq specific)...If the purpose of doing a PCA is to get a general overview on the data. Would it be best to do a PCA on all of the genes rather than a subset (say 500)?
Not necessarily. PCA is a linear transformation that is intended to
capture the greatest variability between samples. So the 500 most
variable genes will have a larger effect on the result than the next 500
genes, etc. It's a tradeoff between computational efficiency and getting
the most information out of your data.
You can try it yourself by increasing the ntop argument and seeing how
the resulting PCA plot changes. In most situations I wouldn't expect
much difference between a PCA plot of the top 500 genes and all of them.
Anyway, I think you miss the point of PCA. It isn't to see if genes
cluster together, it is intended as a way to see if samples of the same
type cluster together. The clustering pattern can tell you things about
your samples, like if a particular sample is quite different from others
of the same type, or if there is a batch effect.
> Appreciate any insight into this matter as I am new in R and RNA-seq
> Many thanks
> -- output of sessionInfo():
> not relevant
> Sent via the guest posting facility at bioconductor.org.
> Bioconductor mailing list
> Bioconductor at r-project.org
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor