[BioC] DESeq - plotPCA

James W. MacDonald jmacdon at uw.edu
Wed Mar 6 15:52:08 CET 2013

Hi Zaki,

On 3/6/2013 2:43 AM, Zaki Fadlullah [guest] wrote:
> Hi mailing list,
> I have a question regarding the plotPCA function in DESeq.
> Looking into the plotPCA code I realised that the PCA function takes into account the 500 genes (ntop = 500 ,500 is just for an example, as this number can be adjusted). Am I correct in understanding that this 500 genes are the most variable genes??
> plotPCA = function(x, intgroup, ntop=500)
> {
>    require("lattice")
>    require("genefilter")
>    rv = rowVars(exprs(x))
>    select = order(rv, decreasing=TRUE)[seq_len(ntop)]
>    pca = prcomp(t(exprs(x)[select,]))
>    fac = factor(apply(pData(vsdFull)[, intgroup], 1, paste, collapse=" : "))
>    colours = brewer.pal(nlevels(fac), "Paired")
>    pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2,
>      aspect = "iso", col=colours,
>      main = draw.key(key = list(
>        rect = list(col = colours),
>        text = list(levels(fac)),
>        rep = FALSE)))
> }
> ---
> Specifically what is actually meant by most variable genes?? and why would one use variable genes it in PCA plot??

It's the genes that show the highest variability across samples. The 
relevant part of the code is

rv = rowVars(exprs(x))
select = order(rv, decreasing=TRUE)[seq_len(ntop)]
pca = prcomp(t(exprs(x)[select,]))

where rowVars computes row-wise variances.

> Would a conclusion be is - If the 500 most variable gene cluster together (as seen from PCA plot [figure 17] in the DESeq vignttes), it means our expression data is good?? ... because even the most variable genes do group together??
> More generally (not DESeq specific)...If the purpose of doing a PCA is to get a general overview on the data. Would it be best to do a PCA on all of the genes rather than a subset (say 500)?

Not necessarily. PCA is a linear transformation that is intended to 
capture the greatest variability between samples. So the 500 most 
variable genes will have a larger effect on the result than the next 500 
genes, etc. It's a tradeoff between computational efficiency and getting 
the most information out of your data.

You can try it yourself by increasing the ntop argument and seeing how 
the resulting PCA plot changes. In most situations I wouldn't expect 
much difference between a PCA plot of the top 500 genes and all of them.

Anyway, I think you miss the point of PCA. It isn't to see if genes 
cluster together, it is intended as a way to see if samples of the same 
type cluster together. The clustering pattern can tell you things about 
your samples, like if a particular sample is quite different from others 
of the same type, or if there is a batch effect.



> Appreciate any insight into this matter as I am new in R and RNA-seq
> Many thanks
> Zaki
>   -- output of sessionInfo():
> not relevant
> --
> Sent via the guest posting facility at bioconductor.org.
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

More information about the Bioconductor mailing list