[BioC] DESeq - plotPCA

Zaki Fadlullah [guest] guest at bioconductor.org
Wed Mar 6 08:43:27 CET 2013

Hi mailing list,

I have a question regarding the plotPCA function in DESeq.

Looking into the plotPCA code, I realised that the PCA only takes into account 500 genes (ntop = 500; 500 is just an example, as this number can be adjusted). Am I correct in understanding that these 500 genes are the most variable genes?

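plotPCA = function(x, intgroup, ntop=500)
{
  # needs genefilter (rowVars), Biobase (exprs, pData), RColorBrewer and lattice
  # variance of each gene (row) across all samples
  rv = rowVars(exprs(x))
  # indices of the ntop genes with the highest variance
  select = order(rv, decreasing=TRUE)[seq_len(ntop)]
  # PCA with samples as observations and the selected genes as variables
  pca = prcomp(t(exprs(x)[select,]))

  # one label per sample, built from the pData columns named in intgroup
  fac = factor(apply(pData(x)[, intgroup, drop=FALSE], 1, paste, collapse=" : "))
  colours = brewer.pal(nlevels(fac), "Paired")

  pcafig = xyplot(PC2 ~ PC1, groups=fac, data=as.data.frame(pca$x), pch=16, cex=2,
    aspect = "iso", col=colours,
    main = draw.key(key = list(
      rect = list(col = colours),
      text = list(levels(fac)),
      rep = FALSE)))
  print(pcafig)
}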


Specifically, what is actually meant by "most variable genes", and why would one use the most variable genes in a PCA plot?
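To make the question concrete, here is a minimal sketch of how I read the selection step, using only base R on simulated data. The matrix `mat`, its dimensions, and the use of `apply(..., 1, var)` in place of `rowVars` are my own assumptions for illustration, not DESeq code:

```r
# Simulated expression matrix: 2000 genes (rows) x 6 samples (columns).
# All names and sizes here are illustrative assumptions.
set.seed(1)
mat <- matrix(rnorm(2000 * 6), nrow = 2000,
              dimnames = list(paste0("gene", 1:2000), paste0("sample", 1:6)))

# Per-gene variance across samples (base-R stand-in for genefilter::rowVars).
rv <- apply(mat, 1, var)

# Indices of the ntop genes with the largest variance across samples.
ntop <- 500
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]

# Every selected gene varies at least as much as every gene left out.
top_var_min <- min(rv[select])
rest_var_max <- max(rv[-select])
```

If this reading is right, "most variable" simply means the genes whose values differ most from sample to sample, regardless of which group the samples belong to.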

Would a valid conclusion be: if the samples cluster together in a PCA of the 500 most variable genes (as seen in the PCA plot [figure 17] in the DESeq vignette), our expression data is good, because they group together even on the most variable genes?

More generally (not DESeq-specific): if the purpose of doing a PCA is to get a general overview of the data, would it be best to do the PCA on all of the genes rather than on a subset (say 500)?
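One way I could imagine checking this is to run the PCA both ways and compare. The sketch below does that with base R on simulated data; the group labels, the size of the simulated group effect, and the helper `pve` are all assumptions of mine, so it only shows the mechanics and says nothing about real RNA-seq data:

```r
# Simulated data: 2000 genes x 6 samples, two groups of three samples.
# The first 100 genes get an artificial group effect (an assumption).
set.seed(42)
n_genes <- 2000
group <- rep(c("A", "B"), each = 3)
mat <- matrix(rnorm(n_genes * length(group)), nrow = n_genes)
mat[1:100, group == "B"] <- mat[1:100, group == "B"] + 4

# Select the 500 genes with the highest variance across samples.
rv <- apply(mat, 1, var)
select <- order(rv, decreasing = TRUE)[seq_len(500)]

# PCA with samples as observations: once on all genes, once on the subset.
pca_all <- prcomp(t(mat))
pca_top <- prcomp(t(mat[select, ]))

# Hypothetical helper: proportion of variance explained by each component.
pve <- function(p) p$sdev^2 / sum(p$sdev^2)
pve_all <- pve(pca_all)
pve_top <- pve(pca_top)

# Compare how much of the total variance PC1 captures in each case.
print(round(c(all_genes = pve_all[1], top_500 = pve_top[1]), 3))
```

Plotting `pca_all$x[, 1:2]` and `pca_top$x[, 1:2]` side by side, coloured by `group`, would show whether the two choices lead to a different picture of the samples.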

I would appreciate any insight into this matter, as I am new to R and RNA-seq.

Many thanks

 -- output of sessionInfo(): 

not relevant

Sent via the guest posting facility at bioconductor.org.
