[BioC] DESeq - plotPCA

Wed Mar 6 16:16:36 CET 2013

Hi Zaki

The DESeq vignette discusses two different kinds of clustering or 
ordination analysis, and you seem to have got them mixed up.

(i) Sample clustering: A commonly used quality assurance method is to 
apply ordination methods such as principal component analysis (PCA), 
multi-dimensional scaling (MDS) or hierarchical clustering (hclust) on 
the _samples_, to see whether samples with the same experimental 
treatment cluster together, and to check for batch effects. This is what 
Figs. 16 and 17 in the vignette are about.

(ii) Gene clustering: As a downstream analysis, it can be helpful to see 
whether _genes_ cluster together, to find groups of genes that react in 
a common manner to the different treatments of the samples.

For both applications, PCA, MDS or hclust can be applied to the 
variance-stabilized data, a matrix with one row per gene and one column 
per sample. Functions such as prcomp and dist (whose output then feeds 
isoMDS or hclust) treat the rows as the objects to be compared. The 
difference is simply that for (ii), the matrix is used as is, while 
for (i), it needs to be transposed first.
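
To make the transposition concrete, here is a minimal sketch in base R. 
It uses a simulated matrix as a stand-in for the variance-stabilized 
data; the name vsd_mat and its dimensions are made up for illustration, 
and in a real analysis the matrix would come from the 
variance-stabilizing step described in the vignette.

```r
# Simulated stand-in for a variance-stabilized matrix (hypothetical data):
# rows are genes, columns are samples.
set.seed(1)
vsd_mat <- matrix(rnorm(200 * 6), nrow = 200,
                  dimnames = list(paste0("gene", 1:200),
                                  paste0("sample", 1:6)))

# (i) Sample ordination: transpose so that each sample becomes a row.
pca_samples  <- prcomp(t(vsd_mat))
dist_samples <- dist(t(vsd_mat))
hc_samples   <- hclust(dist_samples)

# (ii) Gene ordination: use the matrix as is, so each gene is a row.
pca_genes  <- prcomp(vsd_mat)
dist_genes <- dist(vsd_mat)

dim(pca_samples$x)  # one row of scores per sample
dim(pca_genes$x)    # one row of scores per gene
```

The only difference between the two cases is the call to t(); 
plotting the first two columns of pca_samples$x gives a sample-level 
PCA plot.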

For (i), it does not make much difference whether you use all genes or 
only the highly variable ones, as genes with low variance across samples 
contribute little to the sample distances anyway and so have 
little influence on the result.

For (ii), it is common practice to subset to the most highly variable 
genes because many ordination or clustering methods do not cope well 
with a matrix with thousands of rows, and the genes with low variance 
are unlikely to be part of interesting clusters anyway.
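
As an illustration of this subsetting, the following sketch keeps the 
genes with the highest variance across samples and clusters only those. 
The data are simulated, and the names vsd_mat and n_top, the cutoff of 
100 genes and the choice of 4 clusters are arbitrary assumptions, not 
recommendations.

```r
# Simulated stand-in for a variance-stabilized matrix (hypothetical data):
# rows are genes, columns are samples.
set.seed(1)
vsd_mat <- matrix(rnorm(1000 * 6), nrow = 1000,
                  dimnames = list(paste0("gene", 1:1000),
                                  paste0("sample", 1:6)))

n_top <- 100  # arbitrary cutoff, chosen only for this example

# Variance of each gene (row) across samples.
gene_var <- apply(vsd_mat, 1, var)

# Row indices of the n_top most variable genes.
top_genes <- order(gene_var, decreasing = TRUE)[seq_len(n_top)]
vsd_top   <- vsd_mat[top_genes, ]

# Gene clustering on the reduced matrix; no transposition needed,
# because the genes are already the rows.
hc_genes      <- hclust(dist(vsd_top))
gene_clusters <- cutree(hc_genes, k = 4)  # k = 4 is arbitrary
table(gene_clusters)
```

With real data, the reduced matrix vsd_top could equally be passed to 
prcomp or to a heatmap function instead of hclust.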

   Simon