[BioC] DESeq - plotPCA
Simon Anders
anders at embl.de
Wed Mar 6 16:16:36 CET 2013
Hi Zaki
The DESeq vignette discusses two different kinds of clustering or
ordination analysis, and you seem to have got them mixed up.
(i) Sample clustering: A commonly used quality assurance method is to
apply ordination methods such as principal component analysis (PCA),
multi-dimensional scaling (MDS) or hierarchical clustering (hclust) on
the _samples_, to see whether samples with the same experimental
treatment cluster together, and to check for batch effects. This is what
Figs. 16 and 17 in the vignette are about.
(ii) Gene clustering: As a downstream analysis, it can be helpful to see
whether _genes_ cluster together, to find groups of genes that react in
a common manner to the different treatments of the samples.
For both applications, PCA, MDS or hclust can be applied to the
variance-stabilized data. The difference is simply that for (ii), the
function (prcomp, isoMDS, dist, ...) is applied to the
variance-stabilized data matrix as is, while for (i), the matrix needs
to be transposed first. The reason is that these functions treat the
rows as the objects to compare, and the data matrix has genes in rows
and samples in columns.
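To make the role of the transpose concrete, here is a minimal sketch in
plain Python (not R, and not DESeq code). The matrix values and the
helper names transpose and row_distances are made up for illustration;
the only point is that a function which compares rows gives gene
distances on the matrix as is and sample distances on its transpose.

```python
import math

# Toy stand-in for a variance-stabilized matrix (values are invented):
# rows are genes, columns are samples.
vsd = [
    [5.0, 5.1, 9.0, 9.2],  # gene A
    [7.0, 7.2, 3.0, 3.1],  # gene B
    [6.0, 6.0, 6.1, 6.0],  # gene C
]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

def row_distances(matrix):
    """Euclidean distance between every pair of rows."""
    return [[math.dist(a, b) for b in matrix] for a in matrix]

# (ii) gene clustering: use the matrix as is, one entry per gene pair.
gene_dist = row_distances(vsd)

# (i) sample clustering: transpose first, one entry per sample pair.
sample_dist = row_distances(transpose(vsd))

print(len(gene_dist), "x", len(gene_dist[0]))      # genes x genes
print(len(sample_dist), "x", len(sample_dist[0]))  # samples x samples
```

In this toy data, samples 1 and 2 end up much closer to each other than
either is to sample 3, which is the kind of grouping one would inspect
in (i).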
For (i), it does not make much difference whether you use all data or
only highly variable genes, as genes with low variance across samples
provide only little information on sample distances anyway and so have
little influence on the result.
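The argument can be checked with a little arithmetic. The squared
Euclidean distance between two samples is a sum of one term per gene, so
a gene that barely varies across samples adds almost nothing. The sketch
below is plain Python with invented values and assumes Euclidean
distance, which is only one possible choice.

```python
# Toy variance-stabilized values (invented): rows are genes, columns
# are samples. Gene C is nearly constant across samples.
vsd = [
    [5.0, 5.1, 9.0, 9.2],  # gene A, high variance
    [7.0, 7.2, 3.0, 3.1],  # gene B, high variance
    [6.0, 6.0, 6.1, 6.0],  # gene C, low variance
]

i, j = 0, 2  # compare the first and the third sample

# Per-gene contribution to the squared Euclidean distance.
contrib = [(row[i] - row[j]) ** 2 for row in vsd]
total = sum(contrib)

# Fraction of the squared distance that each gene accounts for.
share = [c / total for c in contrib]

for name, s in zip(["gene A", "gene B", "gene C"], share):
    print(name, round(s, 4))
```

Here the low-variance gene accounts for well under one percent of the
squared distance, so dropping it would hardly move the samples relative
to each other.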
For (ii), it is common practice to subset to the most highly variable
genes because many ordination or clustering methods do not cope well
with a matrix with thousands of rows, and the genes with low variance
are unlikely to be part of interesting clusters anyway.
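A minimal sketch of such a subsetting step, again in plain Python rather
than R. The function name top_variable_genes, the gene names and the
cutoff n are hypothetical; how many genes to keep is a judgment call
that this sketch does not settle.

```python
from statistics import variance

def top_variable_genes(matrix, names, n):
    """Keep the n rows (genes) with the highest variance across samples."""
    ranked = sorted(range(len(matrix)),
                    key=lambda g: variance(matrix[g]),
                    reverse=True)
    keep = sorted(ranked[:n])  # restore the original row order
    return [names[g] for g in keep], [matrix[g] for g in keep]

# Toy variance-stabilized values (invented): rows are genes.
names = ["geneC", "geneA", "geneD", "geneB"]
vsd = [
    [6.0, 6.0, 6.1, 6.0],  # geneC, nearly constant
    [5.0, 5.1, 9.0, 9.2],  # geneA, strongly varying
    [4.0, 4.5, 5.0, 4.4],  # geneD, mildly varying
    [7.0, 7.2, 3.0, 3.1],  # geneB, strongly varying
]

kept_names, kept_rows = top_variable_genes(vsd, names, n=2)
print(kept_names)
```

The reduced matrix kept_rows still has genes in rows, so it can be
passed as is to whatever gene-level clustering is used in (ii).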
Simon