[BioC] DESeq variance stabilisation and clustering
Simon Anders
anders at embl.de
Wed Mar 23 11:10:20 CET 2011
Hi Timothy
On 03/23/2011 10:47 AM, Timothy Hughes wrote:
> We wish to perform clustering on expression data and therefore are
> interested in the variance-stabilizing transformation of DESeq. I understand
> what the purpose of the transformation is namely to produce values whose
> variances are approximately the same, but why is it necessary to do this
> when computing the distance between two values? Or put another way, in what
> way does hierarchical clustering make assumptions about similar variances?
>
> I believe I have the answer, but it would be nice if someone could confirm
> this.
>
> When doing clustering one is often effectively trying to minimize the
> variance within a cluster even if this is not explicitly defined. If we
> consider that the observations being clustered are random variables with a
> variance then we should explicitly account for this variance and use a
> variance stabilising transformation. This avoids the need for trying to
> account for the variance in the clustering process.
[...]
When talking about clustering, it is important to be clear about what
you are clustering: samples or genes?
In the DESeq vignette, I am clustering samples, i.e., I want to see
which samples are similar to each other, hoping to find that replicate
samples appear more similar than samples from different conditions. For
this, I need a measure of distance between samples. To compare two
samples, one usually takes the two vectors with the expression values of
all genes in the respective samples and calculates the distance between
these vectors. If one uses Euclidean distance, one calculates, for each
gene, the difference in expression between the two samples, squares all
these differences, adds up the squares and takes the square root.
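As a minimal sketch of the Euclidean distance calculation just
described (plain Python, not DESeq code; the function name and the
expression values are made up for illustration):

```python
import math

def euclidean_distance(sample_a, sample_b):
    """Euclidean distance between two samples given as per-gene expression vectors."""
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples must cover the same genes")
    # Per gene: difference between the samples, squared; then sum and square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sample_a, sample_b)))

# Hypothetical expression values for three genes in two samples.
sample_1 = [10.0, 200.0, 3.0]
sample_2 = [13.0, 196.0, 3.0]
distance = euclidean_distance(sample_1, sample_2)
print(distance)
```

Each gene enters the sum with its squared difference, which is why the
scale of the individual genes matters so much.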
You want all genes to have roughly equal influence on the distance, and
for this, all genes should have roughly equal variance. If you use raw
counts, the ten or so most strongly expressed genes have so much more
variance than the rest that all the other genes have hardly any
influence. DESeq's VST rectifies this.
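To illustrate the point with toy numbers, here is a sketch (plain
Python) of how much one strong gene contributes to the squared distance
between two samples. Everything in it is an assumption made for the
example: the counts are invented, the noise is taken to be one Poisson
standard deviation per gene, and the square root is used as a stand-in
because it is the textbook variance stabiliser for Poisson counts. It
is not DESeq's VST, which is derived from the fitted dispersions.

```python
import math

# Hypothetical mean counts for a weak, a medium and a strong gene.
means = [4.0, 100.0, 10000.0]

# Sample A sits at the mean; sample B is one Poisson standard deviation
# (square root of the mean) above it for every gene.
sample_a = list(means)
sample_b = [m + math.sqrt(m) for m in means]

def squared_contributions(x, y):
    """Per-gene squared differences, i.e. the terms summed inside a Euclidean distance."""
    return [(a - b) ** 2 for a, b in zip(x, y)]

def share_of_last(contributions):
    """Fraction of the squared distance that comes from the last (strongest) gene."""
    return contributions[-1] / sum(contributions)

raw_share = share_of_last(squared_contributions(sample_a, sample_b))

# Stand-in transformation (assumption): square root, not DESeq's VST.
scaled_a = [math.sqrt(v) for v in sample_a]
scaled_b = [math.sqrt(v) for v in sample_b]
stabilised_share = share_of_last(squared_contributions(scaled_a, scaled_b))

print(f"strong gene's share, raw counts:  {raw_share:.2f}")
print(f"strong gene's share, sqrt-scaled: {stabilised_share:.2f}")
```

On the raw counts the strong gene supplies nearly all of the squared
distance; after the transformation the three genes contribute in
similar proportions.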
So, my motivation to add the VST to DESeq was to give the user a
way to calculate distances between samples.
You seem to be talking about clustering genes, not samples, however. I
had not yet thought about this application, but I think your explanation
goes in the right direction.
As strong genes have strong variance in all samples, all samples will
contribute equally to any measure of distance between two genes. So, we
don't have the issue I just discussed, namely that different components
influencing the distance have unequal weight. However, the variance of
the distance measure itself is now vastly different between weak and
strong genes. Two strong genes which actually behave similarly will not
cluster together because their large values amplify the noise
contributions to the distance, while two weak genes will always have a
small distance because their small expression values also make their
distance appear small. Again, the VST changes the scales such that
typical distances (as differences, not ratios) between genes become
independent of overall expression strength.
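The gene-wise case can be sketched the same way (plain Python, toy
numbers). The assumptions are again mine: each pair consists of two
genes with identical true expression in four samples, pushed one
Poisson standard deviation in opposite directions, and the square root
stands in for a variance-stabilising transformation. It is not DESeq's
VST.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two genes given as per-sample expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def noisy_pair(mean, signs):
    """Two genes with the same true expression `mean` in every sample,
    pushed one Poisson standard deviation in opposite directions."""
    sd = math.sqrt(mean)
    gene_x = [mean + s * sd for s in signs]
    gene_y = [mean - s * sd for s in signs]
    return gene_x, gene_y

def sqrt_scale(values):
    """Stand-in transformation (assumption): square root, not DESeq's VST."""
    return [math.sqrt(v) for v in values]

signs = [1, -1, 1, -1]  # hypothetical noise pattern over four samples
strong_x, strong_y = noisy_pair(10000.0, signs)
weak_x, weak_y = noisy_pair(4.0, signs)

# Distances on the raw scale: the strong pair looks far apart, the weak pair close.
raw_strong = euclidean_distance(strong_x, strong_y)
raw_weak = euclidean_distance(weak_x, weak_y)

# Distances after the stand-in transformation: both pairs end up on a similar scale.
scaled_strong = euclidean_distance(sqrt_scale(strong_x), sqrt_scale(strong_y))
scaled_weak = euclidean_distance(sqrt_scale(weak_x), sqrt_scale(weak_y))

print(raw_strong, raw_weak)
print(scaled_strong, scaled_weak)
```

Both pairs are equally "similar" by construction, yet on the raw scale
the strong pair is many times further apart than the weak pair. After
the transformation the two distances are of comparable size.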
Simon