[BioC] DESeq2 Regularised Log for Clustering of Genes

Fri May 9 14:19:37 CEST 2014

Hi Dario

On Wed, May 7, 2014 at 11:00 PM, Dario Strbenac wrote:

>> As section 5.3 of the vignette explains, the transformed data can
>> be used for applications like clustering of samples. I was
>> considering the best way to use it instead for clustering genes of
>> a time-series experiment. I would have to account for gene length
>> to make different genes comparable.

Actually, no. I don't think accounting for gene length is necessary.

It depends on your distance metric: Do you want to consider two genes as 
similar (and hence would want them to cluster together) if they have 
similar absolute expression strength, or rather if they have a similar 
profile of _changes_ during the time course?

I would expect the latter to be more helpful for analysing time-course
data, so you should get biologically more meaningful clusters if you
normalize each gene's expression by its expression strength at time 0.
On the natural scale, this means dividing by the time-0 (or control)
value; on the log scale, it means subtracting it. In either case, gene
length cancels out.
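
To illustrate why gene length drops out, here is a minimal sketch in
plain Python (not DESeq2 code). It assumes, purely for illustration,
that a gene's expected count is proportional to its length times a
per-base expression level; the profile values, the two gene lengths and
the helper name log2_fold_changes_vs_time0 are all made up.

```python
import math

def log2_fold_changes_vs_time0(values):
    """Subtract the time-0 log2 value from the log2 value at every time point."""
    base = math.log2(values[0])
    return [math.log2(v) - base for v in values]

# Hypothetical per-base expression profile over four time points,
# shared by two genes (assumption for illustration only).
profile = [10.0, 20.0, 40.0, 20.0]

# Hypothetical gene lengths; counts are assumed proportional to length.
length_short, length_long = 500, 5000

counts_short = [length_short * x for x in profile]
counts_long = [length_long * x for x in profile]

# The length factor appears in both numerator and denominator of each
# ratio, so it cancels: both genes get the same log fold change profile.
lfc_short = log2_fold_changes_vs_time0(counts_short)
lfc_long = log2_fold_changes_vs_time0(counts_long)
```

The two genes differ tenfold in absolute counts, yet their profiles
relative to time 0 agree up to floating-point error, so a distance
computed on these values would treat them as (nearly) identical.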

This also means that, for a design with replicates or with factors
besides time point, it might be preferable not to use DESeq2's rlog
transform, but rather to use DESeq2's normal workflow to estimate
shrunken log fold changes for contrasts of each later time point
against time zero, and then to perform clustering on these values.
(Thinking about it, we should maybe consider adding a section to the
vignette to demonstrate this approach.)
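
As a rough sketch of the clustering step only, the following plain
Python groups genes by their log fold change profiles. It does not call
DESeq2: the numbers in lfc are invented stand-ins for the shrunken log2
fold changes you would export from it, and the Euclidean distance, the
single-linkage grouping and the 0.5 threshold are my assumptions, not a
recommendation.

```python
import math

# Hypothetical shrunken log2 fold changes (later time points vs. time 0),
# one entry per gene. In practice these would be exported from DESeq2.
lfc = {
    "geneA": [1.0, 2.0, 1.0],
    "geneB": [1.1, 1.9, 0.9],
    "geneC": [-1.0, -2.0, -1.5],
    "geneD": [-0.9, -2.1, -1.4],
}

def euclidean(p, q):
    """Euclidean distance between two fold-change profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_by_threshold(profiles, max_dist):
    """Single-linkage grouping: two genes share a cluster if a chain of
    pairwise distances <= max_dist connects them."""
    clusters = []
    for gene, profile in profiles.items():
        # Existing clusters containing at least one gene close to this one.
        linked = [
            c for c in clusters
            if any(euclidean(profile, profiles[g]) <= max_dist for g in c)
        ]
        merged = {gene}
        for c in linked:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters

clusters = cluster_by_threshold(lfc, max_dist=0.5)
```

With these made-up values, geneA and geneB end up in one cluster and
geneC and geneD in another. For real data you would more likely use an
established clustering routine; the point here is only that the input
is the matrix of fold changes against time zero, not absolute
expression.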

   Simon