[BioC] clustering RNA-Seq data and performing gene set enrichment analysis

Thu Apr 12 23:21:52 CEST 2012

1) Is RNA-Seq data even appropriate for "standard" cluster analysis, given its
discrete (count) nature?  What normalization should be done beforehand?  We
usually apply gene-length and TMM normalization to our data.
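
To make the kind of pre-clustering transformation I mean concrete, here is a
minimal sketch in Python (for illustration only, not our actual pipeline).  It
computes a log2 RPKM-style value per gene.  The function name `log_rpkm`, the
pseudocount `prior`, and the use of the raw library size in place of a
TMM-scaled effective library size are all assumptions of this sketch.

```python
import math

def log_rpkm(counts, lengths_bp, lib_size=None, prior=0.5):
    """Return log2 RPKM-like values for one sample.

    counts     -- read counts per gene
    lengths_bp -- gene lengths in base pairs, same order as counts
    lib_size   -- total reads; defaults to sum(counts). A TMM-scaled
                  effective library size could be passed here instead
                  (assumption: this sketch does not compute TMM itself).
    prior      -- hypothetical pseudocount to avoid log2(0)
    """
    if lib_size is None:
        lib_size = sum(counts)
    values = []
    for count, length in zip(counts, lengths_bp):
        # reads per kilobase of gene per million reads in the library
        rpkm = (count + prior) / (lib_size / 1e6) / (length / 1e3)
        values.append(math.log2(rpkm))
    return values
```

The log transform is there only to make the values more comparable across
expression levels before computing distances; whether that is adequate for
count data is exactly what I am asking.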

2) Suppose we cluster RNA-Seq data, take a gene list from a cluster (e.g. all
genes in that cluster), and then want to perform gene set enrichment analysis
on this list.  Is Fisher's Exact Test by itself ok, or do we need to account
for gene length (e.g. use GOSeq)?  I know that RNA-Seq data has the bias that
longer genes are more often called differentially expressed because of
increased statistical power.  As I understand it, the reasoning is: longer
genes --> more reads --> lower relative variance --> higher power to detect
differences.  I am wondering whether this difference in variance between long
and short genes would also affect the results of clustering.
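
For reference, the unadjusted test I have in mind is the one-sided Fisher's
Exact Test, i.e. the upper tail of a hypergeometric distribution.  A minimal
Python sketch follows; the function name `enrichment_pvalue` and its argument
names are hypothetical, and it deliberately ignores gene length, which is the
point of the question.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's Exact Test p-value for over-representation.

    N -- number of genes in the background
    K -- number of background genes annotated to the category
    n -- number of genes in the cluster
    k -- number of cluster genes annotated to the category

    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    Every gene is treated as equally likely to land in the cluster,
    so no correction for gene length is applied (assumption).
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible tables contribute nothing to the tail
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

If cluster membership itself depends on gene length, the equal-probability
assumption in this test would be violated, which is why I am asking whether a
length-aware method is needed here.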

Thanks,
-Julie