[BioC] HEATMAP on LARGE DATA

Sean Davis sdavis2 at mail.nih.gov
Tue Mar 14 04:22:53 CET 2006




On 3/13/06 21:08, "mark salsburg" <mark.salsburg at gmail.com> wrote:

> I am having trouble getting the function heatmap() to work on the following
> gene expression data:
> 
>> dim(SAMPLES_log)
> [1] 12626    20
> 
> 
>            sample1 sample2...................sample20
> gen1
> gen2
> gen3
> ....
> gen12626
> 
> 
> 
> I have converted SAMPLES_log to a numeric matrix using:
> 
> as.matrix(SAMPLES_log)
> 
> when I use the following command:
> 
> heatmap(SAMPLES_log)
> 
> Error: cannot allocate vector of size 622668 Kb
> In addition: Warning messages:
> 1: Reached total allocation of 1022Mb: see help(memory.size)
> 2: Reached total allocation of 1022Mb: see help(memory.size)

Mark,

In order to do a heatmap on 12626 genes, heatmap() has to cluster the rows,
which requires a distance between every pair of genes: a triangular matrix
of 12626 x 12625 / 2, or about 80 million, values.  At 8 bytes per value
that is roughly 608 Mb, which matches the allocation that fails in your
error message.  I don't often find that clustering that many genes is
meaningful, particularly since you will be including a large number of genes
that do not vary across the samples.  If you really need to do this, I would
suggest that you use an external program like cluster/treeview, as they may
be somewhat less memory-hungry than R (but I haven't tested that directly).
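
As a rough check of that size estimate, the arithmetic can be done in R
itself.  This is only a back-of-the-envelope sketch; the variable names are
made up for illustration, and it assumes the pairwise distances are stored
as 8-byte doubles, one per pair of genes.

```r
# Number of genes (rows) reported by dim(SAMPLES_log)
n_genes <- 12626

# One distance per unordered pair of genes (lower triangle, no diagonal)
n_pairs <- n_genes * (n_genes - 1) / 2

# Assumed storage: 8 bytes per double, reported in Kb (1024 bytes)
size_kb <- n_pairs * 8 / 1024

n_pairs          # 79701625
floor(size_kb)   # 622668
```

The result lines up with the "cannot allocate vector of size 622668 Kb"
message in the quoted error.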

> Is there some library in Bioconductor that will allow me to output a
> heatmap? I want to compare the expression of the first 10 samples with the
> last 10 samples.

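If you want to do an unsupervised clustering of the samples only, just use
hclust() on the distances between samples (the columns of your matrix); that
needs only a 20x20 distance matrix.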
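
A minimal sketch of clustering the samples alone might look like the
following.  The matrix x here is simulated stand-in data (it is not the
poster's SAMPLES_log), and Euclidean distance with hclust()'s default
linkage is an assumption rather than a recommendation.

```r
set.seed(1)

# Simulated stand-in for an expression matrix: genes in rows, samples in columns
x <- matrix(rnorm(1000 * 20, mean = 8), nrow = 1000,
            dimnames = list(paste0("gen", 1:1000), paste0("sample", 1:20)))

# dist() works on rows, so transpose to get distances between samples
d <- dist(t(x))

# Hierarchical clustering of the 20 samples (default: complete linkage)
hc <- hclust(d)

# Dendrogram of the samples
plot(hc)
```

Because only the 20 columns are compared, the distance object holds just
190 values, regardless of how many genes are in the matrix.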

If you want to do an unsupervised clustering of samples AND genes, I would
suggest first reducing the number of genes with a filter for variability
(say, keeping the top 500 genes when sorted by coefficient of variation).
In other words, there is no need to include a gene in a heatmap if it is the
same in all samples.
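
One way such a filter could be sketched is shown below.  Again x is
simulated stand-in data, the cutoff of 500 is the example figure from the
text, and the coefficient of variation is only sensible here because the
simulated values have clearly positive means; for data centred near zero a
different measure (plain variance, for instance) may be a better choice.

```r
set.seed(1)

# Simulated stand-in for an expression matrix: 2000 genes x 20 samples
x <- matrix(rnorm(2000 * 20, mean = 8), nrow = 2000,
            dimnames = list(paste0("gen", 1:2000), paste0("sample", 1:20)))

# Coefficient of variation per gene: row standard deviation / row mean
cv <- apply(x, 1, sd) / rowMeans(x)

# Indices of the 500 most variable genes
top <- order(cv, decreasing = TRUE)[1:500]
x_top <- x[top, ]

# Clusters both genes and samples; now only a 500-gene distance matrix is needed
heatmap(x_top)
```

With 500 genes the gene-by-gene distance object has 124750 entries, which is
under 1 Mb of doubles.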

Ultimately, though, if you want to compare gene expression in two groups of
samples, you are asking a question that is best answered using a supervised
method, like a t-test.  There are numerous ways to do a t-test between two
groups including the limma, siggenes, and multtest packages.
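
For illustration only, a gene-by-gene comparison of the first 10 samples
against the last 10 can be sketched with base R's t.test().  This is a
stand-in and not the method used by limma, siggenes, or multtest; the data
are simulated, the group labels are made up, and the Benjamini-Hochberg
adjustment is an assumed choice for handling the many tests.

```r
set.seed(1)

# Simulated stand-in data: 200 genes x 20 samples
x <- matrix(rnorm(200 * 20, mean = 8), nrow = 200,
            dimnames = list(paste0("gen", 1:200), paste0("sample", 1:20)))

# Make the first 10 genes truly different in the last 10 samples
x[1:10, 11:20] <- x[1:10, 11:20] + 3

# Hypothetical group labels: first 10 samples vs. last 10 samples
group <- factor(rep(c("first10", "last10"), each = 10))

# Welch t-test per gene (row), keeping only the p-value
p <- apply(x, 1, function(r) {
  t.test(r[group == "first10"], r[group == "last10"])$p.value
})

# Adjust for multiple testing (Benjamini-Hochberg)
p_adj <- p.adjust(p, method = "BH")

# Genes with the smallest adjusted p-values
head(sort(p_adj))
```

The dedicated packages add moderated statistics or permutation-based
estimates on top of this basic idea, so they are usually the better choice
for real data.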

Hope that helps.

Sean


