[R] cor() alternative for huge data set

Jyotasana Gulati jgulati at ice.mpg.de
Thu Sep 30 14:05:09 CEST 2010


Peter, many thanks for suggesting this package. I very much believe that it will help me. However, I was trying to correlate all probes (correlation between entities, not variables) to find differentially coexpressed gene sets using the coXpress package in R. I could not reduce the number of probes on the basis of intensity, since most of the genes are down-regulated or up-regulated in the treated conditions; they are therefore of interest to me and cannot be removed from the control samples (since I have to compare both).

Can you also suggest an alternative for differential coexpression analysis? This is what I most need to know: which gene sets behave differently across conditions.

Has anyone ever used the coXpress package?

Regards
..
Jyotasana 
----- Original Message -----
From: "Peter Langfelder" <peter.langfelder at gmail.com>
To: "Jyotasana Gulati" <jgulati at ice.mpg.de>
Cc: r-help at r-project.org
Sent: Thursday, September 30, 2010 4:05:44 AM
Subject: Re: [R] cor() alternative for huge data set

On Wed, Sep 29, 2010 at 1:27 PM, Jyotasana Gulati <jgulati at ice.mpg.de> wrote:
> Hi,
>
> I have a data set of around 43000 probes (rows) and have to calculate a correlation matrix. When I run the cor function in R, it throws an error about running out of RAM, which is to be expected for such a huge number of rows. I cannot find a logical way to cut down this huge number of entities. Is there an alternative to Pearson correlation, or another distance calculation via dist() (e.g. Euclidean), that can be run on such a huge data set?
> Any help will be appreciated.

Hmm... Are you calculating a correlation of 43000 probes, or of some
number of samples across 43000 probes? If the former, read below. If
the latter, I'm surprised you are running out of memory. Issuing
garbage collection (gc()) before the calculation, closing all other
programs, removing all other large objects from the R workspace etc.
may help.

If you really need the 43k times 43k correlation matrix of your 43k
probes, read on.
[Disclosure: this is a shameless plug for the package WGCNA (Weighted
Gene Co-expression Network Analysis, also known as Weighted
Correlation Network Analysis), from the package author, namely me.]

First, since the distance matrix will be just as huge, you will not gain
anything by using other distance methods either.
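
As a rough back-of-the-envelope sketch of why this is so (assuming
dense double-precision storage at 8 bytes per entry, and ignoring any
temporary copies R may make), the sizes involved can be estimated as
follows. The variable names are only illustrative.

```r
# Memory needed for a dense n x n matrix of doubles (8 bytes each).
n <- 43000
bytes_full <- n^2 * 8
gib_full <- bytes_full / 1024^3

# A "dist" object stores only the lower triangle: n * (n - 1) / 2 values.
bytes_dist <- n * (n - 1) / 2 * 8
gib_dist <- bytes_dist / 1024^3

cat(sprintf("full matrix: %.1f GiB, dist object: %.1f GiB\n",
            gib_full, gib_dist))
```

The full correlation matrix comes to close to 14 GiB, and the
lower-triangle storage used by dist() only halves that, so switching
the distance measure does not change the order of magnitude.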

Second, depending on what you want to do with the 43k probes, the
package WGCNA may help you. It has methods for creating correlation
networks among a large number of probes. The idea is to pre-cluster
the probes using what I call projective K-means, function
projectiveKMeans. The pre-clustering will return what we call blocks
of probes (or genes). We assume (this is a big assumption) that
correlations among probes belonging to different blocks can be
neglected. Then we treat each block separately for network
construction (or, in your case, possibly simple calculation of
correlation).
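
The block idea can be sketched in base R as below. This is only an
illustration under stated assumptions: stats::kmeans() on standardized
probe profiles stands in for the pre-clustering step and is not the
projective K-means implemented in WGCNA; the toy data, sizes and number
of blocks are all hypothetical.

```r
# Block-wise correlation sketch. Rows are probes, columns are samples.
# NOTE: stats::kmeans() is only a stand-in for the pre-clustering step;
# it is NOT the projective K-means of the WGCNA package.
set.seed(1)
nProbes <- 600                      # toy size (hypothetical)
nSamples <- 20                      # toy size (hypothetical)
expr <- matrix(rnorm(nProbes * nSamples), nrow = nProbes,
               dimnames = list(paste0("probe", seq_len(nProbes)), NULL))

# Standardize each probe profile so that Euclidean distance between two
# profiles is a decreasing function of their Pearson correlation.
exprScaled <- t(scale(t(expr)))

nBlocks <- 4                        # hypothetical choice
blocks <- kmeans(exprScaled, centers = nBlocks, nstart = 5)$cluster

# One correlation matrix per block. Correlations between probes in
# different blocks are never computed, which is where the memory is saved.
blockCor <- lapply(split(seq_len(nProbes), blocks), function(idx) {
  cor(t(expr[idx, , drop = FALSE]))
})

# Dimensions of the per-block correlation matrices.
sapply(blockCor, dim)
```

With B blocks of roughly equal size, the total storage drops from about
n^2 entries to about n^2 / B, at the price of the assumption that
cross-block correlations are negligible.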

Although this isn't strictly an R topic but rather a microarray analysis
issue, in my experience it is often useful to filter out probes before
actually calculating and interpreting large correlation matrices. In
conjunction with filtering, it can be advantageous to keep only one
probe per gene (presumably there is more than one probe per gene in
your data set). The filtering criterion varies from analysis to
analysis, but if your data represent intensities, it is often a good
idea to throw away probes whose intensity is always low, because such
signals are mostly noise.
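
A minimal sketch of such a filter in base R follows. Everything
specific in it is an assumption made for illustration: the simulated
intensities, the cutoff of 150, the probe-to-gene map, and the rule of
keeping the probe with the highest mean intensity as the gene's
representative.

```r
# Filtering sketch. Rows are probes, columns are samples.
set.seed(2)
nProbes <- 1000
expr <- matrix(rexp(nProbes * 12, rate = 1 / 100), nrow = nProbes,
               dimnames = list(paste0("probe", seq_len(nProbes)), NULL))
# Hypothetical probe-to-gene annotation with several probes per gene.
gene <- sample(paste0("gene", 1:400), nProbes, replace = TRUE)

# Step 1: drop probes whose intensity is low in every sample.
cutoff <- 150                                   # hypothetical threshold
keep <- apply(expr, 1, max) > cutoff
exprF <- expr[keep, , drop = FALSE]
geneF <- gene[keep]

# Step 2: keep one probe per gene. The choice of representative here
# (highest mean intensity) is an assumption, not a fixed rule.
meanInt <- rowMeans(exprF)
best <- as.integer(tapply(seq_along(geneF), geneF,
                          function(i) i[which.max(meanInt[i])]))
exprOne <- exprF[best, , drop = FALSE]

dim(exprOne)
```

Because the filter uses the maximum over all samples, a probe that is
high in any one sample, control or treated, is retained; only probes
that are low everywhere are dropped.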

If you decide to check out WGCNA, look at
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/.

Peter


