[BioC] R help with large matrices

Ian Roberts ir210 at cam.ac.uk
Wed Apr 9 15:08:43 CEST 2008


>That said, you may want to describe what you are trying to do ...
>... In particular, there are numerous CGH array packages.

We use snapCGH - the output of which is segmented data with a call status. 

To automate the determination of common regions of CNI and also minimum
regions of CNI across the whole sample set, additional functions are
required.

The approach we've taken is to systematically look at the call state of each
clone across the sample set, iterating over clones from a starting point to
the end of the chromosome for each sample. The score matrix provides the
index numbers of the start and stop clones for each genomic region that
needs to be scored. In essence:
   1 2 3 4 5
+1 2 3 4 5 6
+2 3 4 5 6 7
+3 4 5 6 7 8
+4 5 6 7 8 9

The matrix column headers are clone index numbers along the length of a
particular chromosome, and the rows give increments for the length of the
genomic region being assayed. Hence, the matrix is looped through (col x
row) and the coordinates used to retrieve the call states of the samples
contained therein.  That's how I'm currently generating my 'all
permutations' index coordinates for the calls comparison, and I think this
is the bit that is bringing me down - it's too memory intensive.

A second large matrix function records the outcome of the comparison of call
states between the index points of the score matrix.  In fact, there are two
result matrix lists (one for gain and one for loss): resultGain[[1]] stores
the binary outcome, while resultGainP[[1]] records the percentage of samples
that produced that outcome.
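For illustration only, here is a hedged sketch of how a per-region gain call
and percentage might be derived (the helper name, the 0.5 threshold, and the
coding of gain as 1 are assumptions, not our actual implementation):

```r
## Illustrative: for a region (clones x samples matrix of call states,
## gain assumed coded as 1), return a binary gain call and the percentage
## of samples showing gain across the whole region.
record_gain <- function(region, threshold = 0.5) {
  ## fraction of samples in which every clone in the region is gained
  frac <- mean(apply(region == 1, 2, all))
  list(call = as.integer(frac >= threshold), pct = 100 * frac)
}
```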

>Unless you have a pretty big machine, you will probably not be able to
>fit a 19,300 x 19,300 member matrix into memory.

The work is being undertaken on CamGrid.
http://www.escience.cam.ac.uk/projects/camgrid/

Thanks for any suggestions!
Ian


-----Original Message-----
From: seandavi at gmail.com [mailto:seandavi at gmail.com] On Behalf Of Sean Davis
Sent: 09 April 2008 13:41
To: Ian Roberts
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] R help with large matrices

On Wed, Apr 9, 2008 at 6:31 AM, Ian Roberts <ir210 at cam.ac.uk> wrote:
> Dear All,
>
>  I'm having memory problems producing large matrices and wonder if there
>  is a better way to do what I need?  I'm a novice R user, so here's what
>  I've done.
>
>  I need to permute a scoring matrix up to 19,300 events.
>  That is, I have a vector of length 19,300 results and need to compare
>  each with each other for all possible permutations therein, of the type
>  N!/(N-n)!

Unless you have a pretty big machine, you will probably not be able to
fit a 19,300 x 19,300 member matrix into memory.  Why not use a random
sampling of a set size (say, 1000 events)?  sample() is the function
that chooses random samples.
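For example, to draw 1,000 of the 19,300 event indices without replacement:

```r
set.seed(1)                 # make the draw reproducible
idx <- sample(19300, 1000)  # sampling is without replacement by default
length(idx)                 # 1000 distinct indices
```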

That said, you may want to describe what you are trying to do, rather
than asking how to do it.  There may be a bioconductor package that
already answers the question you are trying to answer.  In particular,
there are numerous CGH array packages.

Sean

>  I've tried writing my own, and the permutations function of package
>  gtools; however, both run into trouble with vectors in excess of 1000
>  events.
>
>  Essentially, the score matrix provides start and stop clone numbers for
>  an ordered gene list.  It works well for BAC arrays, but is failing for
>  Agilent 244K oligo arrays!!!
