[Bioc-sig-seq] (no subject) Changed: Overall directions

Mon Mar 3 21:22:13 CET 2008

Hi Stephen -- putting this back 'on-list' so everyone can participate;
sorry if this is not as intended...

> By front end I really mean R wrappers not GUIs. It sounds like a
> great idea to be able to do as much of my work as possible from
> within R. From what I have seen of SOAP, the minimum memory that
> programs need for alignment of Illumina data is ca. 16 GB of RAM.
> The alignment problem is very threadable, but whilst this may speed
> processing I do not think it reduces the memory requirement, as the
> lookup tables have to be stored as a single entity (as far as I
> understand).

> http://soap.genomics.org.cn/

> Incidentally SOAP alignment looks like it might integrate easily with R and is
> competitive for time and alignment with MAQ and Eland.

Yes, developers should keep the data structure in mind, and the
opportunities for partitioning tasks. It seems like many operations
(though perhaps not alignments) can treat reads as independent of one
another, and this represents a natural way of dividing big tasks into
more memory-efficient and distributed ones. Chromosomal structure also
provides a natural way to think of how operations might be distributed
across processors.
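
As a rough sketch of the partitioning idea (in Python purely for illustration; the record layout and the function names `partition_by_chromosome` and `summarize` are hypothetical, not from any existing package), alignments can be grouped by chromosome so that each group is processed without reference to the others:

```python
from collections import defaultdict


def partition_by_chromosome(alignments):
    """Group (chromosome, position) records so each group can be handled alone."""
    groups = defaultdict(list)
    for chrom, pos in alignments:
        groups[chrom].append(pos)
    return dict(groups)


def summarize(positions):
    """Per-chromosome work that needs no data from any other chromosome."""
    return {"n": len(positions), "min": min(positions), "max": max(positions)}


# Hypothetical toy alignments: (chromosome, start position).
alignments = [("chr1", 100), ("chr2", 50), ("chr1", 250), ("chr2", 75), ("chr1", 10)]

groups = partition_by_chromosome(alignments)
# Each call below is independent of the others, so in principle it could
# be handed to a separate process or machine.
results = {chrom: summarize(pos) for chrom, pos in groups.items()}
```

Only one chromosome's records need to be in memory while its summary is computed, which is the point of the partitioning.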

A basic transformation seems to be from really very big data
(sequences) to just moderately big (e.g., alignments and scores;
apparent SNP polymorphisms; ...).

> An 'expressionSet like' object of pre-aligned or unassembled data which was
> stored in R would be a list of length ca. 40 million with strings of ca. 25-30
> characters (single end). An output of alignment from SOAP would be a table of
> about 11 columns (with original seq, QC, chr, positions, flags etc.), and a bit
> fewer than 40 million rows (if not all reads are aligned).

> Is the 'expressionSet like' object really going to store this? Or
> will it be a reference to a database or external file? I guess as
> you say you don't necessarily need to store all of this but can
> sample a lot of it for QC, plotting and analysis?

The main conceptual insights of the ExpressionSet are an association
of phenotype with 'data', and the abstraction of how the data is
represented internally from how the user interacts with it.
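
To make those two ideas concrete, here is a minimal sketch in Python (not R, and not the actual ExpressionSet design; `ReadSet`, `InMemoryReads` and `FileBackedReads` are hypothetical names). The container ties per-sample phenotype to the reads, and the caller uses the same method whether the reads live in memory or on disk:

```python
import os
import tempfile


class InMemoryReads:
    """Reads held in an ordinary list."""

    def __init__(self, reads):
        self._reads = list(reads)

    def __iter__(self):
        return iter(self._reads)


class FileBackedReads:
    """Reads stored one per line on disk and streamed on demand."""

    def __init__(self, path):
        self._path = path

    def __iter__(self):
        with open(self._path) as fh:
            for line in fh:
                yield line.rstrip("\n")


class ReadSet:
    """Hypothetical container: phenotype per sample plus reads, storage hidden."""

    def __init__(self, phenotype, reads_by_sample):
        if set(phenotype) != set(reads_by_sample):
            raise ValueError("phenotype and data must describe the same samples")
        self.phenotype = phenotype
        self._reads = reads_by_sample

    def read_count(self, sample):
        # Works for any backend that can be iterated over.
        return sum(1 for _ in self._reads[sample])


# Toy data: one sample in memory, one in a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("ACGT\nGGCA\nTTAG\n")
    path = fh.name

rs = ReadSet(
    phenotype={"s1": {"tissue": "tumour"}, "s2": {"tissue": "normal"}},
    reads_by_sample={"s1": InMemoryReads(["ACGT", "TTTT"]),
                     "s2": FileBackedReads(path)},
)
counts = {sample: rs.read_count(sample) for sample in rs.phenotype}
os.remove(path)
```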

We've taken different approaches in our preliminary work (comments
on how other developers are dealing with these issues are most
welcome!). For QA types of operations it turns out to be fairly
effective to visit relatively small files (e.g., Solexa lanes or
tiles) and summarize these into useful statistics for further
manipulation (e.g., reports and visualization) at the whole-run
level. For some exploratory alignment algorithms (see matchPDict in
the development version of the Biostrings package) that require more
structured data representations, the approach is more
straightforward: representing the data requires large-memory
machines. Even here though there are some nuances, e.g., processing
each chromosome separately.
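
The QA pattern can be sketched roughly as follows (Python for illustration only; the tile contents are made-up toy data held in memory, whereas in practice each tile would be read from its own file, and `summarize_tile` and `merge` are hypothetical helpers). Each tile is reduced to a small summary, and only the summaries are combined at the run level:

```python
from collections import Counter


def summarize_tile(reads):
    """Reduce one tile's reads to small, mergeable statistics."""
    base_counts = Counter()
    for seq in reads:
        base_counts.update(seq)  # count each nucleotide character
    return {"n_reads": len(reads), "base_counts": base_counts}


def merge(summaries):
    """Combine per-tile summaries into one run-level summary."""
    total = {"n_reads": 0, "base_counts": Counter()}
    for s in summaries:
        total["n_reads"] += s["n_reads"]
        total["base_counts"] += s["base_counts"]
    return total


# Toy stand-in for three tiles of short reads.
tiles = {1: ["ACGT", "AAGT"], 2: ["CCGT"], 3: ["NNNN", "ACGN"]}

# Only one tile's reads are examined at a time; the summaries stay small.
run_summary = merge(summarize_tile(reads) for reads in tiles.values())
```

The statistics chosen here (read count, base composition) are simply ones that add up across tiles; any summary with that property fits the same pattern.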

Maybe a closing thought on this is that the data describing the
experiment might belong in SQL tables (but also fit easily into R's
memory), but it's less clear that the sequences belong in a relational
database. So some other format is likely appropriate for the big
data. Here we've basically been using the disk-based storage
structures implied by output of the Solexa (or other) software
pipeline. Obviously a sub-optimal solution, and it would be great to
hear solutions that other developers have explored.
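
One possible arrangement, sketched in Python with the standard `sqlite3` module (the `lane` table, its columns and the file layout are assumptions made for illustration, not a description of the Solexa pipeline output), keeps the experiment description in SQL and stores only a path to the sequence file:

```python
import os
import sqlite3
import tempfile

# Toy sequence file standing in for pipeline output on disk.
workdir = tempfile.mkdtemp()
seq_path = os.path.join(workdir, "lane1_seq.txt")
with open(seq_path, "w") as fh:
    fh.write("ACGT\nGGCA\n")

# Small, relational description of the experiment (hypothetical schema).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE lane (lane_id INTEGER PRIMARY KEY, sample TEXT, seq_file TEXT)"
)
con.execute("INSERT INTO lane VALUES (?, ?, ?)", (1, "tumour_A", seq_path))


def reads_for_sample(con, sample):
    """Look up the files for a sample in SQL, then stream reads from disk."""
    rows = con.execute(
        "SELECT seq_file FROM lane WHERE sample = ?", (sample,)
    ).fetchall()
    for (path,) in rows:
        with open(path) as fh:
            for line in fh:
                yield line.strip()


reads = list(reads_for_sample(con, "tumour_A"))
```

The database stays small enough to query freely, while the sequences themselves are only touched when a particular sample is requested.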

Martin

------------------------------------------------------------------------------

From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
Sent: Fri 29/02/2008 17:23
To: Stephen Henderson
Cc: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] (no subject) Changed: Overall directions

"Stephen Henderson" <s.henderson at ucl.ac.uk> writes:
> OK
>
> Perhaps I can be first by asking what tasks you plan to cover? And how
> do you plan to implement them in R (given the memory restrictions)? Do
> you plan a nice front end for lots of C-code?

Hi Stephen --

It'll depend of course on who in the community steps up. Probably
packages will start as a standard R interface that gets the job done,
with pretty GUIs later. Probably an early step (though perhaps not
the very first) will be settling on a common set of S4-style classes
to represent experiments and data, in the manner of an ExpressionSet.

From our end, our first pass is to assume that computer resources are
not really an issue -- a 2 or 4 GB 32-bit operating system is not what
we're targeting.

Also in terms of preliminary experience, it seems like some operations
can be done effectively at the R level (data input and QA assessment)
but that some important steps (e.g., alignments) require clever data
structures and algorithms that get implemented in C. It's also
possible for some questions to exploit the structure of the data,
e.g., analyzing Solexa data in manageable chunks corresponding to
individual tiles.

Martin

> Stephen Henderson
>
> Cancer Institute, Paul O'Gorman Building
>
> Gower Street, University College London
>
> United Kingdom, WC1E 6BT
>
> +44 (0)207 679 6827
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793