[Bioc-sig-seq] Overall directions (Martin Morgan)

Martin Morgan mtmorgan at fhcrc.org
Wed Mar 5 20:06:58 CET 2008


"Stephen Henderson" <s.henderson at ucl.ac.uk> writes:

> I think keep sequence data as binary fasta files (which all the current
> alignment/assembly tools use) and let exprSet handle where they are, and
> what they are. This should be fine unless you are a real power user, in
> which case migrating to a db should be simple with the info you have
> collected.

One school of thought is very much along these lines: the sequence
data is not really relational, so storing it in a database might not
be the right solution. The main challenge on the R side is reading the
data in 'chunks' rather than all at once. The DB interfaces allow
this, but so would maintaining a 'catalog' of files on disk (as is
done, for instance, with the Solexa pipeline), among other approaches.
There are more performant solutions, and it would be good to hear from
people exploring those.
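
As a rough illustration of the 'chunks' idea, here is a minimal sketch in base R that reads a text file a few lines at a time through an open connection. The helper name process_in_chunks, the one-read-per-line layout, and the toy reads are assumptions made for this example; they do not describe the Solexa pipeline's actual file formats.

```r
# Hypothetical helper: apply `fun` to successive chunks of at most `n`
# lines read from the file at `path`, and collect the results in a list.
process_in_chunks <- function(path, n, fun) {
    con <- file(path, open = "r")
    on.exit(close(con))
    results <- list()
    repeat {
        chunk <- readLines(con, n = n)   # continues where the last read stopped
        if (length(chunk) == 0L) break   # end of file
        results[[length(results) + 1L]] <- fun(chunk)
    }
    results
}

# Toy data: one made-up read per line in a temporary file.
reads_file <- tempfile(fileext = ".txt")
writeLines(c("ACGT", "GGCA", "TTAG", "CCGA", "ATAT"), reads_file)

# Number of reads seen in each chunk of at most two lines.
chunk_sizes <- unlist(process_in_chunks(reads_file, 2L, length))
```

A 'catalog' in this sense could be as simple as a data frame of file paths and sample annotations, with a helper like the one above applied to each path in turn.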

> When you need to import very large chunks of this for viewing/statistics,
> have you thought about using a 2-bit datatype for each base? This is
> what the GenABEL package does with SNP calls (e.g. AA, BB, AB, NA). This
> should save a lot of memory, as even a raw uses 8 bits.

Yes, the GenABEL and other SNP packages are very good at this. The
Biostrings package represents DNA (and other) sequences as 'raw'
vectors that have reference semantics, meaning that they store (IUPAC)
DNA strings in a memory-efficient, read-only (so no extra copies are
made) way. There is also an effort to expose a C-level interface to Biostrings,
making for a very efficient way to access this data. This is probably
a very good starting point for those wishing to represent sequence
data. Perhaps Herve Pages will say a few words more...
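
To make the 2-bit idea concrete, here is a minimal base-R sketch that packs four bases into each byte of a raw vector. The helpers pack2bit and unpack2bit are hypothetical names invented for this example; this is not how GenABEL or Biostrings are implemented, and it handles only A, C, G and T (no N or other IUPAC codes).

```r
# Hypothetical helpers, base R only.
bases <- c("A", "C", "G", "T")            # codes 0, 1, 2, 3

pack2bit <- function(seq) {
    codes <- match(strsplit(seq, "")[[1]], bases) - 1L
    stopifnot(!anyNA(codes))              # reject anything but A, C, G, T
    # Pad with zeros to a multiple of four so each byte holds four codes.
    padded <- c(codes, integer((-length(codes)) %% 4L))
    m <- matrix(padded, nrow = 4L)        # one column per output byte
    bytes <- m[1, ] + 4L * m[2, ] + 16L * m[3, ] + 64L * m[4, ]
    list(n = length(codes), bytes = as.raw(bytes))
}

unpack2bit <- function(packed) {
    b <- as.integer(packed$bytes)
    codes <- rbind(b %% 4L, (b %/% 4L) %% 4L, (b %/% 16L) %% 4L, b %/% 64L)
    # Column-major order restores the original base order; drop the padding.
    keep <- as.vector(codes)[seq_len(packed$n)]
    paste(bases[keep + 1L], collapse = "")
}

packed <- pack2bit("GATTACA")             # 7 bases stored in 2 bytes
roundtrip <- unpack2bit(packed)
```

The sequence length has to be stored alongside the bytes, because the padding is otherwise indistinguishable from trailing A's.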

A trade-off that developers might keep in mind is that many useful
algorithms then need to be re-written to operate on raw types. For
instance, a not-too-efficient way of calculating nucleotide
frequencies in a vector of reads represented as character sequences is

    table(unlist(strsplit(reads, "")))

This is much easier to implement than something at the C level, though
for a common operation it makes a lot of sense to make the effort
(this one is already available as alphabetFrequency in Biostrings).
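
For comparison, here is a small self-contained sketch of the same count done two ways in base R: on character vectors as in the one-liner above, and on the underlying bytes. The reads are made up for illustration, and the byte-based version only hints at the kind of rewrite meant here; it is not the Biostrings implementation.

```r
reads <- c("ACGT", "GGCA", "TTAG")        # made-up reads

# Character-based version: split every read into single letters and count.
freq_chr <- table(unlist(strsplit(reads, "")))

# Byte-based sketch: count byte values, then pick out A, C, G and T.
bytes <- as.integer(charToRaw(paste(reads, collapse = "")))
counts <- tabulate(bytes, nbins = 255L)   # counts[k] = occurrences of byte k
freq_raw <- counts[as.integer(charToRaw("ACGT"))]
names(freq_raw) <- c("A", "C", "G", "T")
```

Both versions give the same counts for these reads; the second avoids creating one short string per base, which is where much of the cost of the first comes from.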

Martin

> Stephen Henderson
> Cancer Institute, Paul O'Gorman Building
> Huntley Street, University College London
> United Kingdom, WC1E 6BT
> +44 (0)207 679 6827
>  
>
> -----Original Message-----
> From: bioc-sig-sequencing-bounces at r-project.org
> [mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf Of Simon
> Lin
> Sent: 05 March 2008 17:56
> To: bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] Overall directions (Martin Morgan)
>
> I agree that SQL should be in the design mix. Here are a few additional
> thoughts:
> 1) Let R do what it is good at -- statistics!
> 2) Reuse established sequence analysis methods: BLAST, assembly,
> PHRED/PHRAP/CONSED etc
> 3) Define clear object structures in R, so wrappers can be used and
> different algorithms can be tried using the same interface
> 4) Use SQL as a conduit between R and the sequence analysis results,
> because of the large size of the results (not only the raw data)
>
> Simon Lin
> Northwestern
>
> =============================
> Maybe a closing thought on this is that the data describing the
> experiment might belong in SQL tables (but also fit easily into R's
> memory), but it's less clear that the sequences belong in a relational
> database. So some other format is likely appropriate for the big
> data. Here we've basically been using the disk-based storage
> structures implied by output of the Solexa (or other) software
> pipeline. Obviously a sub-optimal solution, and it would be great to
> hear solutions that other developers have explored.
>
> Martin