[Bioc-sig-seq] Overall directions (Martin Morgan)

Stephen Henderson s.henderson at ucl.ac.uk
Wed Mar 5 19:24:44 CET 2008


I think we should keep sequence data as binary fasta files (which all the
current alignment/assembly tools use) and let exprSet record where they
are and what they are. This should be fine unless you are a real power
user, in which case migrating to a db should be simple with the info you
have collected.

When you need to import very large chunks of this for viewing/statistics,
have you thought about using a 2-bit datatype for each base? This is
what the GenABEL package does with SNP calls (e.g. AA, BB, AB, NA). It
should save a lot of memory, since even a Raw uses 8 bits per element.
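
To make the 2-bit idea concrete, here is a rough sketch in Python (not R,
and not how GenABEL actually stores its data). The names pack_bases and
unpack_bases and the A/C/G/T code assignment are assumptions made up for
illustration. Four bases are packed into each byte:

```python
# Hypothetical 2-bit codes; the assignment of codes to bases is arbitrary.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_bases(seq):
    """Pack a string of A/C/G/T into a bytearray, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        code = BASE_TO_CODE[base]  # raises KeyError for N or other symbols
        packed[i // 4] |= code << (2 * (i % 4))
    return packed


def unpack_bases(packed, length):
    """Recover the first `length` bases from a packed bytearray."""
    return "".join(
        CODE_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )


seq = "GATTACAGATTACA"
packed = pack_bases(seq)
assert unpack_bases(packed, len(seq)) == seq
print(len(seq), "bases stored in", len(packed), "bytes")
```

Two caveats with this sketch: the sequence length has to be kept
separately, because the last byte may be only partly filled, and two bits
leave no room for an N call, so ambiguous bases would need a separate mask
or a wider code.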

Stephen Henderson
Cancer Institute, Paul O'Gorman Building
Huntley Street, University College London
United Kingdom, WC1E 6BT
+44 (0)207 679 6827
 

-----Original Message-----
From: bioc-sig-sequencing-bounces at r-project.org
[mailto:bioc-sig-sequencing-bounces at r-project.org] On Behalf Of Simon Lin
Sent: 05 March 2008 17:56
To: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] Overall directions (Martin Morgan)

I agree that SQL should be in the design mix. Here are a few additional 
thoughts:
1) Let R do what it is good at -- statistics!
2) Reuse established sequence analysis methods: BLAST, assembly,
PHRED/PHRAP/CONSED etc.
3) Define clear object structures in R, so wrappers can be used and
different algorithms can be tried using the same interface
4) Use SQL as a conduit between R and the sequence analysis results,
because of the large size of the results (not only the raw data)

Simon Lin
Northwestern

=============================
Maybe a closing thought on this is that the data describing the
experiment might belong in SQL tables (but also fit easily into R's
memory), but it's less clear that the sequences belong in a relational
database. So some other format is likely appropriate for the big
data. Here we've basically been using the disk-based storage
structures implied by output of the Solexa (or other) software
pipeline. Obviously a sub-optimal solution, and it would be great to
hear solutions that other developers have explored.
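
One way to picture this split is sketched below in Python with the
standard sqlite3 module: the experiment description lives in a small SQL
table, and each row only points at a sequence file that stays on disk.
The table name, column names, sample names and paths are all made up for
illustration and do not describe any existing pipeline layout.

```python
import sqlite3

# Hypothetical schema: one row per lane, pointing at a sequence file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE lanes (
           lane_id  INTEGER PRIMARY KEY,
           sample   TEXT NOT NULL,
           platform TEXT NOT NULL,
           seq_path TEXT NOT NULL
       )"""
)

# Made-up example rows; only the file location is stored, not the sequences.
rows = [
    (1, "sample_A", "Solexa", "run01/lane1.fasta"),
    (2, "sample_A", "Solexa", "run01/lane2.fasta"),
    (3, "sample_B", "Solexa", "run01/lane3.fasta"),
]
conn.executemany("INSERT INTO lanes VALUES (?, ?, ?, ?)", rows)
conn.commit()


def paths_for_sample(conn, sample):
    """Return the sequence file paths recorded for one sample, in lane order."""
    cur = conn.execute(
        "SELECT seq_path FROM lanes WHERE sample = ? ORDER BY lane_id",
        (sample,),
    )
    return [row[0] for row in cur]


print(paths_for_sample(conn, "sample_A"))
```

The metadata table stays small enough to load into memory whole, while
the files it points to can be read in chunks only when they are needed.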

Martin

