[Bioc-sig-seq] Seeking advice on hardware options

Fri Jun 6 13:12:22 CEST 2008

On Thu, Jun 5, 2008 at 8:55 PM, Paul Leo <p.leo at uq.edu.au> wrote:
> I've been reading with great interest since Bioc-sig-seq started, we finally have a Solexa GA2... but still in the box so I have no hands on experience yet...
> Must admit I've not sorted out in my head what parts of the analysis are most suited to R (my preference is to do as much as possible in R, but I understand some of the limitations...)
> Anyway the new hardware comes with a cluster of PCs (IPAR) that does the image analysis and they promise will do the base calling soon (not alignment?)
>

Hi, Paul.

You might want to look at http://seqanswers.com/ for some ideas.  That
said, I'll give you my $0.02 worth.

There are several good options for doing alignments including
R/Biostrings, MAQ, ELAND, SOAP, and others....

> Given this, and that we like using R when possible, and that we have some modest resources, can someone comment on the choice between the suggested server for the GA2 system (a quad Xeon 7000 series with 32GB RAM) versus a quad Xeon box with the 5000 series processor (3 GHz)? We have some experience setting up the 5000 series box, which is about ½ the cost of the 7000 series.  We do have High Performance computing at an off-site location but transferring LARGE files to that location may be an issue...so would like to do as much in-house as possible.
>

If you have IPAR, the total file size will not include the images, so
you may be able to go the off-site route.

> Do you find that the analysis you have done benefits greatly from the faster processor, or would 2x 5000 boxes be a better option in your opinion? Any comments on the amount of memory per processor? I was thinking that perhaps this should be higher than recommended if we like to use R. These things I need to resolve before I use the sequencer, unfortunately. I see in some posts that people are using machines with 64GB of memory; is that typical?
>

We deal with 32GB on our 8-processor machines.  The machine is not a
limiting factor for any part of the Solexa pipeline, and any of the
alignment algorithms mentioned above (except perhaps Biostrings) can
be run pretty easily against the human genome on all eight processors
simultaneously with only 32GB of RAM.  That said, memory is
"relatively" cheap, so if you go with 32GB up front, you may want to
configure the machine to allow future expansion, or just go with 64GB
to start.

> Also if anyone with real-world experience can comment on the typical size of the alignment file (for paired end reads on a good day), that is the s_N_export.txt file generated by ELAND and the s*_sequences.txt file generated by GERALD, that would be helpful too (I have the standard product info). Are there other files generated by the pipeline that you have found particularly useful in downstream analysis or that are useful in other 3rd party applications that you have tried?
>

If you are asking about disk space, think pretty big, and think
expandable if you can.  The Short Read Archive (SRA) at NCBI is
accepting submissions of Solexa data.  Basically, the entire Bustard
directory is needed for this, so think about saving at least the
Bustard and GERALD (or equivalent) directories; ideally you could save
the Firecrest directory as well.
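
One rough way to turn "think pretty big" into numbers is to measure a
finished run folder and see how much each pipeline directory actually
takes up.  The following is only a minimal sketch in Python: the
function names dir_sizes and format_gib and the example path are made
up for illustration and are not part of the Solexa pipeline, and it
assumes the pipeline output directories sit as subdirectories of the
folder you point it at.

```python
import os


def dir_sizes(run_dir):
    """Return {name: total bytes} for each immediate subdirectory of run_dir."""
    sizes = {}
    for entry in os.scandir(run_dir):
        # Only report directories; loose files at the top level are skipped.
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                path = os.path.join(root, name)
                # Skip symlinks so linked files are not counted.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        sizes[entry.name] = total
    return sizes


def format_gib(n_bytes):
    """Format a byte count as GiB with two decimal places."""
    return f"{n_bytes / 2**30:.2f} GiB"


if __name__ == "__main__":
    # Hypothetical path: replace with the folder holding your own run output.
    for name, n_bytes in sorted(dir_sizes("/data/runs/example_run").items()):
        print(f"{name}\t{format_gib(n_bytes)}")
```

Multiplying the per-run totals by the number of runs you expect to keep
gives a first estimate of how much expandable storage to plan for.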

Sean