[Bioc-sig-seq] [BioC] Alignments timings

Fri Nov 21 20:17:05 CET 2008

Hi Werner,

Werner Van Belle wrote:
> Hello,
> 
> Just out of pure curiosity: if I have around 60'000'000 short
> fragments, a typical output of an Illumina GAII experiment, can your
> package align these to a reference genome, such as the human genome, and
> if so, how much memory is required to complete this process? How much
> time is necessary on a typical GAII server, that is, 16 GB of memory and 4
> Intel quad-cores?

What is the length of your fragments?

The PDict/matchPDict tool in the Biostrings package uses an approach
that consists of preprocessing the short fragments. The result of this
preprocessing is a PDict object, which currently takes a lot of
memory: around 7GB for 10 million 36-mers. Also, 10M 36-mers (or 15M
25-mers) is close to the maximum number of short fragments that you
can store in a PDict object, so you'll have to split your original set
of 60M fragments into several batches.
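
To make the splitting concrete, here is a rough back-of-the-envelope
sketch in Python (it is not part of Biostrings). It assumes that the
~7GB per 10M 36-mers figure scales linearly with the number of reads,
which is only an assumption, and the helper names plan_batches() and
estimated_pdict_gb() are hypothetical.

```python
import math

TOTAL_READS = 60_000_000           # size of the full read set (from the question)
MAX_READS_PER_PDICT = 10_000_000   # approximate PDict capacity for 36-mers
GB_PER_10M_36MERS = 7.0            # approximate PDict footprint quoted above


def plan_batches(total_reads, max_per_batch):
    """Return (start, end) index pairs, end exclusive, covering all reads."""
    n_batches = math.ceil(total_reads / max_per_batch)
    return [
        (i * max_per_batch, min((i + 1) * max_per_batch, total_reads))
        for i in range(n_batches)
    ]


def estimated_pdict_gb(n_reads):
    """Assumption: memory grows linearly with the number of 36-mers."""
    return GB_PER_10M_36MERS * n_reads / 10_000_000


batches = plan_batches(TOTAL_READS, MAX_READS_PER_PDICT)
print(len(batches))                              # PDict objects to build, one at a time
print(estimated_pdict_gb(MAX_READS_PER_PDICT))   # rough GB for one full batch
```

With these numbers the 60M reads fall into 6 batches of 10M reads each,
and only one PDict object needs to be in memory at any given time.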

Using the PDict object to match the 10M (or 15M) fragments against
the Human genome (+ and - strands) should take about 1 hour on a Linux
server with 16GB of RAM. That's for exact matching. PDict/matchPDict
also supports inexact matching (a small number of mismatches per read,
let's say 1 or 2) but this will increase the time by a factor of 12
for 1 mismatch and by a factor of 240 for 2 mismatches!
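
As a rough illustration of what these factors mean for the full 60M
reads, here is a small Python sketch. It assumes 6 batches of 10M reads
processed one after the other, about 1 hour per batch for exact
matching, and that the slowdown factors apply uniformly to every batch.
These are simplifying assumptions, not measurements, and the function
name estimated_hours() is hypothetical.

```python
HOURS_PER_BATCH_EXACT = 1.0        # ~1 hour per 10M-read batch, exact matching
SLOWDOWN = {0: 1, 1: 12, 2: 240}   # quoted slowdown by max number of mismatches


def estimated_hours(n_batches, max_mismatches):
    """Total wall-clock hours if batches are run sequentially."""
    return n_batches * HOURS_PER_BATCH_EXACT * SLOWDOWN[max_mismatches]


for mm in (0, 1, 2):
    print(mm, estimated_hours(6, mm))
```

Under these assumptions the estimate is 6 hours for exact matching,
72 hours with 1 mismatch and 1440 hours (60 days) with 2 mismatches.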

> 
> Could you compare the speed of your processing against something like
> for instance Eland ?

PDict/matchPDict is very fast for exact matching. If you want to allow
up to 2 mismatches, a tool like bowtie will be much faster. The speed
of PDict/matchPDict will be comparable to that of MAQ but
PDict/matchPDict uses more memory. I'm not sure how it compares with
Eland but I think MAQ is faster than Eland.

Note that bowtie, MAQ and Eland do quality-based alignments.
PDict/matchPDict doesn't use the quality at all.
Another difference is that PDict/matchPDict will return all the matches
for all the reads. bowtie, MAQ and Eland return at most 1 match
per read. With PDict/matchPDict it's up to you to decide what to do with
the reads that have multiple matches. Also for now, we offer no
facilities to write the output of matchPDict() to a file (this will
be added soon).
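
To illustrate the "all matches for all reads" behaviour and one way of
dealing with multi-matching reads, here is a toy Python sketch. It is a
simplified stand-in based on a plain dictionary lookup, not the
algorithm or the API used by PDict/matchPDict, and the names match_all()
and unique_hits() are hypothetical. It assumes all reads have the same
length and contain only A, C, G and T.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def match_all(reads, reference):
    """Return {read_index: [(strand, start), ...]} with every exact hit.

    Starts are 0-based positions on the + strand of the reference.
    """
    width = len(reads[0])
    # Preprocess the reads: map each sequence to the read indices carrying it.
    index = defaultdict(list)
    for i, read in enumerate(reads):
        index[read].append(i)

    hits = defaultdict(list)
    # Slide a window along the reference and look it up on both strands.
    for start in range(len(reference) - width + 1):
        window = reference[start:start + width]
        for i in index.get(window, ()):
            hits[i].append(("+", start))
        for i in index.get(reverse_complement(window), ()):
            hits[i].append(("-", start))
    return hits


def unique_hits(hits):
    """Keep only the reads that match at exactly one location."""
    return {i: locs[0] for i, locs in hits.items() if len(locs) == 1}


reference = "ACGTTGCAAGGCTTACGTT"
reads = ["ACGTT", "GCAAG", "AAGCC"]
hits = match_all(reads, reference)
print(dict(hits))
print(unique_hits(hits))
```

Here the first read matches twice on the + strand and is dropped by
unique_hits(), while the other two reads each have a single match (one
on each strand). Keeping only uniquely matching reads is just one
possible policy; others would be to keep all the matches or to pick one
at random.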

PDict/matchPDict is still a very young tool and there is still room for
improvement: making the PDict object more compact in memory,
allowing it to store 50M or more short reads, supporting indels, limiting
the number of matches per read returned by matchPDict(),
providing I/O facilities, etc. User feedback will help us set
priorities.

Cheers,
H.

> 
> Wkr,
> 
> Werner,-
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319