[Bioc-sig-seq] Bioc short read directions

Wed Apr 2 18:10:46 CEST 2008

Hi Loyal,

Loyal Goff wrote:
> This is a great start...thanks to both Martin and Herve. The speed is  
> indeed impressive! I do have one question.  Would it be advantageous  
> to reduce the data to a unique list of read sequences, and in doing so  
> both retain counts in a separate slot and reduce the matrix size? It  
> seems to me this would speed everything along as well. (ie. only  
> attempt to align a unique sequence once).

PDict()/matchPDict() already do this. A PDict object has a @dups slot that
stores the duplicate information. When the reads are preprocessed with PDict(),
only the unique reads are stored in the Aho-Corasick tree (@actree slot), and,
for each duplicated read, a pointer to the first read that it duplicates is
stored in the @dups slot. Then, when the PDict object is passed to matchPDict()
(or countPDict()), matches are first searched for the unique reads only,
and the @dups slot is then used to report the same matches (or match counts)
for the duplicated reads. All of this is transparent to the user.
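
To make the idea concrete, here is a rough sketch in plain Python of the
"search unique reads once, then copy results to duplicates" scheme. It is
not the Biostrings code: the function names (dedup_reads, count_matches,
count_all) are made up for illustration, the `dups` list only plays the role
of the @dups slot, and a naive substring search stands in for the
Aho-Corasick tree.

```python
def dedup_reads(reads):
    """Split reads into unique ones and duplicates.

    Returns (unique_idx, dups) where unique_idx lists the positions of the
    first occurrence of each distinct read, and dups[i] is the position of
    the first read that reads[i] duplicates (None if reads[i] is itself a
    first occurrence). `dups` is a stand-in for the @dups slot.
    """
    first_seen = {}
    unique_idx = []
    dups = []
    for i, read in enumerate(reads):
        if read in first_seen:
            dups.append(first_seen[read])
        else:
            first_seen[read] = i
            unique_idx.append(i)
            dups.append(None)
    return unique_idx, dups


def count_matches(read, subject):
    """Count exact (possibly overlapping) occurrences of read in subject.

    Naive search, used here only as a placeholder for the real matcher.
    """
    count = 0
    start = subject.find(read)
    while start != -1:
        count += 1
        start = subject.find(read, start + 1)
    return count


def count_all(reads, subject):
    """Return one match count per read, searching each distinct read once."""
    unique_idx, dups = dedup_reads(reads)
    counts = [0] * len(reads)
    # Step 1: do the expensive search for the unique reads only.
    for i in unique_idx:
        counts[i] = count_matches(reads[i], subject)
    # Step 2: copy the result to every duplicate via its pointer.
    for i, first in enumerate(dups):
        if first is not None:
            counts[i] = counts[first]
    return counts


reads = ["ACG", "TTT", "ACG", "CGT", "TTT"]
subject = "ACGTACGTTT"
print(count_all(reads, subject))
```

The caller still gets one result per input read, in the original order, so
the deduplication stays invisible from the outside.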

Cheers,
H.

>  Does anyone have a need to  
> retain independent reads after a quality score cutoff?
> 
> Loyal
> 
> Loyal A. Goff
>