[Bioc-sig-seq] Bioc short read directions
hpages at fhcrc.org
Wed Apr 2 18:10:46 CEST 2008
Loyal Goff wrote:
> This is a great start...thanks to both Martin and Herve. The speed is
> indeed impressive! I do have one question. Would it be advantageous
> to reduce the data to a unique list of read sequences, and in doing so
> both retain counts in a separate slot and reduce the matrix size? It
> seems to me this would speed everything along as well. (i.e., only
> attempt to align a unique sequence once).
PDict()/matchPDict() do this already. A PDict object has a @dups slot for
storing the duplicate information. When the reads are preprocessed with PDict(),
only unique reads are stored in the Aho-Corasick tree (@actree slot), and,
for each duplicated read, a pointer to the first read that it duplicates is
stored in the @dups slot. Then, when the PDict object is passed to matchPDict()
(or countPDict()), matches are searched for the unique reads only, and the
@dups slot is then used to report the same matches (or match counts) for the
duplicated reads. All of this is transparent to the user.
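
To illustrate the idea, here is a minimal sketch in Python. It is not the
Biostrings code: the function names (dedup_reads, count_overlapping,
count_matches) are made up for this example, and a plain substring search
stands in for the Aho-Corasick tree. It only shows the bookkeeping: search
each unique read once, then copy the result to its duplicates through an
index vector that plays the role of the @dups slot.

```python
from typing import Dict, List, Tuple


def dedup_reads(reads: List[str]) -> Tuple[List[str], List[int]]:
    """Split reads into unique reads plus a per-read pointer.

    Returns (unique, rep) where rep[i] is the index in `unique` of the
    first read identical to reads[i]. `rep` is a hypothetical stand-in
    for the duplicate information kept in a PDict's @dups slot.
    """
    first_seen: Dict[str, int] = {}
    unique: List[str] = []
    rep: List[int] = []
    for read in reads:
        if read not in first_seen:
            first_seen[read] = len(unique)
            unique.append(read)
        rep.append(first_seen[read])
    return unique, rep


def count_overlapping(subject: str, pattern: str) -> int:
    """Count exact occurrences of pattern in subject, overlaps included.

    Assumption: a naive scan with str.find, used here only in place of
    the Aho-Corasick search that the real implementation performs.
    """
    count = 0
    start = subject.find(pattern)
    while start != -1:
        count += 1
        start = subject.find(pattern, start + 1)
    return count


def count_matches(reads: List[str], subject: str) -> List[int]:
    """Return one match count per input read, searching each unique read once."""
    unique, rep = dedup_reads(reads)
    unique_counts = [count_overlapping(subject, u) for u in unique]
    # Expand the per-unique results back to the original read order.
    return [unique_counts[j] for j in rep]
```

For example, count_matches(["ACG", "TTT", "ACG", "GGA"], "ACGTTTTACGGA")
searches only three distinct reads and still returns four counts, one per
input read, so the caller never has to deal with the deduplication.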
> Does anyone have a need to
> retain independent reads after a quality score cutoff?
> Loyal A. Goff