[Bioc-sig-seq] matchPDict, fixed=FALSE; "walk_tb_nonfixed_subject(): implement me"

Fri Jun 25 20:03:16 CEST 2010

Hi Ludo,

Yes, matchPDict() used to support fixed=FALSE. It still does, but only
when the PDict object is built with the old implementation of the
Aho-Corasick algo (algo="ACtree"):

   > pdict <- PDict(c("ACCT", "GACC", "CCCT", "CCCA"), algo="ACtree")
   > matchPDict(pdict, DNAString("GNCCT"), fixed="pattern")[[3]]
   IRanges of length 1
       start end width
   [1]     2   5     4
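
To make explicit what fixed="pattern" means in the example above, here is
a naive base-R sketch. It is not how Biostrings does the matching (no
Aho-Corasick tree, no max.mismatch), the names iupac_bases and
match_nonfixed_subject are made up for illustration, and it assumes the
patterns contain only A, C, G and T:

```r
# Bases each IUPAC letter stands for (standard IUPAC nucleotide codes).
iupac_bases <- c(A = "A", C = "C", G = "G", T = "T",
                 M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
                 V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

# Hypothetical helper: return the start positions where 'pattern' matches
# 'subject', treating pattern letters literally and ambiguity codes in the
# subject as "any of the bases they stand for".
match_nonfixed_subject <- function(pattern, subject) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_starts <- length(s) - length(p) + 1L
  if (n_starts < 1L) return(integer(0))
  hits <- integer(0)
  for (start in seq_len(n_starts)) {
    window <- s[start:(start + length(p) - 1L)]
    # A position matches when the pattern base is among the bases
    # that the subject letter stands for.
    ok <- mapply(function(pb, sb) grepl(pb, iupac_bases[[sb]], fixed = TRUE),
                 p, window)
    if (all(ok)) hits <- c(hits, start)
  }
  hits
}

dict <- c("ACCT", "GACC", "CCCT", "CCCA")
lapply(dict, match_nonfixed_subject, subject = "GNCCT")
```

For the 3rd pattern ("CCCT") this finds a single match starting at
position 2, because the N in the subject is allowed to stand for C.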

The "ACtree" algo has been superseded by the "ACtree2" algo, a faster
and more memory efficient implementation of the same algo that uses a
different internal representation than "ACtree" for the Aho-Corasick
tree.

The 'fixed=FALSE' (or 'fixed="pattern"') option is not yet supported
for PDict objects built with the new algo. I'll add this ASAP. Thanks
for the reminder!

Cheers,
H.

On 06/25/2010 03:46 AM, Ludo Pagie wrote:
>
> hi all,
>
> I'm trying to match 80bp reads to a construct, a sequence of +/-
> 550bp. The construct contains a stretch of N's, representing a
> stretch of 20 random nucleotides.
>
> I constructed a pdict from the reads, and a DNAString from the
> construct. When I run matchPDict with fixed=TRUE (the default), all goes fine
> and I get 1.2M matches.
>
>> construct_mindex<- matchPDict(pdict, DNAString(construct), max.mismatch=3)
>> sum(countIndex(construct_mindex))
> [1] 1280283
>
>
> With fixed=FALSE I get the following error:
>
>> construct_mindex<- matchPDict(pdict, DNAString(construct), max.mismatch=3, fixed=FALSE)
> Error in .match.PDict3Parts.XString(pdict@threeparts, subject, max.mismatch,  :
>    walk_tb_nonfixed_subject(): implement me
>
> Is there a way around this non-implemented function? Or any
> chance it will be implemented soon? Or am I missing something?
>
> If you need more background let me know.
>
> Ludo
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-06-17 r52313)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.7.7      Rsamtools_1.1.7      lattice_0.18-8
> [4] GenomicRanges_1.1.12 Biostrings_2.17.7    IRanges_1.7.7
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 grid_2.12.0   hwriter_1.2   tools_2.12.0