[Bioc-sig-seq] matchPDict, fixed=FALSE; "walk_tb_nonfixed_subject(): implement me"

Mon Jul 19 21:10:46 CEST 2010

Hi Ludo,

matchPDict() and family now support fixed=FALSE by default (i.e.
when the PDict object was used with algo="ACtree2"):

   > library(Biostrings)
   > dict <- DNAStringSet(c("TTC", "CTT"))
   > pdict <- PDict(dict)
   > subject <- DNAString("CYTCACTTC")
   > as.list(matchPDict(pdict, subject, fixed="pattern"))
   [[1]]
   IRanges of length 2
       start end width
   [1]     2   4     3
   [2]     7   9     3

   [[2]]
   IRanges of length 2
       start end width
   [1]     1   3     3
   [2]     6   8     3

Matching the probes of an Affy chip to the Human chr1 sequence
previously altered by injecting all known SNPs from dbSNP:

   library(hgu95av2probe)
   probes <- DNAStringSet(hgu95av2probe)
   pdict <- PDict(probes)

   library(BSgenome.Hsapiens.UCSC.hg19)
   library(SNPlocs.Hsapiens.dbSNP.20100427)
   SNP_Hsapiens <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20100427")
   SNP_subject <- SNP_Hsapiens$chr1

This takes about 2m40s on my laptop (and uses about 1.5GB of memory):

   mi1 <- matchPDict(pdict, SNP_subject, fixed="pattern")

Note that the fixed="pattern" feature only works if the density of
ambiguity letters in the subject is not too high. For your
particular situation where it contains a stretch of 20 N's it might
not work (hard to say because the limit also depends on the number
of nodes in the Aho-Corasick tree, which itself depends on the number
and lengths of your reads).

Create a 550bp subject with a 20N stretch in the middle:
   subject <- DNAString(paste(c(sample(DNA_BASES, 265, replace=TRUE),
                                rep.int("N", 20),
                                sample(DNA_BASES, 265, replace=TRUE)),
                              collapse=""))

Create 201800 80bp reads:
   reads <- narrow(start=11,
              xscat(probes, reverseComplement(probes),
                    rev(probes), rev(reverseComplement(probes))),
                   width=80)

Try to align all the reads at once:

   > pdict <- PDict(reads)
   > mi2 <- matchPDict(pdict, subject, fixed="pattern")
   Error in .match.PDict3Parts.XString(pdict at threeparts, subject, 
max.mismatch,  :
     too many IUPAC ambiguity letters in 'subject'

But with a smaller dictionary:

   > pdict <- PDict(reads[1:100000])
   > mi3 <- matchPDict(pdict, subject, fixed="pattern")

...it works (and takes about 7s on my laptop).

Unfortunately, while working on this, I discovered that the
'fixed=FALSE' feature for matchPDict() and family was broken so
far (i.e. for Biostrings < 2.16.8): it would miss some matches
under certain circumstances.
I spent some good amount of time testing the new implementation
(my test is still running, should take about 5 days to complete)
by comparing the matches obtained with matchPDict() with those
obtained with calling matchPattern() within a loop (for the
latter the shift-or algo is used which is a completely different 
approach/implementation, it's also much slower). The dict used
for this test was hgu95av2probe and the subject was Human chr1
(hg19), like in the 1st example above. The 2 methods are producing
exactly the same set of matches so far (after checking the first
126000 probes in hgu95av2probe).

This is available in the latest release version of Biostrings
(2.16.8). Please update your packages as explained here:

   http://bioconductor.org/docs/install/

Let me know if you have any questions.

Cheers,
H.

On 06/25/2010 11:03 AM, Hervé Pagès wrote:
> Hi Ludo,
>
> Yes matchPDict() used to support fixed=FALSE. It still does, but only
> when the PDict object is made using the old implementation of the
> Aho-Corasick algo ('algo="ACtree"'):
>
>  > pdict <- PDict(c("ACCT", "GACC", "CCCT", "CCCA"), algo="ACtree")
>  > matchPDict(pdict, DNAString("GNCCT"), fixed="pattern")[[3]]
> IRanges of length 1
> start end width
> [1] 2 5 4
>
> The "ACtree" algo has been superseded by the "ACtree2" algo, a faster
> and more memory efficient implementation of the same algo that uses a
> different internal representation than "ACtree" for the Aho-Corasick
> tree.
>
> The 'fixed=TRUE' (or 'fixed="pattern"') option is not yet supported
> for PDict objects built with the new algo. I'll add this ASAP. Thanks
> for the reminder!
>
> Cheers,
> H.
>
> On 06/25/2010 03:46 AM, Ludo Pagie wrote:
>>
>> hi all,
>>
>> I'm trying to match 80bp reads to a construct, a sequence of +/-
>> 550bp. The construct contains a strecth of N's, representing a
>> stretch of 20 random nucleotides.
>>
>> I constructed a pdict from the reads, and a DNAString from the
>> construct. When I run matchPDict with fixed=TRUE, all goes fine
>> and I get 1.2M matches.
>>
>>> construct_mindex<- matchPDict(pdict, DNAString(construct),
>>> max.mismatch=3)
>>> sum(countIndex(construct_mindex))
>> [1] 1280283
>>
>>
>> With fixed=FALSE I get the following error:
>>
>>> construct_mindex<- matchPDict(pdict, DNAString(construct),
>>> max.mismatch=3, fixed=FALSE)
>> Error in .match.PDict3Parts.XString(pdict at threeparts, subject,
>> max.mismatch, :
>> walk_tb_nonfixed_subject(): implement me
>>
>> Is there a way around this non-implemented function? Or any
>> chance it will be implemented soon? Or am I missing something.
>>
>> If you need more background let me know.
>>
>> Ludo
>>
>>> sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-06-17
>> r52313)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets
>> methods base
>>
>> other attached packages:
>> [1] ShortRead_1.7.7 Rsamtools_1.1.7
>> lattice_0.18-8
>> [4] GenomicRanges_1.1.12 Biostrings_2.17.7 IRanges_1.7.7
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.9.0 grid_2.12.0 hwriter_1.2 tools_2.12.0
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319