[Bioc-sig-seq] matchPDict, fixed=FALSE; "walk_tb_nonfixed_subject(): implement me"

Mon Jul 19 21:15:12 CEST 2010

On 07/19/2010 12:10 PM, Hervé Pagès wrote:
> Hi Ludo,
>
> matchPDict() and family now support fixed=FALSE by default (i.e.
> when the PDict object was used with algo="ACtree2"):
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                         was preprocessed with algo="ACtree2", which
is the default

Cheers,
H.

>
>  > library(Biostrings)
>  > dict <- DNAStringSet(c("TTC", "CTT"))
>  > pdict <- PDict(dict)
>  > subject <- DNAString("CYTCACTTC")
>  > as.list(matchPDict(pdict, subject, fixed="pattern"))
> [[1]]
> IRanges of length 2
> start end width
> [1] 2 4 3
> [2] 7 9 3
>
> [[2]]
> IRanges of length 2
> start end width
> [1] 1 3 3
> [2] 6 8 3
>
> Matching the probes of an Affy chip to the Human chr1 sequence
> previously altered by injecting all known SNPs from dbSNP:
>
> library(hgu95av2probe)
> probes <- DNAStringSet(hgu95av2probe)
> pdict <- PDict(probes)
>
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(SNPlocs.Hsapiens.dbSNP.20100427)
> SNP_Hsapiens <- injectSNPs(Hsapiens, "SNPlocs.Hsapiens.dbSNP.20100427")
> SNP_subject <- SNP_Hsapiens$chr1
>
> This takes about 2m40s on my laptop (and uses about 1.5GB of memory):
>
> mi1 <- matchPDict(pdict, SNP_subject, fixed="pattern")
>
> Note that the fixed="pattern" feature only works if the density of
> ambiguity letters in the subject is not too high. For your
> particular situation where it contains a stretch of 20 N's it might
> not work (hard to say because the limit also depends on the number
> of nodes in the Aho-Corasick tree, which itself depends on the number
> and lengths of your reads).
>
> Create a 550bp subject with a 20N stretch in the middle:
> subject <- DNAString(paste(c(sample(DNA_BASES, 265, replace=TRUE),
> rep.int("N", 20),
> sample(DNA_BASES, 265, replace=TRUE)),
> collapse=""))
>
> Create 201800 80bp reads:
> reads <- narrow(start=11,
> xscat(probes, reverseComplement(probes),
> rev(probes), rev(reverseComplement(probes))),
> width=80)
>
> Try to align all the reads at once:
>
>  > pdict <- PDict(reads)
>  > mi2 <- matchPDict(pdict, subject, fixed="pattern")
> Error in .match.PDict3Parts.XString(pdict at threeparts, subject,
> max.mismatch, :
> too many IUPAC ambiguity letters in 'subject'
>
> But with a smaller dictionary:
>
>  > pdict <- PDict(reads[1:100000])
>  > mi3 <- matchPDict(pdict, subject, fixed="pattern")
>
> ...it works (and takes about 7s on my laptop).
>
> Unfortunately, while working on this, I discovered that the
> 'fixed=FALSE' feature for matchPDict() and family was broken so
> far (i.e. for Biostrings < 2.16.8): it would miss some matches
> under certain circumstances.
> I spent some good amount of time testing the new implementation
> (my test is still running, should take about 5 days to complete)
> by comparing the matches obtained with matchPDict() with those
> obtained with calling matchPattern() within a loop (for the
> latter the shift-or algo is used which is a completely different
> approach/implementation, it's also much slower). The dict used
> for this test was hgu95av2probe and the subject was Human chr1
> (hg19), like in the 1st example above. The 2 methods are producing
> exactly the same set of matches so far (after checking the first
> 126000 probes in hgu95av2probe).
>
> This is available in the latest release version of Biostrings
> (2.16.8). Please update your packages as explained here:
>
> http://bioconductor.org/docs/install/
>
> Let me know if you have any questions.
>
> Cheers,
> H.
>
>
> On 06/25/2010 11:03 AM, Hervé Pagès wrote:
>> Hi Ludo,
>>
>> Yes matchPDict() used to support fixed=FALSE. It still does, but only
>> when the PDict object is made using the old implementation of the
>> Aho-Corasick algo ('algo="ACtree"'):
>>
>> > pdict <- PDict(c("ACCT", "GACC", "CCCT", "CCCA"), algo="ACtree")
>> > matchPDict(pdict, DNAString("GNCCT"), fixed="pattern")[[3]]
>> IRanges of length 1
>> start end width
>> [1] 2 5 4
>>
>> The "ACtree" algo has been superseded by the "ACtree2" algo, a faster
>> and more memory efficient implementation of the same algo that uses a
>> different internal representation than "ACtree" for the Aho-Corasick
>> tree.
>>
>> The 'fixed=TRUE' (or 'fixed="pattern"') option is not yet supported
>> for PDict objects built with the new algo. I'll add this ASAP. Thanks
>> for the reminder!
>>
>> Cheers,
>> H.
>>
>> On 06/25/2010 03:46 AM, Ludo Pagie wrote:
>>>
>>> hi all,
>>>
>>> I'm trying to match 80bp reads to a construct, a sequence of +/-
>>> 550bp. The construct contains a strecth of N's, representing a
>>> stretch of 20 random nucleotides.
>>>
>>> I constructed a pdict from the reads, and a DNAString from the
>>> construct. When I run matchPDict with fixed=TRUE, all goes fine
>>> and I get 1.2M matches.
>>>
>>>> construct_mindex<- matchPDict(pdict, DNAString(construct),
>>>> max.mismatch=3)
>>>> sum(countIndex(construct_mindex))
>>> [1] 1280283
>>>
>>>
>>> With fixed=FALSE I get the following error:
>>>
>>>> construct_mindex<- matchPDict(pdict, DNAString(construct),
>>>> max.mismatch=3, fixed=FALSE)
>>> Error in .match.PDict3Parts.XString(pdict at threeparts, subject,
>>> max.mismatch, :
>>> walk_tb_nonfixed_subject(): implement me
>>>
>>> Is there a way around this non-implemented function? Or any
>>> chance it will be implemented soon? Or am I missing something.
>>>
>>> If you need more background let me know.
>>>
>>> Ludo
>>>
>>>> sessionInfo()
>>> R version 2.12.0 Under development (unstable) (2010-06-17
>>> r52313)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets
>>> methods base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.7.7 Rsamtools_1.1.7
>>> lattice_0.18-8
>>> [4] GenomicRanges_1.1.12 Biostrings_2.17.7 IRanges_1.7.7
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.9.0 grid_2.12.0 hwriter_1.2 tools_2.12.0
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319