[BioC] Biostrings::matchPattern,extract sequences

Wed Sep 18 02:08:29 CEST 2013

On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote:
> Cool. Thanks, Herve!
>
> Is there a method to extract the mismatch position for all the matches?
> Right now, I am using pairwiseAlignment for each matched subsequence.

No easy way at the moment (and this has been on the TODO list for a
while). A faster workaround than pairwiseAlignment() if you don't use
with.indels=TRUE or fixed=FALSE is:

   is_mismatch <- relist(as.raw(unlist(matches)) != 
as.raw(DNAString(pattern)), matches)

'is_mismatch' should be a LogicalList object with the same shape as
'matches'. Then you can do 'which(is_mismatch)' to get the mismatch
positions. The result will be an IntegerList object.

HTH,
H.

> However, this could become very slow when the number of matched sequences
> gets large.
>
>
> Best regards,
>
> Julie
>
>
> On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>
>> Hi Julie,
>>
>> Sorry for the late answer.
>>
>> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote:
>>> Herve,
>>>
>>> Is there a more elegant way to get all matched reference sequences
>>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894
>>> : 3010916] for each matched record? Thanks!
>>>     23-letter "DNAString" instance
>>> seq: GCGGAGCCTGAGGCAGAAACCTC
>>>
>>> matches is the object returned by matchPattern function call.
>>>
>>> matches
>>>     Views on a 197195432-letter DNAString subject
>>> subject:
>>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...CCTATTCT
>>> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC
>>> views:
>>>              start       end width
>>>      [1]   3010894   3010916    23 [GCGGAGCCTGAGGCAGAAACCTC]
>>>      [2]   3299593   3299615    23 [GCTGTGGCTGAGATGAATACTGG]
>>>      [3]   3637189   3637211    23 [CCTGCTTCTGCCTCTGCCACCGG]
>>>      [4]   3660740   3660762    23 [GCTGTTGCTGCCGCTGTTGGTGG]
>>>      [5]   3661169   3661191    23 [GCTGCCCCGGCCGCCGCCGCCCG]
>>>      [6]   3661721   3661743    23 [CCCGCGGCTGCAGCACGAGCCGC]
>>> ....
>>
>> Just turn this into a DNAStringSet object with a coercion:
>>
>>     as(matches, "DNAStringSet")
>>
>> or by calling the DNAStringSet() constructor on it:
>>
>>     DNAStringSet(matches)
>>
>> Cheers,
>> H.
>>
>>>
>>> Best regards,
>>>
>>> Julie
>>>
>>>    sessionInfo()
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] grid      parallel  stats     graphics  grDevices utils     datasets
>>>    methods   base
>>>
>>> other attached packages:
>>>    [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19  BiocInstaller_1.10.3
>>>                  REDseq_1.6.0
>>>    [4] ChIPpeakAnno_2.8.0                  GenomicFeatures_1.12.3
>>>                limma_3.16.7
>>>    [7] org.Hs.eg.db_2.9.0                  GO.db_2.9.0
>>>                           RSQLite_0.11.4
>>> [10] DBI_0.2-7                           AnnotationDbi_1.22.6
>>>                  BSgenome.Ecoli.NCBI.20080805_1.3.17
>>> [13] biomaRt_2.16.0                      VennDiagram_1.6.5
>>>                     multtest_2.16.0
>>> [16] Biobase_2.20.1
>>>                        BSgenome.Celegans.UCSC.ce2_1.3.19   BSgenome_1.28.0
>>> [19] ShortRead_1.18.0                    latticeExtra_0.6-26
>>>                   RColorBrewer_1.0-5
>>> [22] Rsamtools_1.12.4                    lattice_0.20-23
>>>                       Biostrings_2.28.0
>>> [25] GenomicRanges_1.12.5                IRanges_1.18.3
>>>                        BiocGenerics_0.6.0
>>>
>>> loaded via a namespace (and not attached):
>>>    [1] bitops_1.0-6       hwriter_1.3        MASS_7.3-29
>>>          RCurl_1.95-4.1     rtracklayer_1.20.4 splines_3.0.1
>>>        stats4_3.0.1
>>>    [8] survival_2.37-4    tools_3.0.1        XML_3.95-0.2
>>>         zlibbioc_1.6.0
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319