[BioC] Biostrings::matchPattern,extract sequences
Hervé Pagès
hpages at fhcrc.org
Wed Sep 18 02:08:29 CEST 2013
On 09/17/2013 04:51 PM, Zhu, Lihua (Julie) wrote:
> Cool. Thanks, Herve!
>
> Is there a method to extract the mismatch position for all the matches?
> Right now, I am using pairwiseAlignment for each matched subsequence.
No easy way at the moment (and this has been on the TODO list for a
while). A faster workaround than pairwiseAlignment() if you don't use
with.indels=TRUE or fixed=FALSE is:
is_mismatch <- relist(as.raw(unlist(matches)) !=
as.raw(DNAString(pattern)), matches)
'is_mismatch' should be a LogicalList object with the same shape as
'matches'. Then you can do 'which(is_mismatch)' to get the mismatch
positions. The result will be an IntegerList object.
HTH,
H.
> However, this could become very slow when the number of matched sequences
> gets large.
>
>
> Best regards,
>
> Julie
>
>
> On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>
>> Hi Julie,
>>
>> Sorry for the late answer.
>>
>> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote:
>>> Herve,
>>>
>>> Is there a more elegant way to get all matched reference sequences
>>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894
>>> : 3010916] for each matched record? Thanks!
>>> 23-letter "DNAString" instance
>>> seq: GCGGAGCCTGAGGCAGAAACCTC
>>>
>>> matches is the object returned by matchPattern function call.
>>>
>>> matches
>>> Views on a 197195432-letter DNAString subject
>>> subject:
>>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...CCTATTCT
>>> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC
>>> views:
>>> start end width
>>> [1] 3010894 3010916 23 [GCGGAGCCTGAGGCAGAAACCTC]
>>> [2] 3299593 3299615 23 [GCTGTGGCTGAGATGAATACTGG]
>>> [3] 3637189 3637211 23 [CCTGCTTCTGCCTCTGCCACCGG]
>>> [4] 3660740 3660762 23 [GCTGTTGCTGCCGCTGTTGGTGG]
>>> [5] 3661169 3661191 23 [GCTGCCCCGGCCGCCGCCGCCCG]
>>> [6] 3661721 3661743 23 [CCCGCGGCTGCAGCACGAGCCGC]
>>> ....
>>
>> Just turn this into a DNAStringSet object with a coercion:
>>
>> as(matches, "DNAStringSet")
>>
>> or by calling the DNAStringSet() constructor on it:
>>
>> DNAStringSet(matches)
>>
>> Cheers,
>> H.
>>
>>>
>>> Best regards,
>>>
>>> Julie
>>>
>>> sessionInfo()
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] grid parallel stats graphics grDevices utils datasets
>>> methods base
>>>
>>> other attached packages:
>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 BiocInstaller_1.10.3
>>> REDseq_1.6.0
>>> [4] ChIPpeakAnno_2.8.0 GenomicFeatures_1.12.3
>>> limma_3.16.7
>>> [7] org.Hs.eg.db_2.9.0 GO.db_2.9.0
>>> RSQLite_0.11.4
>>> [10] DBI_0.2-7 AnnotationDbi_1.22.6
>>> BSgenome.Ecoli.NCBI.20080805_1.3.17
>>> [13] biomaRt_2.16.0 VennDiagram_1.6.5
>>> multtest_2.16.0
>>> [16] Biobase_2.20.1
>>> BSgenome.Celegans.UCSC.ce2_1.3.19 BSgenome_1.28.0
>>> [19] ShortRead_1.18.0 latticeExtra_0.6-26
>>> RColorBrewer_1.0-5
>>> [22] Rsamtools_1.12.4 lattice_0.20-23
>>> Biostrings_2.28.0
>>> [25] GenomicRanges_1.12.5 IRanges_1.18.3
>>> BiocGenerics_0.6.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6 hwriter_1.3 MASS_7.3-29
>>> RCurl_1.95-4.1 rtracklayer_1.20.4 splines_3.0.1
>>> stats4_3.0.1
>>> [8] survival_2.37-4 tools_3.0.1 XML_3.95-0.2
>>> zlibbioc_1.6.0
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list