[BioC] Biostrings::matchPattern,extract sequences

Wed Sep 18 00:52:08 CEST 2013

Hi Julie,

Sorry for the late answer.

On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote:
> Herve,
>
> Is there a more elegant way to get all matched reference sequences
> besides using subject(matches)[start:end], e.g, subject(matches)[3010894
> : 3010916] for each matched record? Thanks!
>    23-letter "DNAString" instance
> seq: GCGGAGCCTGAGGCAGAAACCTC
>
> matches is the object returned by matchPattern function call.
>
> matches
>    Views on a 197195432-letter DNAString subject
> subject:
> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...CCTATTCTAGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC
> views:
>             start       end width
>     [1]   3010894   3010916    23 [GCGGAGCCTGAGGCAGAAACCTC]
>     [2]   3299593   3299615    23 [GCTGTGGCTGAGATGAATACTGG]
>     [3]   3637189   3637211    23 [CCTGCTTCTGCCTCTGCCACCGG]
>     [4]   3660740   3660762    23 [GCTGTTGCTGCCGCTGTTGGTGG]
>     [5]   3661169   3661191    23 [GCTGCCCCGGCCGCCGCCGCCCG]
>     [6]   3661721   3661743    23 [CCCGCGGCTGCAGCACGAGCCGC]
> ....

Just turn this into a DNAStringSet object with a coercion:

   as(matches, "DNAStringSet")

or by calling the DNAStringSet() constructor on it:

   DNAStringSet(matches)

Cheers,
H.

>
> Best regards,
>
> Julie
>
>   sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] grid      parallel  stats     graphics  grDevices utils     datasets
>   methods   base
>
> other attached packages:
>   [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19  BiocInstaller_1.10.3
>                 REDseq_1.6.0
>   [4] ChIPpeakAnno_2.8.0                  GenomicFeatures_1.12.3
>               limma_3.16.7
>   [7] org.Hs.eg.db_2.9.0                  GO.db_2.9.0
>                          RSQLite_0.11.4
> [10] DBI_0.2-7                           AnnotationDbi_1.22.6
>                 BSgenome.Ecoli.NCBI.20080805_1.3.17
> [13] biomaRt_2.16.0                      VennDiagram_1.6.5
>                    multtest_2.16.0
> [16] Biobase_2.20.1
>                       BSgenome.Celegans.UCSC.ce2_1.3.19   BSgenome_1.28.0
> [19] ShortRead_1.18.0                    latticeExtra_0.6-26
>                  RColorBrewer_1.0-5
> [22] Rsamtools_1.12.4                    lattice_0.20-23
>                      Biostrings_2.28.0
> [25] GenomicRanges_1.12.5                IRanges_1.18.3
>                       BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] bitops_1.0-6       hwriter_1.3        MASS_7.3-29
>         RCurl_1.95-4.1     rtracklayer_1.20.4 splines_3.0.1
>       stats4_3.0.1
>   [8] survival_2.37-4    tools_3.0.1        XML_3.95-0.2
>        zlibbioc_1.6.0

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319