[BioC] substr on XStringSet-class

Sun Dec 29 03:18:18 CET 2013

Hi Tim,

Here is how you can do this:

   library(Biostrings)
   genes <- DNAStringSet(c("GTTGATTAC", "AGGACCT", "AGTTTGTTCCGTTCACCTACC"))
   m0 <- vmatchPattern("GTT", genes)

Then:

   > m0
   MIndex object of length 3
   [[1]]
   IRanges of length 1
       start end width
   [1]     1   3     3

   [[2]]
   IRanges of length 0

   [[3]]
   IRanges of length 3
       start end width
   [1]     2   4     3
   [2]     6   8     3
   [3]    11  13     3

Nb of matches per gene:

   > elementLengths(m0)  # equivalent to vcountPattern("GTT", genes)
   [1] 1 0 3

To extend the ranges to the right, ideally we'd like to be able to
do something like:

   resize(m0, width=8)  # doesn't work on an MIndex object

but this is not yet supported on MIndex objects (the type of 'm0').
A workaround for now is to turn 'm0' into a CompressedIRangesList
object first:

   m1 <- as(m0, "CompressedIRangesList")

Then:

   m2 <- resize(m1, width=8)

Now we can use extractAt() to extract the corresponding sequences:

   > extractAt(genes, m2)
   DNAStringSetList of length 3
   [[1]] GTTGATTA
   [[2]]   A DNAStringSet instance of length 0
   [[3]] GTTTGTTC GTTCCGTT GTTCACCT

See ?extractAt for more information. The man page actually has an example
that shows how to use extractAt() for extracting the match sequences
corresponding to the result of vmatchPattern().

Cheers,
H.

On 12/20/2013 03:52 PM, Tim Homan [guest] wrote:
>
> Hello,
>
> I would like to get all the substrings of a patternmatch on a XStringSet-class. I now use the following code, but this ignores multiple matches and I have the feeling there is a better way to do it that uses biostrings fuctions.
>
> I load a fastafile into a XStringSet-class object and then search for a specific string using the vmatchPattern function:
>
> genes <- readDNAStringSet(File = "filename", format = "fasta", use.names = T)
> view <- vmatchPattern(pattern = "CCGGA", genes)
> matches <- unlist(view, recursive = T, use.names = T)
> m <- as.matrix(matches)
>
> I retrieve a substring starting at the match and 20 positions upward:
>
> subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20)
>
> What is a better way to do this that includes all possible matches?
>
>   -- output of sessionInfo():
>
> x
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319