[BioC] motif searching with variable length gaps

Herve Pages hpages at fhcrc.org
Thu Nov 1 23:27:59 CET 2007


Hi Heather,

Would this function work for you?

matchLRPatterns <- function(Lpattern, Rpattern, maxngaps, subject, Lmismatch=0, Rmismatch=0)
{
    Lmatches <- matchPattern(Lpattern, subject, mismatch=Lmismatch)
    Rmatches <- matchPattern(Rpattern, subject, mismatch=Rmismatch)
    ans_start <- ans_end <- integer(0)
    for (i in seq_len(length(Lmatches))) {
        ngaps <- start(Rmatches) - end(Lmatches)[i] - 1
        jj <- which(0 <= ngaps & ngaps <= maxngaps)
        ans_start <- c(ans_start, rep(start(Lmatches)[i], length(jj)))
        ans_end <- c(ans_end, end(Rmatches)[jj])
    }
    views(subject, ans_start, ans_end)
}

Arguments:

  o Lpattern, Rpattern: the left and right parts of your motif (for example
      TACGTGGGT and TTCCACAG for the 3rd motif you gave us: TACGTGGGT--TTCCACAG)

  o maxngaps: the max number of gaps in the middle i.e the distance between
      the left and right parts of your motif

  o subject: a BString (or derived) object containing the subject string
      (in your case it needs to be a DNAString object)

  o Lmismatch, Rmismatch: additionally you can choose to allow a given number of
      mismatches for the left and right parts of the motif

So for example if I want to search motif TACGTGGGT--TTCCACAG in chromosome 1
of the mouse:

  > library(BSgenome.Mmusculus.UCSC.mm9)
  > chr1 <- Mmusculus$chr1
  > matchLRPatterns("TACGTGGGT", "TTCCACAG", maxngaps=2, chr1)
    Views on a 197195432-letter DNAString subject
  Subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAGAATTTGGTATTAAACTTAAAACTGGAATTC
  Views: NONE

I don't find any match. But if I allow the number of gaps to be <= 150 instead of 2:

  > matchLRPatterns("TACGTGGGT", "TTCCACAG", maxngaps=150, chr1)
    Views on a 197195432-letter DNAString subject
  Subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAGAATTTGGTATTAAACTTAAAACTGGAATTC
  Views:
          start       end width
  [1] 193252084 193252245   162 [TACGTGGGTTCCTGACGATGGT...ATGTGAACTCTTTCTTCCACAG]

then I find one match.

Note that matchLRPatterns() will return all matches, even overlapping matches:

  > library(Biostrings)
  > subject <- DNAString("AAATTAACCCTT")

  > matchLRPatterns("AA", "TT", 0, subject)
    Views on a 12-letter DNAString subject
  Subject: AAATTAACCCTT
  Views:
      start end width
  [1]     2   5     4 [AATT]

  > matchLRPatterns("AA", "TT", 1, subject)
    Views on a 12-letter DNAString subject
  Subject: AAATTAACCCTT
  Views:
      start end width
  [1]     1   5     5 [AAATT]
  [2]     2   5     4 [AATT]

  > matchLRPatterns("AA", "TT", 3, subject)
    Views on a 12-letter DNAString subject
  Subject: AAATTAACCCTT
  Views:
      start end width
  [1]     1   5     5 [AAATT]
  [2]     2   5     4 [AATT]
  [3]     6  12     7 [AACCCTT]

  > matchLRPatterns("AA", "TT", 7, subject)
    Views on a 12-letter DNAString subject
  Subject: AAATTAACCCTT
  Views:
      start end width
  [1]     1   5     5 [AAATT]
  [2]     2   5     4 [AATT]
  [3]     2  12    11 [AATTAACCCTT]
  [4]     6  12     7 [AACCCTT]

Also not that matches will always been ordered from left to right.

Please let me know if this is not what you want.

Cheers,
H.


Houseman, Heather wrote:
> Herve,
> 
> My ultimate goal is to find motifs in different sequences that are similar to the ones below.
> 
> TACGTGCTGTCTCACACAG
> GACGTGACTCGGACCACAT
> TACGTGGGT--TTCCACAG
> TACGTGAC----CACACAC
> TACGTGC-------CACAG
> CACGTGC-------CACAC
> GGCGTGAGC-----CACCG
> GGCGTGGGAGCG--CACAG
> TACGTG------CACACAG
> 
> To start off, I'm inserting the motifs above into random sequences to see if I can get cosmo to return those motifs.  Once I get that procedure to work, I'd like to use it to apply it to "real" sequences and hopefully return motifs that look similar to the ones above.
> 
> Here's the cosmo code I'm using:
> 
> res = cosmo(seqs = seqs, minW = 12, maxW = 20, models = "OOPS")
> 
> Is this more along the lines of multiple sequence alignment and not something that I can use cosmo for?
> 
> Thanks!
> 
> Heather
> 
> -----Original Message-----
> From: Herve Pages [mailto:hpages at fhcrc.org]
> Sent: Thursday, November 01, 2007 1:33 PM
> To: Houseman, Heather
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] motif searching with variable length gaps
> 
> Hi Heather,
> 
> Can you please give some examples of your motifs?
> 
> Also showing us the code that you use with cosmo can be useful.
> 
> Even if the matchPattern() function in Biostrings doesn't let you control the number
> of gaps, there might be workarounds, it all depends what your motifs really look
> like. And we need use cases anyway so we know where to put our efforts. Thanks!
> 
> H.
> 
> 
> Houseman, Heather wrote:
>> Dear Bioconductor mailing list:
>>
>> I've been using cosmo to look for motifs.  I'd like to search for motifs that have a variable length of gaps in the middle. If I specify a range of motif widths with the cosmo function, it uses the width with the lowest BIC value and searches for motifs of only that width.  My dilemma is that the motifs I'm looking for are of variable width.
>>
>> Thanks in advance for any help!
>>
>> Heather
>>
>> This email message, including any attachments, is for th...{{dropped:9}}
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> This email message, including any attachments, is for ...{{dropped:2}}



More information about the Bioconductor mailing list