[BioC] Motif search

Paul Shannon pshannon at fhcrc.org
Fri Apr 20 21:37:23 CEST 2012


Hi Nooshin,

Here is a script which demonstrates the basics of 

  1) how to obtain the promoter region of a named gene (paramaterized by upstream & downstream lengths, default 1000 and 100)
  2) searching in that sequence for a motif, specified either as a PWM or a simple character string.

If you want to do this in bulk, Herve' has some lovely code to make that efficient.  But if (as you say) you have just a few genes to look at, this example should be enough to get you going, acquaint you with a few data packages (TxDb, BSgenome.At) and functions (matchPattern and matchPWM).

I include a convenience function ('get.promoter') and its test ('test.get.promoter').  This is "strand aware" and uses coordinates from the TxDb object.  I think a quick glance will allow you to make sense of this.  

Execute 'test.get.promoter ()' just to make sure everything is installed properly for you.  It should return TRUE.


I use a programming idiom here, well-suited to incremental exploration of a programming task, which you may like (or may not).  Each step executes, one step at a time, by successive calls to 'run', like this:

  run (0)   # gets the TxDb data into a form where gene coordinates are handy
  run (1)   # sets up the two motifs
  run (2)   # gets the promoter for AT1G01020
  run (3)   # prints the match to the simple motif sequence
  run (4)   # prints the match to the PWM motif

Global variables (yikes!) are used in the run function, so that the values created at each call live on to be used by the next one.

Let me know what I am leaving out!

We can dig into Herve's techniques for fast whole-genome sequence extraction and search if that's needed.

 - Paul


# sequence and PWM match demo
#------------------------------------------------------------------------------------------------------------------------
library (RUnit)
library (BSgenome.Athaliana.TAIR.TAIR9)
library (TxDb.Athaliana.BioMart.plantsmart12)
#------------------------------------------------------------------------------------------------------------------------
run = function (levels)
{
  if (0 %in% levels) {
    txdb <<- TxDb.Athaliana.BioMart.plantsmart12
    all.gene.info <<- transcriptsBy (txdb, by = "gene")
    } # 0

  if (1 %in% levels) { # a binding
     # http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?ID=MA0005.1&rm=present&collection=CORE
    ag.jaspar.pwm <<- rbind (A = c (0,   2, 38, 61, 73, 31, 23, 20, 22,  1, 27),
                             C = c (87, 86,  9,  5,  1,  3, 16, 13,  0,  0, 25),
                             G = c (2,   0,  7,  2,  1,  4, 17, 26, 64, 82,  9),
                             T = c (1,   2, 36, 22, 15, 52, 34, 31,  4,  7, 29))
    madeup.motif.seq <<- 'GGTTCTT'
    } # 1

  if (2 %in% levels) {
    promoter <<- get.promoter (all.gene.info, 'AT1G01020', 500, 100)
    } # 2

  if (3 %in% levels) {
    print (matchPattern (madeup.motif.seq, promoter))
    } # 3

  if (4 %in% levels) {
    print (matchPWM (ag.jaspar.pwm, promoter, min.score='90%'))
    } # 4

} # run
#------------------------------------------------------------------------------------------------------------------------
# returns a char string from proper strand, reverse complemented if appropriated.
get.promoter = function (all.gene.info, gene.id, upstream=1000, downstream=100)
{
  if (!gene.id %in% names (all.gene.info))
    return (NA)

  gene.info = all.gene.info [[gene.id]]
  start.loc <<- min (start (gene.info))
  end.loc <<- max (end (gene.info))
  strand.name <<- unique (as.character (strand (gene.info)))
  chromosome.name <<- unique (as.character (seqnames (gene.info)))[1]
  has.proper.chromosome.name = length (grep ('Chr', chromosome.name)) == 1    # needs to be of form 'Chr5'
  if (!has.proper.chromosome.name)
    chromosome.name <<- paste ('Chr', chromosome.name, sep='')
  stopifnot (chromosome.name %in% names (Athaliana))
  chr.seq = Athaliana [[chromosome.name]]

  if (strand.name == '+') {
    promoter.start = start.loc - upstream
    promoter.end =   start.loc + downstream
    promoter.seq = subseq (chr.seq, promoter.start, promoter.end)
    } # + strand
  if (strand.name == '-') {
    promoter.start = end.loc - upstream
    promoter.end =   end.loc + downstream -1
    start.xx <<- promoter.start
    end.xx <<- promoter.end
    promoter.seq = reverseComplement (subseq (chr.seq, promoter.start, promoter.end))
    } # else, - strand

  invisible (toString (promoter.seq))

} # get.promoter
#------------------------------------------------------------------------------------------------------------------------
test.get.promoter = function ()
{
  promoter.at1g01020.10  <<- get.promoter (all.gene.info, 'AT1G01020', 10, 0)
  checkEquals (nchar (promoter.at1g01020.10), 10)
  checkEquals (promoter.at1g01020.10, "GACCCGGACT")

  promoter.at1g01020.40  <<- get.promoter (all.gene.info, 'AT1G01020', 20, 20)
  checkEquals (nchar (promoter.at1g01020.40), 40)

} # test.get.promoter
#------------------------------------------------------------------------------------------------------------------------

On Apr 20, 2012, at 7:28 AM, nooshin wrote:

> 
> Thanks for reply.
> 
> Answer for the questions:
> 1) for one TF I have position weight matrix and for the second one I have only the sequence.
> 2) I need both, exact and approximate match
> 3) Yes, exactly :)
> 
> -Nooshin
> 
> 
> On 04/20/2012 04:19 PM, Paul Shannon wrote:
>>> 
>>> I'm so new in doing sequence analysis and I have a small project to do, which I need help for.
>>> I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me.
>>> 
>> Having recently learned how to do something very similar, I'd be glad to help.  A few clarifying questions:
>> 
>>  1) In what form is the binding site expressed?  Simple sequence, position weight matrix?
>>  2) Do you need only exact matches to this motif?  Or approximate matches also?
>>  3) I presume you will be looking for matches in the promoters of your set of genes.  True?
>> 
>>  - Paul
>> 
>>> Thanks,
>>> Nooshin
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 



More information about the Bioconductor mailing list