[BioC] A basic question about retrieve the absolute genome position from the output of of function matchPWM .

Sun Oct 28 06:56:43 CET 2012

Hi Holly,

See comments and suggestions below.

On Oct 27, 2012, at 7:54 PM, Holly wrote:

> Hi, Steve,
> Thank you.
> 
> 49449428+999-354 =49450073
> 49449428+999-347= 49450080
> By searching
> 
> chr3:49450073-49450080 I got [GGTGAGGC], which is reversely complementary to [AAGCCTCA] as expected.
> 
>> reverseComplement(DNAString("GGTGAGGC"))
>  8-letter "DNAString" instance
> seq: CAGCCTCA
> 
> More question is, to screen the "promoter" region, what should I do?
> Should I focus on the "upstream" of "positive" strand? Does it equally to search the "downstream" of "negative" strand, if the program reports sequence of only negative strain?

getPromoterSeq is strand-aware in two ways.  For gene transcripts annotated to the minus strand:

   1) It interprets 'upstream' as 'increasing chromosomal coordinates'.
   2) It returns the reverse complement of the extracted sequence, which is what you probably want.

In the following, I specify an upstream length of just 10, to make experimentation easier:

library(org.Hs.eg.db)
library(MotifDb)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
geneID <-as.character (get ("RHOA", org.Hs.egSYMBOL2EG))
loc <- transcriptsBy(txdb, by="gene") [geneID][[1]][1]
 GRanges with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr3 [49396579, 49449428]      - |     15517  uc010hku.3

getPromoterSeq(loc, Hsapiens, upstream=10, downstream=0)
   A DNAStringSet instance of length 1
     width seq
        10 TCCCCGCCCT

If you point the UCSC genome browser, using genome=hg19, to chr3:49449428:49449437, click the 'complement bases' arrow in the top left, you will see, reading from right to left, and starting at chr3:49449437, the same sequence returned by getPromoterSeq: TCCCCGCCCT.  

Another handy trick is to ask getPromoterSeq to return 20 bases upstream.  This is a long enough sequence to use UCSC BLAT, which will take you right to the proper location in the genome browser.  http://genome.ucsc.edu/cgi-bin/hgBlat?command=start.  BLAT will try to match your input sequence on both strands, and report which one it found.

I believe that all of this demonstrates that getPromoterSeq and its upstream and downstream arguments are relative to the gene transcript's TSS, and are corrected to return a "gene's-eye" view of the sequence.  For a minus strand gene, upstream is in increasing genome coordinates, and the returned sequence has been reverse complemented.  For a plus strand gene, upstream is decreasing genome coordinates, and the returned sequence can be read off the genome from left  to right.  You don't have to worry about strand, nor compensate for it.

Another thing to keep in mind here is something suggested by Steve a couple of weeks ago: it may be better to think in terms of proximal regulatory regions, a full 5000 bp upstream AND downstream of the transcription start site.  This is especially applicable to metazoans, and to mammals in particular.

I hope this helps,

 - Paul

> Thanks again,
> Holly
> 
> On 10/27/2012 6:19 PM, Steve Lianoglou wrote:
> 
>> Hi Holly,
>> 
>> On Sat, Oct 27, 2012 at 7:23 PM, Holly <xyang2 at uchicago.edu> wrote:
>>> Dear all,
>>> I am learning how to "Finding Candidate Binding Sites for Known
>>> Transcription Factors via Sequence Matching" from the Bioconductor
>>> online help file at
>>> http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/
>>> 
>>> When I ran the following codes, I can not match the results back to the
>>> UCSC genome browser results.
>>> 
>>> ||>|orfs <- as.character(mget("RHOA", org.Hs.egSYMBOL2EG))
>>> ||||>||  pfm.grhl <- query(MotifDb,"GRHL")[[1]]
>>>> pcm.grhl <- round(100 * pfm.grhl)
>>>> chromosomal.loc <-  transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") [orfs]
>>>> promoter.seqs <-  getPromoterSeq(chromosomal.loc, Hsapiens, upstream=1000, downstream=0)
>>>> promoter.seqs <- unlist(promoter.seqs)
>>>> matchPWM(pcm.grhl, promoter.seqs[[1]], "80%")
>>>   Views on a 1000-letter DNAString subject
>>> subject: AGCCCAGGTCAGGTTAGAGACCACTGGTATTGTT...GGCACTCGGAGGCGCGCACGTCGTTCCCCGCCCT
>>> views:
>>>      start end width
>>> [1]   347 354     8 [AAGCCTCA]
>>> 
>>>> chromosomal.loc
>>> GRangesList of length 1:
>>> $387
>>> GRanges with 2 ranges and 2 metadata columns:
>>>        seqnames               ranges strand |     tx_id     tx_name
>>>           <Rle>            <IRanges>  <Rle> | <integer> <character>
>>>    [1]     chr3 [49396579, 49449428]      - |     15517  uc010hku.3
>>>    [2]     chr3 [49396579, 49449526]      - |     15518  uc003cwu.3
>>> 
>>> ---
>>> seqlengths:
>>>                    chr1                  chr2 ...        chrUn_gl000249
>>>               249250621             243199373 ...                 38502
>>> 
>>> As|||49396579-1000+347=49395926, I use it as an absolute position.|
>>> However, when I visit the UCSC genome browser, for the region chr3:||||49395926||-||||49395934.||
>>>   I got [GTGCTCCTC], instead of [|||AAGCCTCA] or similar.|
>>> 
>>> So, could anyone tell me how to calculate the absolute genome  position from the output of of function matchPWM ?.
>> Beware of the evil strand.
>> 
>> This transcript is on the negative strand, so the 1000bp upstream of
>> the first transcript you've listed starts at its end, namely
>> `(end+1):(end+999)`
>> 
>> HTH,
>> -steve
>> 
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor