[BioC] PWMmatch: position weight matrix or position frequency matrix?
Hervé Pagès
hpages at fhcrc.org
Tue Feb 22 19:59:18 CET 2011
Hi Zuzanna,
On 02/17/2011 06:59 AM, Zuzanna Makowska wrote:
> Dear List,
>
> I have a question regarding the matchPWM function of Biostrings package.
>
> The help page for the function states that it requires a position weight matrix as an input. At the same time I found a post on the list giving a following example of the use of the function:
>
> (quoting:
> [BioC] matching transcription factor binding sites
>
> Herve Pages hpages at fhcrc.org
>
> Sat Apr 19 02:41:03 CEST 2008)
>
> Suppose 'pwm' contains a Position Weight Matrix, let's say:
>
> pwm<- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7),
> C=c( 1, 0, 1, 0, 1, 18, 0, 2),
> G=c(17, 0, 0, 0, 1, 0, 0, 3),
> T=c( 1, 20, 0, 0, 0, 1, 0, 8))
>
> Note that this is just a standard integer matrix with the 4 DNA base letters
> as row names (having these row names is mandatory).
> m<- matchPWM(pwm, chr1, min.score="90%")
>
> It seems to me that the matrix in this example is a position frequency matrix and not a position weight matrix (the difference between the two is explained nicely in: Applied Bioinformatics for the identification of regulatory elements; WW Wasserman&A Sandelin, Nat Rev Genet, 2004).
Thanks for the pointer to Wasserman & Sandelin's paper.
I confirm that the matchPWM() function expects the input to be
a position *weight* matrix. What may make the 'pwm' object above
look like a position *frequency* matrix is that, unlike the matrices
in Wasserman & Sandelin's paper, it contains non-negative integer weights.
Furthermore, all the columns sum to the same value:
> colSums(pwm)
[1] 20 20 20 20 20 20 20 20
I understand that this is indeed misleading.
But 'pwm' could also be something like:
pwm <- rbind(A=c(0.06, -0.02, 0.30),
             C=c(0.00,  0.17, 0.00),
             G=c(0.03,  0.05, 0.12),
             T=c(0.22, -0.01, 0.08))
The matchPWM() function really treats it as a position-specific
scoring matrix. You can check this by computing the score at a few
given positions:
> PWMscoreStartingAt(pwm, DNAString("TTCAA"), starting.at=1:3)
[1] 0.21 0.69 0.28
Then, as you can see, matchPWM() returns the match at the position
that produces the highest score (here the only position whose score
reaches the min.score cutoff):
> matchPWM(pwm, DNAString("TTCAA"))
  Views on a 5-letter DNAString subject
subject: TTCAA
views:
    start end width
[1]     2   4     3 [TCA]
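The scoring rule behind the numbers above can be reproduced by hand. The following is only a minimal sketch in base R (Biostrings is not needed): score_at() is a hypothetical helper, not a Biostrings function, and the 80% cutoff is an assumed value expressed as a fraction of the highest score any window can reach.

```r
pwm <- rbind(A=c(0.06, -0.02, 0.30),
             C=c(0.00,  0.17, 0.00),
             G=c(0.03,  0.05, 0.12),
             T=c(0.22, -0.01, 0.08))

# Hypothetical helper (not part of Biostrings): score of the window of
# ncol(pwm) letters that starts at position 'start' of 'subject'.
score_at <- function(pwm, subject, start) {
    cols  <- seq_len(ncol(pwm))
    bases <- strsplit(subject, "")[[1]][start + cols - 1L]
    # Pick one cell per column (the row of the observed base) and add them up.
    sum(pwm[cbind(match(bases, rownames(pwm)), cols)])
}

# Scores of the 3 windows of "TTCAA": TTC, TCA and CAA.
scores <- sapply(1:3, function(s) score_at(pwm, "TTCAA", s))

# Highest score any window can reach: the sum of the column maxima.
max_score <- sum(apply(pwm, 2, max))

# Start positions whose score reaches the assumed 80% cutoff.
which(scores >= 0.8 * max_score)
```

Each score is just the sum of one weight per column, so negative or non-integer entries pose no problem, and nothing in the computation requires the columns to sum to a common value.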
Finally, note that the Biostrings package doesn't provide a tool
to convert a position frequency matrix (which can be obtained with
consensusMatrix()) into a position weight matrix.
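For illustration only, such a conversion could be sketched in base R along the following lines. pfm_to_pwm() is a hypothetical helper, not a Biostrings function; the pseudocount of 1 (spread evenly over the 4 bases to avoid taking the log of zero) and the uniform background are assumptions, and other choices are just as legitimate.

```r
# Position frequency (count) matrix, using the counts from the quoted example.
pfm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
             C=c( 1,  0,  1,  0,  1, 18,  0,  2),
             G=c(17,  0,  0,  0,  1,  0,  0,  3),
             T=c( 1, 20,  0,  0,  0,  1,  0,  8))

# Hypothetical helper (not part of Biostrings).
# Assumptions: 'pseudocount' is shared equally among the 4 bases, and
# 'background' gives the background probabilities in row order (A, C, G, T).
pfm_to_pwm <- function(pfm, pseudocount = 1, background = rep(0.25, 4)) {
    n <- colSums(pfm)
    # Corrected probability of each base at each position.
    prob <- sweep(pfm + pseudocount / 4, 2, n + pseudocount, "/")
    # Log-odds against the background (recycled down the rows).
    log2(prob / background)
}

pwm2 <- pfm_to_pwm(pfm)
round(pwm2, 2)
```

The result has the same shape and row names as the input, but its entries are log-odds weights: positive where a base is over-represented relative to the background and negative where it is under-represented.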
Hope this helps,
H.
>
> Could somebody clarify what is the expected input for this function?
>
> Thanks in advance,
>
> Zuzanna Makowska
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319