[BioC] PWM matching using matchPWM in Biostrings
Patrick Aboyoun
paboyoun at fhcrc.org
Wed May 20 03:52:08 CEST 2009
Kristoffer,
I don't know the details of the matchPWM and countPWM examples in the
man page for matchPWM, but my guess is that they were created to show
how the software is used rather than demonstrating a hunt for a real TF
binding site in D. melanogaster using a PWM. From my experience, PWM are
useful for a first-order approximation of where transcription factors
bind. One nice aspect they have is they lend themselves to relatively
fast computations. The downside is that the strong PWM assumption of
positional independence most likely does not hold for the TF-binding
site of interest, so it is hard to believe the hits you get constitute
reality.
Patrick
Kristoffer Rigbolt wrote:
> Hi Bioconductor
>
>
>
> I was pleased to find the matchPWM function in the Biostrings package,
> but I have one fundamental question to the function, or maybe to the
> general approach.
>
>
>
> To test the specificity of the approach I started out from the example
> given in the matchPWM vignette:
>
>
>
> pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7),
>
> C=c( 1, 0, 1, 0, 1, 18, 0, 2),
>
> G=c(17, 0, 0, 0, 1, 0, 0, 3),
>
> T=c( 1, 20, 0, 0, 0, 1, 0, 8))
>
>
>
> chr3R <- unmasked(Dmelanogaster$chr3R)
>
>
>
> Next, I generated a randomized DNA string with nucleotide frequencies
> identical to chr3R:
>
>
>
> propBP <- prop.table(table(s2c(as.character(chr3R))))
>
> chr3R.random <- sample(names(propBP), length(s2c(as.character(chr3R))),
> TRUE, propBP)
>
> chr3R.random <- DNAString(c2s(chr3R.random))
>
>
>
> To extract the number of occurrences of the PWM in the two DNA strings
> as a function of the minimum score threshold I used this for-loop:
>
>
>
> outMat <- cbind(seq(80,100,2), NA, NA)
>
> for(i in 1:nrow(outMat)){
>
> outMat[i,2] <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1],
> "%")))
>
> outMat[i,3] <- countPWM(pwm, chr3R.random,
> min.score=c2s(c(outMat[i,1], "%")))
>
> }
>
>
>
> As expected the number of matches decrease as the minimum score are
> increased, but much to my surprise an almost identical number of matches
> of the PWM in the D. melanogaster DNA string and the randomized DNA
> string occurred. I believe this corresponds to a FDR of 100%(!).
>
>
>
>
>> outMat
>>
>
> min.score chr3R randomized
>
> [1,] 80 52695 51122
>
> [2,] 82 52695 51122
>
> [3,] 84 43225 39996
>
> [4,] 86 27684 24943
>
> [5,] 88 14037 10459
>
> [6,] 90 2836 2435
>
> [7,] 92 2836 2435
>
> [8,] 94 2836 2435
>
> [9,] 96 2836 2435
>
> [10,] 98 1878 1376
>
> [11,] 100 660 722
>
>
>
> Based on this observation my question is whether this issue relates to a
> property of the matchPWM function itself or if it is an inherent
> property of using PWMs to predict transcription factor binding sites? If
> the latter is the case what is then the scientific value of this
> approach.
>
>
>
> Hope you can help clarify this issue.
>
>
>
> Best regards
>
> Kristoffer Rigbolt
>
> University of Southern Denmark
>
>
>
>
>> sessionInfo()
>>
>
> R version 2.9.0 (2009-04-17)
>
> i386-pc-mingw32
>
>
>
> locale:
>
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> Kingdom.1252;LC_MONETARY=English_United
> Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
>
>
> attached base packages:
>
> [1] stats graphics grDevices utils datasets methods base
>
>
>
>
> other attached packages:
>
> [1] seqinr_2.0-3
>
> [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
>
> [3] BSgenome_1.10.5
>
> [4] Biostrings_2.10.22
>
> [5] IRanges_1.0.16
>
>
>
> loaded via a namespace (and not attached):
>
> [1] grid_2.9.0 lattice_0.17-25 Matrix_0.999375-26 tools_2.9.0
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list