[BioC] PWM matching using matchPWM in Biostrings

Wed May 20 03:52:08 CEST 2009

Kristoffer,
I don't know the details of the matchPWM and countPWM examples in the 
man page for matchPWM, but my guess is that they were created to show 
how the software is used rather than demonstrating a hunt for a real TF 
binding site in D. melanogaster using a PWM. From my experience, PWM are 
useful for a first-order approximation of where transcription factors 
bind. One nice aspect they have is they lend themselves to relatively 
fast computations. The downside is that the strong PWM assumption of 
positional independence most likely does not hold for the TF-binding 
site of interest, so it is hard to believe the hits you get constitute 
reality.

Patrick

Kristoffer Rigbolt wrote:
> Hi Bioconductor
>
>  
>
> I was pleased to find the matchPWM function in the Biostrings package,
> but I have one fundamental question to the function, or maybe to the
> general approach. 
>
>  
>
> To test the specificity of the approach I started out from the example
> given in the matchPWM vignette:
>
>  
>
> pwm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
>
>              C=c( 1,  0,  1,  0,  1, 18,  0,  2),
>
>              G=c(17,  0,  0,  0,  1,  0,  0,  3),
>
>              T=c( 1, 20,  0,  0,  0,  1,  0,  8))
>
>  
>
> chr3R         <- unmasked(Dmelanogaster$chr3R)
>
>  
>
> Next, I generated a randomized DNA string with nucleotide frequencies
> identical to chr3R:
>
>  
>
> propBP        <- prop.table(table(s2c(as.character(chr3R))))
>
> chr3R.random  <- sample(names(propBP), length(s2c(as.character(chr3R))),
> TRUE, propBP)
>
> chr3R.random  <- DNAString(c2s(chr3R.random))
>
>  
>
> To extract the number of occurrences of the PWM in the two DNA strings
> as a function of the minimum score threshold I used this for-loop:
>
>  
>
> outMat        <- cbind(seq(80,100,2), NA, NA)
>
> for(i in 1:nrow(outMat)){
>
>      outMat[i,2]   <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1],
> "%")))
>
>      outMat[i,3]   <- countPWM(pwm, chr3R.random,
> min.score=c2s(c(outMat[i,1], "%")))
>
> }
>
>  
>
> As expected the number of matches decrease as the minimum score are
> increased, but much to my surprise an almost identical number of matches
> of the PWM in the D. melanogaster DNA string and the randomized DNA
> string occurred. I believe this corresponds to a FDR of 100%(!).
>
>  
>
>   
>> outMat
>>     
>
>       min.score chr3R randomized
>
>  [1,]        80 52695      51122
>
>  [2,]        82 52695      51122
>
>  [3,]        84 43225      39996
>
>  [4,]        86 27684      24943
>
>  [5,]        88 14037      10459
>
>  [6,]        90  2836       2435
>
>  [7,]        92  2836       2435
>
>  [8,]        94  2836       2435
>
>  [9,]        96  2836       2435
>
> [10,]        98  1878       1376
>
> [11,]       100   660        722
>
>  
>
> Based on this observation my question is whether this issue relates to a
> property of the matchPWM function itself or if it is an inherent
> property of using PWMs to predict transcription factor binding sites? If
> the latter is the case what is then the scientific value of this
> approach.
>
>  
>
> Hope you can help clarify this issue.
>
>  
>
> Best regards
>
> Kristoffer Rigbolt
>
> University of Southern Denmark
>
>  
>
>   
>> sessionInfo()
>>     
>
> R version 2.9.0 (2009-04-17) 
>
> i386-pc-mingw32 
>
>  
>
> locale:
>
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> Kingdom.1252;LC_MONETARY=English_United
> Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
>  
>
> attached base packages:
>
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
>  
>
> other attached packages:
>
> [1] seqinr_2.0-3                          
>
> [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
>
> [3] BSgenome_1.10.5                       
>
> [4] Biostrings_2.10.22                    
>
> [5] IRanges_1.0.16                        
>
>  
>
> loaded via a namespace (and not attached):
>
> [1] grid_2.9.0         lattice_0.17-25    Matrix_0.999375-26 tools_2.9.0
>
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>