Hi Bioconductor

 
I was pleased to find the matchPWM function in the Biostrings package,
but I have one fundamental question to the function, or maybe to the
general approach. 

 
To test the specificity of the approach I started out from the example
given in the matchPWM vignette:

 
pwm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),

             C=c( 1,  0,  1,  0,  1, 18,  0,  2),

             G=c(17,  0,  0,  0,  1,  0,  0,  3),

             T=c( 1, 20,  0,  0,  0,  1,  0,  8))

 
chr3R         <- unmasked(Dmelanogaster$chr3R)

 
Next, I generated a randomized DNA string with nucleotide frequencies
identical to chr3R:

 
propBP        <- prop.table(table(s2c(as.character(chr3R))))

chr3R.random  <- sample(names(propBP), length(s2c(as.character(chr3R))),
TRUE, propBP)

chr3R.random  <- DNAString(c2s(chr3R.random))

 
To extract the number of occurrences of the PWM in the two DNA strings
as a function of the minimum score threshold I used this for-loop:

 
outMat        <- cbind(seq(80,100,2), NA, NA)

for(i in 1:nrow(outMat)){

     outMat[i,2]   <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1],
"%")))

     outMat[i,3]   <- countPWM(pwm, chr3R.random,
min.score=c2s(c(outMat[i,1], "%")))

}

 
As expected the number of matches decrease as the minimum score are
increased, but much to my surprise an almost identical number of matches
of the PWM in the D. melanogaster DNA string and the randomized DNA
string occurred. I believe this corresponds to a FDR of 100%(!).

 
> outMat

      min.score chr3R randomized

 [1,]        80 52695      51122

 [2,]        82 52695      51122

 [3,]        84 43225      39996

 [4,]        86 27684      24943

 [5,]        88 14037      10459

 [6,]        90  2836       2435

 [7,]        92  2836       2435

 [8,]        94  2836       2435

 [9,]        96  2836       2435

[10,]        98  1878       1376

[11,]       100   660        722

 
Based on this observation my question is whether this issue relates to a
property of the matchPWM function itself or if it is an inherent
property of using PWMs to predict transcription factor binding sites? If
the latter is the case what is then the scientific value of this
approach.

 
Hope you can help clarify this issue.

 
Best regards

Kristoffer Rigbolt

University of Southern Denmark

 
> sessionInfo()

R version 2.9.0 (2009-04-17) 

i386-pc-mingw32 

 
locale:

LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
Kingdom.1252;LC_MONETARY=English_United
Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

 
attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base


other attached packages:

[1] seqinr_2.0-3                          

[2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11

[3] BSgenome_1.10.5                       

[4] Biostrings_2.10.22                    

[5] IRanges_1.0.16                        

 
loaded via a namespace (and not attached):

[1] grid_2.9.0         lattice_0.17-25    Matrix_0.999375-26 tools_2.9.0


	[[alternative HTML version deleted]]