[BioC] PWM matching using matchPWM in Biostrings

Hervé Pagès hpages at fhcrc.org
Wed May 20 03:40:37 CEST 2009


Hi Kristoffer,

Not easy to read/reproduce your code since you don't show what the
c2s() or s2c() functions are. I guess c2s() is turning a character
vector into a single string, e.g. something like

   c2s <- function(x) paste(x, collapse="")

and s2c() is exploding it e.g. something like

   s2c <- function(x) strsplit(x, NULL, fixed=TRUE)[[1]]

more below...

Kristoffer Rigbolt wrote:
> Hi Bioconductor
> 
>  
> 
> I was pleased to find the matchPWM function in the Biostrings package,
> but I have one fundamental question to the function, or maybe to the
> general approach. 
> 
>  
> 
> To test the specificity of the approach I started out from the example
> given in the matchPWM vignette:
> 
>  
> 
> pwm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
> 
>              C=c( 1,  0,  1,  0,  1, 18,  0,  2),
> 
>              G=c(17,  0,  0,  0,  1,  0,  0,  3),
> 
>              T=c( 1, 20,  0,  0,  0,  1,  0,  8))
> 
>  
> 
> chr3R         <- unmasked(Dmelanogaster$chr3R)
> 
>  
> 
> Next, I generated a randomized DNA string with nucleotide frequencies
> identical to chr3R:
> 
>  
> 
> propBP        <- prop.table(table(s2c(as.character(chr3R))))

Note that alphabetFrequency(chr3R, baseOnly=TRUE, freq=TRUE)[1:4]
is *much* faster!

> 
> chr3R.random  <- sample(names(propBP), length(s2c(as.character(chr3R))),
> TRUE, propBP)
> 
> chr3R.random  <- DNAString(c2s(chr3R.random))
> 
>  
> 
> To extract the number of occurrences of the PWM in the two DNA strings
> as a function of the minimum score threshold I used this for-loop:
> 
>  
> 
> outMat        <- cbind(seq(80,100,2), NA, NA)
> 
> for(i in 1:nrow(outMat)){
> 
>      outMat[i,2]   <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1],
> "%")))
> 
>      outMat[i,3]   <- countPWM(pwm, chr3R.random,
> min.score=c2s(c(outMat[i,1], "%")))
> 
> }
> 
>  
> 
> As expected the number of matches decrease as the minimum score are
> increased, but much to my surprise an almost identical number of matches
> of the PWM in the D. melanogaster DNA string and the randomized DNA
> string occurred. I believe this corresponds to a FDR of 100%(!).

The 'pwm' given in the man page of matchPWM() (and that you are using here)
was chosen arbitrarily and is very short i.e. it has a small number of cols.
More precisely the consensus string associated with this 'pwm' is GTAAACAT
and the distribution of such a small pattern in a genome of the size
of the Fly genome can almost be considered random.

So I'm not too surprised that there is no significant difference between
chr3R and your chr3R.random.

> 
>  
> 
>> outMat
> 
>       min.score chr3R randomized
> 
>  [1,]        80 52695      51122
> 
>  [2,]        82 52695      51122
> 
>  [3,]        84 43225      39996
> 
>  [4,]        86 27684      24943
> 
>  [5,]        88 14037      10459
> 
>  [6,]        90  2836       2435
> 
>  [7,]        92  2836       2435
> 
>  [8,]        94  2836       2435
> 
>  [9,]        96  2836       2435
> 
> [10,]        98  1878       1376
> 
> [11,]       100   660        722
> 
>  
> 
> Based on this observation my question is whether this issue relates to a
> property of the matchPWM function itself or if it is an inherent
> property of using PWMs to predict transcription factor binding sites? If
> the latter is the case what is then the scientific value of this
> approach.

I don't think this is a property of the matchPWM() function itself.
You should get almost the same if you counted the occurences of GTAAACAT
with countPattern().

I think this is just due to the length of the pattern: there *must*
be a lot of occurences of such a short pattern. Furthermore, if you
counted the occurences of all the possible 7-mers in a given chromosome
(with oligonucleotideFrequency()), you would always get numbers very
close to what you get with a fake (i.e. randomly-generated) chromosome.
As long as the length of the chromosome is much bigger than 4^7 (at
least 10x or 20x bigger).

Cheers,
H.


> 
>  
> 
> Hope you can help clarify this issue.
> 
>  
> 
> Best regards
> 
> Kristoffer Rigbolt
> 
> University of Southern Denmark
> 
>  
> 
>> sessionInfo()
> 
> R version 2.9.0 (2009-04-17) 
> 
> i386-pc-mingw32 
> 
>  
> 
> locale:
> 
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> Kingdom.1252;LC_MONETARY=English_United
> Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
> 
>  
> 
> attached base packages:
> 
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> 
>  
> 
> other attached packages:
> 
> [1] seqinr_2.0-3                          
> 
> [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
> 
> [3] BSgenome_1.10.5                       
> 
> [4] Biostrings_2.10.22                    
> 
> [5] IRanges_1.0.16                        
> 
>  
> 
> loaded via a namespace (and not attached):
> 
> [1] grid_2.9.0         lattice_0.17-25    Matrix_0.999375-26 tools_2.9.0
> 
> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list