[BioC] How does one subset an XStringViews or PDict object?

Sat Feb 5 03:40:42 CET 2011

Hello to all,

I am using the excellent BSgenome and Biostrings packages to look at the variety and number of occurrences of a transcription factor DNA binding motif across the E. coli genome. From biochemistry and molecular biology experiments we know that our favorite transcription factor binds a fairly degenerate motif. I want to count how many times each particular motif variant occurs in the E. coli genome and see whether specific variants map to specific genome locations.

Here is a working example of what I have done:

library(BSgenome.Ecoli.NCBI.20080805)

# create an object to work with one genome: Ecoli str. K-12 substr. MG1655

genome12 <- Ecoli$NC_000913 

consensus <- "TGTTCAAAAAATAAGCA"

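TFmotifDict = DNAStringSet(consensus)

# find all genome locations matching the consensus with up to 7 mismatches
ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)

# extract the matched sequences as views on the genome
# (extractAllMatches takes the match index, not the dictionary)
z = extractAllMatches(genome12, ConsMatch)

# build a dictionary from the matched sequences
x = PDict(z)

# tabulate how many times each matched sequence occurs
table(patternFrequency(x))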

#    1     2     3     4     5 
# 17088   128    60    52    80 

This works well and gives some interesting results, but after reading through the archives and vignettes I still have not figured out how to subset my motif dictionary down to the small class of motifs that occur more than once (see the output of table() above). I want the start and end genome locations and the sequences for those 128 + 60 + 52 + 80 patterns.

I can do the following to get one:

x[[61]]

Or I can do this:

freq = patternFrequency(x)
getit  = which(freq != 1)

But this only tells me which ones they are.  
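
To make the goal concrete, here is a minimal base-R sketch of the kind of subsetting I am after. It runs on toy data only: the data frame `hits` and its columns `start`, `end` and `seq` are made up for illustration and are not Biostrings output. I assume that something analogous, such as indexing the views object `z` with `getit`, would apply to the real objects, but I have not confirmed this.

```r
# Toy stand-in for the matches: one row per hit (made-up values, not genome output)
hits <- data.frame(
  start = c(10, 55, 120, 300, 410, 777),
  end   = c(26, 71, 136, 316, 426, 793),
  seq   = c("TGTTCAAAAAATAAGCA", "TGTACAAAAAATAAGCA", "TGTTCAAAAAATAAGCA",
            "AGTTCAAATAATAAGCA", "TGTACAAAAAATAAGCA", "TGTTCAAAAAATAAGCA"),
  stringsAsFactors = FALSE
)

# For each hit, count how often its sequence occurs among all hits
# (plays the role of patternFrequency(x) in the real example)
freq <- as.vector(table(hits$seq)[hits$seq])

# Indices of hits whose sequence occurs more than once
getit <- which(freq != 1)

# Use those indices to pull out the locations and sequences themselves
repeated <- hits[getit, ]
print(repeated)
```

In this toy case the fourth hit has a unique sequence and is dropped, while the other five rows are kept together with their start and end positions.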

This may be a pretty basic R task or something specific to these types of objects, but with my newbie R skills I seem to be stuck. Thank you in advance for any help.

Best,

Noah

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5                    
[3] Biostrings_2.16.9                   GenomicRanges_1.0.7                
[5] IRanges_1.6.11                     

loaded via a namespace (and not attached):
[1] Biobase_2.8.0 tools_2.12.1