[BioC] How Does one subset a XStringView or PDict object?

Sun Feb 6 19:09:12 CET 2011

Thank you Martin!  That should work nicely; the patternFrequency man page was one I missed.  The showMethods is a good general tip that I can put to use.

Best,

noah

On Feb 4, 2011, at 7:52 PM, Martin Morgan wrote:

> On 02/04/2011 06:40 PM, Noah Dowell wrote:
>> Hello to all,
>> 
>> I am using the excellent BSGenome and Biostrings packages to look for the variety and number of a transcription factor DNA binding motif across the E. coli genome.  From biochemistry and molecular biology experiments we know our favorite transcription factor binds a fairly degenerate motif.  I want to look at the number of times a particular motif occurs in the E. coli genome and see if specific motifs map to specific genome locations.
>> 
>> Here is a working example of what I have done:
>> 
>> library(BSgenome.Ecoli.NCBI.20080805)
>> 
>> 
>> # create and object to work with one genome: Ecoli str. K-12 substr. MG1655
>> 
>> genome12 <- Ecoli$NC_000913 
>> 
>> consensus <- "TGTTCAAAAAATAAGCA"
>> 
>> TFmotifDict = DNAStringSet(consensus)
>> 
>> 
>> ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)
>> 
>> z = extractAllMatches(genome12, TFmotifDict)  
>> 
>> x = PDict(z)			
>> 
>> 
>> 	
>> table(patternFrequency(x))
>> 
>> #    1     2     3     4     5 
>> # 17088   128    60    52    80 
>> 
>> So this is working great and providing some interesting results but in reading through the archives and vignettes I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once.  See the output of the table function above.  I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.
>> 
>> I can do the following to get one:
>> 
>> x[[61]]
>> 
>> Or I can do this:
>> 
>> freq = patternFrequency(x)
>> getit  = which(freq != 1)
>> 
>> But this only tells me which ones they are.  
>> 
>> This could be a pretty basic R task or something specific to these
> types of objects but I seem to be stuck with my newbie R skills. Thank
> you in advance for any help.
> 
> Hi Noah
> 
> I ended up at
> 
>  unique(tb(x)[patternFrequency(x)==5])
> 
> This was mostly from looking at the help page for patternFrequency,
> guided by a little discovery on those that might be relevant to 'x' with
> 
>  showMethods(class=class(x), where=getNamespace("Biostrings"))
> 
> (this last is definitely obscure).
> 
> Martin
> 
>> Best,
>> 
>> Noah
>> 
>> 
>>> sessionInfo()
>> R version 2.12.1 (2010-12-16)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base     
>> 
>> other attached packages:
>> [1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5                    
>> [3] Biostrings_2.16.9                   GenomicRanges_1.0.7                
>> [5] IRanges_1.6.11                     
>> 
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0 tools_2.12.1 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> 
> Location: M1-B861
> Telephone: 206 667-2793