[BioC] How Does one subset a XStringView or PDict object?

Martin Morgan mtmorgan at fhcrc.org
Sat Feb 5 04:52:38 CET 2011


On 02/04/2011 06:40 PM, Noah Dowell wrote:
> Hello to all,
> 
> I am using the excellent BSGenome and Biostrings packages to look for the variety and number of a transcription factor DNA binding motif across the E. coli genome.  From biochemistry and molecular biology experiments we know our favorite transcription factor binds a fairly degenerate motif.  I want to look at the number of times a particular motif occurs in the E. coli genome and see if specific motifs map to specific genome locations.
> 
> Here is a working example of what I have done:
> 
> library(BSgenome.Ecoli.NCBI.20080805)
> 
> 
> # create and object to work with one genome: Ecoli str. K-12 substr. MG1655
> 
> genome12 <- Ecoli$NC_000913 
> 
> consensus <- "TGTTCAAAAAATAAGCA"
> 
> TFmotifDict = DNAStringSet(consensus)
> 
> 
> ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)
> 
> z = extractAllMatches(genome12, TFmotifDict)  
> 
> x = PDict(z)			
> 
> 
> 	
> table(patternFrequency(x))
> 
> #    1     2     3     4     5 
> # 17088   128    60    52    80 
> 
> So this is working great and providing some interesting results but in reading through the archives and vignettes I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once.  See the output of the table function above.  I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.
> 
> I can do the following to get one:
> 
> x[[61]]
> 
> Or I can do this:
> 
> freq = patternFrequency(x)
> getit  = which(freq != 1)
> 
> But this only tells me which ones they are.  
> 
> This could be a pretty basic R task or something specific to these
types of objects but I seem to be stuck with my newbie R skills. Thank
you in advance for any help.

Hi Noah

I ended up at

  unique(tb(x)[patternFrequency(x)==5])

This was mostly from looking at the help page for patternFrequency,
guided by a little discovery on those that might be relevant to 'x' with

  showMethods(class=class(x), where=getNamespace("Biostrings"))

(this last is definitely obscure).

Martin

> Best,
> 
> Noah
> 
> 
>> sessionInfo()
> R version 2.12.1 (2010-12-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5                    
> [3] Biostrings_2.16.9                   GenomicRanges_1.0.7                
> [5] IRanges_1.6.11                     
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0 tools_2.12.1 
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioconductor mailing list