[BioC] How does one subset an XStringViews or PDict object?
Noah Dowell
noahd at UCLA.EDU
Sat Feb 5 03:40:42 CET 2011
Hello to all,
I am using the excellent BSgenome and Biostrings packages to survey the variety and number of occurrences of a transcription factor DNA-binding motif across the E. coli genome. From biochemistry and molecular biology experiments we know that our favorite transcription factor binds a fairly degenerate motif. I want to count how many times each particular variant of the motif occurs in the E. coli genome and see whether specific variants map to specific genome locations.
Here is a working example of what I have done:
library(BSgenome.Ecoli.NCBI.20080805)
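# Create an object to work with one genome: E. coli str. K-12 substr. MG1655
genome12 <- Ecoli$NC_000913
consensus <- "TGTTCAAAAAATAAGCA"
TFmotifDict <- DNAStringSet(consensus)
# Find every genome position matching the consensus with up to 7 mismatches
ConsMatch <- matchPDict(TFmotifDict, genome12, max.mismatch = 7)
# Turn the match index into views (start, end, sequence) on the genome
z <- extractAllMatches(genome12, ConsMatch)
# Build a dictionary from the matched sequences and count how often each occurs
x <- PDict(z)
table(patternFrequency(x))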
# 1 2 3 4 5
# 17088 128 60 52 80
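As a reading aid for the table above, here is a small base R sketch. It assumes that patternFrequency() returns one value per entry of the dictionary, so that a sequence occurring k times is counted k times in the table; the fact that every count is divisible by its frequency is consistent with that, but it is an assumption. The variable names are made up for illustration.

```r
# Counts copied from the table() output above: names are frequencies,
# values are the number of dictionary entries reporting that frequency.
counts <- c("1" = 17088, "2" = 128, "3" = 60, "4" = 52, "5" = 80)
freqs <- as.integer(names(counts))

# Under the stated assumption, a sequence seen k times contributes k entries,
# so dividing by k gives the number of distinct sequences per frequency.
distinct <- counts / freqs

repeated_matches <- sum(counts[freqs > 1])     # 128 + 60 + 52 + 80 = 320
repeated_distinct <- sum(distinct[freqs > 1])  # 64 + 20 + 13 + 16 = 113
```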
This works well and gives some interesting results. However, after reading through the archives and vignettes, I still have not figured out how to subset my motif dictionary down to the small class of motifs that occur more than once (see the output of the table function above). I want to get the start and end genome locations and the sequence for the 128 + 60 + 52 + 80 patterns.
I can do the following to get one:
x[[61]]
Or I can do this:
freq = patternFrequency(x)
getit = which(freq != 1)
But this only tells me which ones they are.
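One possible way to go from those indices to the matches themselves is ordinary R subsetting of the views object, then reading off start(), end() and the sequences. The sketch below is not taken from the package documentation for this use case. It uses a made-up toy sequence and made-up start positions in place of the genome and of z, and it computes the per-match frequency with base R table() rather than patternFrequency(), so treat it as an illustration under those assumptions.

```r
library(Biostrings)

# Toy stand-ins (hypothetical data): a short subject instead of the genome,
# and four fixed-width views instead of the real match views 'z'.
subject <- DNAString("ACGTTTACGTGGACGAACCT")
z <- Views(subject, start = c(1, 7, 13, 17), width = 4)

# Frequency of each matched sequence, aligned with the views.
# This plays the role of 'freq' in the post.
seqs <- as.character(z)
freq <- as.integer(table(seqs)[seqs])

# Logical subsetting keeps only the views whose sequence occurs more than once.
repeated <- z[freq > 1]

# Collect genome coordinates and sequences in a plain data frame.
hits <- data.frame(
  start = start(repeated),
  end = end(repeated),
  sequence = as.character(repeated),
  stringsAsFactors = FALSE
)
hits
```

With the real objects from the post, the same idea would be z[freq != 1], assuming z and freq have the same length and order.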
This could be a pretty basic R task or something specific to these types of objects, but I am stuck with my newbie R skills. Thank you in advance for any help.
Best,
Noah
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5
[3] Biostrings_2.16.9 GenomicRanges_1.0.7
[5] IRanges_1.6.11
loaded via a namespace (and not attached):
[1] Biobase_2.8.0 tools_2.12.1