[BioC] How Does one subset a XStringView or PDict object?
Martin Morgan
mtmorgan at fhcrc.org
Sat Feb 5 04:52:38 CET 2011
On 02/04/2011 06:40 PM, Noah Dowell wrote:
> Hello to all,
>
> I am using the excellent BSgenome and Biostrings packages to look for the variety and number of a transcription factor DNA binding motif across the E. coli genome. From biochemistry and molecular biology experiments, we know that our favorite transcription factor binds a fairly degenerate motif. I want to look at the number of times a particular motif occurs in the E. coli genome and see whether specific motifs map to specific genome locations.
>
> Here is a working example of what I have done:
>
> library(BSgenome.Ecoli.NCBI.20080805)
>
>
> # create an object to work with one genome: E. coli str. K-12 substr. MG1655
>
> genome12 <- Ecoli$NC_000913
>
> consensus <- "TGTTCAAAAAATAAGCA"
>
> TFmotifDict = DNAStringSet(consensus)
>
>
> ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)
>
> z = extractAllMatches(genome12, ConsMatch)  # takes the MIndex returned by matchPDict()
>
> x = PDict(z)
>
>
>
> table(patternFrequency(x))
>
> # 1 2 3 4 5
> # 17088 128 60 52 80
>
> So this is working great and providing some interesting results but in reading through the archives and vignettes I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once. See the output of the table function above. I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.
>
> I can do the following to get one:
>
> x[[61]]
>
> Or I can do this:
>
> freq = patternFrequency(x)
> getit = which(freq != 1)
>
> But this only tells me which ones they are.
>
> This could be a pretty basic R task or something specific to these
> types of objects, but I seem to be stuck with my newbie R skills. Thank
> you in advance for any help.
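
Hi Noah,

I ended up at

  unique(tb(x)[patternFrequency(x)==5])

which returns the distinct sequences that occur five times. Changing the
condition to patternFrequency(x) > 1 should select every pattern that
occurs more than once.

This was mostly from looking at the help page for patternFrequency,
guided by a little discovery of the methods that might be relevant to 'x'
with

  showMethods(class=class(x), where=getNamespace("Biostrings"))

(this last is definitely obscure).
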
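
To get genome coordinates as well as sequences for the repeated patterns,
one option is to subset the views object 'z' with the same logical
vector. The following is only a minimal sketch and has not been run
against the E. coli genome. The short 'subject' string and the hand-built
views are made-up stand-ins for genome12 and the result of
extractAllMatches(). It also assumes that PDict() accepts an XStringViews
object (as in the original post) and that views can be subset with a
logical vector.

```r
library(Biostrings)

# Toy stand-in for the genome (hypothetical data, not E. coli sequence).
subject <- DNAString("AAACCCAAAGGGAAATTT")

# Toy stand-in for the views returned by extractAllMatches():
# five width-3 views starting at positions 1, 4, 7, 10 and 13.
z <- Views(subject,
           start = c(1, 4, 7, 10, 13),
           end   = c(3, 6, 9, 12, 15))

# Preprocess the matched sequences, as in the original post.
x <- PDict(z)

# Number of times each sequence occurs among the matches.
freq <- patternFrequency(x)

# Keep only the matches whose sequence occurs more than once.
keep <- freq > 1
repeated <- z[keep]

# One row per retained match: coordinates, sequence and frequency.
result <- data.frame(
  start     = start(repeated),
  end       = end(repeated),
  sequence  = as.character(repeated),
  frequency = freq[keep],
  stringsAsFactors = FALSE
)
print(result)
```

With the real data, 'subject' and 'z' would be replaced by genome12 and
the views from extractAllMatches(), and the rest should carry over
unchanged.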
Martin
> Best,
>
> Noah
>
>
>> sessionInfo()
> R version 2.12.1 (2010-12-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5
> [3] Biostrings_2.16.9 GenomicRanges_1.0.7
> [5] IRanges_1.6.11
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0 tools_2.12.1
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793