[BioC] How Does one subset a XStringView or PDict object?
Noah Dowell
noahd at ucla.edu
Sun Feb 6 19:09:12 CET 2011
Thank you Martin! That should work nicely; the patternFrequency man page was one I missed. The showMethods is a good general tip that I can put to use.
Best,
noah
On Feb 4, 2011, at 7:52 PM, Martin Morgan wrote:
> On 02/04/2011 06:40 PM, Noah Dowell wrote:
>> Hello to all,
>>
>> I am using the excellent BSGenome and Biostrings packages to look for the variety and number of a transcription factor DNA binding motif across the E. coli genome. From biochemistry and molecular biology experiments we know our favorite transcription factor binds a fairly degenerate motif. I want to look at the number of times a particular motif occurs in the E. coli genome and see if specific motifs map to specific genome locations.
>>
>> Here is a working example of what I have done:
>>
>> library(BSgenome.Ecoli.NCBI.20080805)
>>
>>
>> # create and object to work with one genome: Ecoli str. K-12 substr. MG1655
>>
>> genome12 <- Ecoli$NC_000913
>>
>> consensus <- "TGTTCAAAAAATAAGCA"
>>
>> TFmotifDict = DNAStringSet(consensus)
>>
>>
>> ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)
>>
>> z = extractAllMatches(genome12, TFmotifDict)
>>
>> x = PDict(z)
>>
>>
>>
>> table(patternFrequency(x))
>>
>> # 1 2 3 4 5
>> # 17088 128 60 52 80
>>
>> So this is working great and providing some interesting results but in reading through the archives and vignettes I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once. See the output of the table function above. I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.
>>
>> I can do the following to get one:
>>
>> x[[61]]
>>
>> Or I can do this:
>>
>> freq = patternFrequency(x)
>> getit = which(freq != 1)
>>
>> But this only tells me which ones they are.
>>
>> This could be a pretty basic R task or something specific to these
> types of objects but I seem to be stuck with my newbie R skills. Thank
> you in advance for any help.
>
> Hi Noah
>
> I ended up at
>
> unique(tb(x)[patternFrequency(x)==5])
>
> This was mostly from looking at the help page for patternFrequency,
> guided by a little discovery on those that might be relevant to 'x' with
>
> showMethods(class=class(x), where=getNamespace("Biostrings"))
>
> (this last is definitely obscure).
>
> Martin
>
>> Best,
>>
>> Noah
>>
>>
>>> sessionInfo()
>> R version 2.12.1 (2010-12-16)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5
>> [3] Biostrings_2.16.9 GenomicRanges_1.0.7
>> [5] IRanges_1.6.11
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0 tools_2.12.1
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
More information about the Bioconductor
mailing list