[Bioc-devel] oligonucleotideFrequency Performance Enhancement
Hervé Pagès
hpages at fhcrc.org
Fri Feb 8 00:51:21 CET 2013
Hi Dario,
On 02/05/2013 05:00 PM, Dario Strbenac wrote:
> Hello,
>
> Would it be possible to include an option that first goes through all of the strings and runs a sliding window along them, to find all the unique k-mers present in the dataset?
Finding the unique k-mers in the dataset can easily be done with:
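  library(Biostrings)

  uniqueOligonucleotides <- function(x, width)
  {
      ## Count each possible k-mer across all the sequences in 'x'.
      collapsed_freq <- oligonucleotideFrequency(x, width,
                                                 simplify.as="collapsed")
      ## Keep only the k-mers that occur at least once.
      names(collapsed_freq)[which(collapsed_freq != 0L)]
  }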
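To make the idea concrete, here is a minimal sketch of the same steps on a toy input. The two sequences are made up for illustration and are not from the original post:

```r
library(Biostrings)

# Toy input (hypothetical sequences, chosen only for illustration)
x <- DNAStringSet(c("ACGTAC", "ACGGAC"))

# Counts of all 16 possible 2-mers, summed over both sequences
collapsed_freq <- oligonucleotideFrequency(x, 2, simplify.as="collapsed")

# Names of the 2-mers that occur at least once
present <- names(collapsed_freq)[collapsed_freq != 0L]
present
```

Only 6 of the 16 possible 2-mers occur in this input, so the other 10 would be all-zero columns in the full frequency matrix.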
> This would avoid having a sparse matrix with many columns of all zero counts, when a larger value of width is specified.
Sounds like a useful addition. Maybe we could support this through
a 'drop' argument. When 'drop' is TRUE, it would do something like
this (building on top of uniqueOligonucleotides() and vcountPDict()):
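  oligonucleotideFrequency2 <- function(x, width)
  {
      ## k-mers that occur at least once in 'x'.
      kmers <- uniqueOligonucleotides(x, width)
      pdict <- PDict(kmers)
      ## vcountPDict() returns one row per k-mer and one column per
      ## sequence, so transpose to get one row per sequence.
      ans <- t(vcountPDict(pdict, x))
      colnames(ans) <- kmers
      ans
  }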
Then:
  > library(hgu95av2probe)
  > probes <- DNAStringSet(hgu95av2probe)
  > dim(freq1 <- oligonucleotideFrequency(head(probes), 5))
  [1]    6 1024
  > dim(freq2 <- oligonucleotideFrequency2(head(probes), 5))
  [1]  6 99
  > identical(freq2, freq1[ , colnames(freq2)])
  [1] TRUE
  > all(freq1[ , setdiff(colnames(freq1), colnames(freq2))] == 0L)
  [1] TRUE
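For comparison, a 'drop' option could also be sketched by computing the full matrix and then removing the all-zero columns. The wrapper name oligoFreqDrop() and its 'drop' argument are hypothetical and not part of Biostrings. Note that this variant still allocates all 4^width columns before dropping any, so it only tidies the result and does not help with memory when 'width' is large:

```r
library(Biostrings)

# Hypothetical wrapper: the name and the 'drop' argument are
# illustrative only, not an existing Biostrings API.
oligoFreqDrop <- function(x, width, drop=FALSE)
{
    freq <- oligonucleotideFrequency(x, width)
    if (drop) {
        # Keep only the k-mers seen at least once. The 'drop=FALSE'
        # in the subsetting keeps the result a matrix even when a
        # single column remains.
        freq <- freq[ , colSums(freq) != 0, drop=FALSE]
    }
    freq
}

x <- DNAStringSet(c("ACGTAC", "ACGGAC"))  # toy input, made up
dim(oligoFreqDrop(x, 2))
dim(oligoFreqDrop(x, 2, drop=TRUE))
```

The PDict-based approach above avoids building the full matrix in the first place, which is the point when most k-mers are absent.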
Added to my TODO list.
Thanks,
H.
>
> --------------------------------------
> Dario Strbenac
> PhD Student
> University of Sydney
> Camperdown NSW 2050
> Australia
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319