[Bioc-devel] oligonucleotideFrequency Performance Enhancement
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Fri Feb 8 02:59:46 CET 2013
For reference, Jellyfish is supposed to be the state of the art for fast
k-mer counting:
http://www.cbcb.umd.edu/software/jellyfish/
Kasper
On Thu, Feb 7, 2013 at 6:51 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Dario,
>
>
> On 02/05/2013 05:00 PM, Dario Strbenac wrote:
>>
>> Hello,
>>
>> Would it be possible to include an option that first goes through all of
>> the strings, running a sliding window along them, to find all the unique
>> k-mers present in the dataset?
>
>
> Finding the unique k-mers in the dataset can easily be done with:
>
> library(Biostrings)
>
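> uniqueOligonucleotides <- function(x, width)
> {
>     ## Total count of each possible k-mer across all sequences in 'x'.
>     collapsed_freq <- oligonucleotideFrequency(x, width,
>                                                simplify.as="collapsed")
>     ## Keep only the k-mers that occur at least once.
>     names(collapsed_freq)[which(collapsed_freq != 0L)]
> }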
>
>> This would avoid having a sparse matrix with many columns of all zero
>> counts, when a larger value of width is specified.
>
>
> Sounds like a useful addition. Maybe we could support this thru
> a 'drop' arg. When 'drop' is TRUE, it would do something like
> this (building on top of uniqueOligonucleotides() and vcountPDict()):
>
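> oligonucleotideFrequency2 <- function(x, width)
> {
>     ## k-mers that actually occur in 'x'.
>     kmers <- uniqueOligonucleotides(x, width)
>     ## Preprocess them into a dictionary for fast matching.
>     pdict <- PDict(kmers)
>     ## vcountPDict() returns one row per k-mer and one column per
>     ## sequence, so transpose to get one row per sequence.
>     ans <- t(vcountPDict(pdict, x))
>     colnames(ans) <- kmers
>     ans
> }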
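
The same 'drop' idea can be sketched without Biostrings, purely as an
illustration of the sliding-window approach asked about above. The helper
name countObservedKmers() is hypothetical and not part of any package. It
uses only base R string functions on a plain character vector, so it is
likely to be much slower than oligonucleotideFrequency() or vcountPDict()
on real data:

```r
## Hypothetical helper (not part of Biostrings): count k-mers per sequence,
## keeping one column only for each k-mer seen somewhere in the dataset.
countObservedKmers <- function(seqs, width)
{
    ## Slide a window of size 'width' along each string.
    kmers_per_seq <- lapply(seqs, function(s) {
        n <- nchar(s) - width + 1L
        if (n < 1L)
            return(character(0))
        starts <- seq_len(n)
        substring(s, starts, starts + width - 1L)
    })
    ## The columns are the k-mers observed anywhere in the dataset.
    observed <- sort(unique(unlist(kmers_per_seq)))
    ## Tabulate each sequence against that shared set of k-mers.
    counts <- lapply(kmers_per_seq, function(k)
        tabulate(match(k, observed), nbins=length(observed)))
    matrix(as.integer(unlist(counts)),
           nrow=length(seqs), ncol=length(observed), byrow=TRUE,
           dimnames=list(NULL, observed))
}

m <- countObservedKmers(c("AAAAC", "ACGT"), 2)
dim(m)        # 2 4
colnames(m)   # "AA" "AC" "CG" "GT"
```

Here the result has 4 columns rather than the 4^2 = 16 columns of the full
dinucleotide matrix.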
>
> Then:
>
> > library(hgu95av2probe)
> > probes <- DNAStringSet(hgu95av2probe)
>
> > dim(freq1 <- oligonucleotideFrequency(head(probes), 5))
> [1] 6 1024
>
> > dim(freq2 <- oligonucleotideFrequency2(head(probes), 5))
> [1] 6 99
>
> > identical(freq2, freq1[ , colnames(freq2)])
> [1] TRUE
>
> > all(freq1[ , setdiff(colnames(freq1), colnames(freq2))] == 0L)
> [1] TRUE
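
The dimensions above are consistent with a simple count, sketched here under
the assumption (not stated in the thread) that the hgu95av2 probes are 25
nucleotides long. kmerColumnBounds() is a hypothetical helper. There are
4^width possible k-mers, but a sequence of length n only contains
n - width + 1 windows, which caps the number of distinct k-mers that can
show up:

```r
## Hypothetical helper: number of columns in the full matrix versus an
## upper bound on the number of k-mers that can actually be observed.
kmerColumnBounds <- function(n_seqs, seq_length, width)
{
    possible <- 4^width                           # columns in the full matrix
    windows <- n_seqs * (seq_length - width + 1)  # sliding-window positions
    c(possible=possible, max_observed=min(possible, windows))
}

## Assumption: 6 probes of 25 nucleotides each, as in head(probes).
kmerColumnBounds(n_seqs=6, seq_length=25, width=5)
## possible = 1024, max_observed = 126
```

The 99 columns reported for freq2 fall under that bound of 126.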
>
> Added to my TODO list.
>
> Thanks,
>
> H.
>
>>
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>