[Bioc-devel] oligonucleotideFrequency Performance Enhancement

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Feb 8 02:59:46 CET 2013


For reference, Jellyfish is supposed to be state of the art for fast
k-mer counting
  http://www.cbcb.umd.edu/software/jellyfish/

Kasper

On Thu, Feb 7, 2013 at 6:51 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Dario,
>
>
> On 02/05/2013 05:00 PM, Dario Strbenac wrote:
>>
>> Hello,
>>
>> Would it be possible to include an option that first goes through all of
>> the strings and runs a sliding window along them, to find all the unique
>> k-mers present in the dataset?
>
>
> Finding the unique k-mers in the dataset can easily be done with:
>
>   library(Biostrings)
>
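>   uniqueOligonucleotides <- function(x, width)
>   {
>     # Sum the k-mer counts over all the sequences in 'x'
>     collapsed_freq <- oligonucleotideFrequency(x, width,
>                                                simplify.as="collapsed")
>     # Keep only the k-mers that occur at least once
>     names(collapsed_freq)[which(collapsed_freq != 0L)]
>   }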
>
>> This would avoid having a sparse matrix with many columns of all zero
>> counts, when a larger value of width is specified.
>
>
> Sounds like a useful addition. Maybe we could support this through
> a 'drop' argument. When 'drop' is TRUE, it would do something like
> this (building on top of uniqueOligonucleotides() and vcountPDict()):
>
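>   oligonucleotideFrequency2 <- function(x, width)
>   {
>     # Use only the k-mers observed in 'x' as the pattern dictionary
>     kmers <- uniqueOligonucleotides(x, width)
>     pdict <- PDict(kmers)
>     # vcountPDict() returns one row per pattern and one column per
>     # sequence, so transpose to get one row per sequence
>     ans <- t(vcountPDict(pdict, x))
>     colnames(ans) <- kmers
>     ans
>   }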
>
> Then:
>
>   > library(hgu95av2probe)
>   > probes <- DNAStringSet(hgu95av2probe)
>
>   > dim(freq1 <- oligonucleotideFrequency(head(probes), 5))
>   [1]    6 1024
>
>   > dim(freq2 <- oligonucleotideFrequency2(head(probes), 5))
>   [1]  6 99
>
>   > identical(freq2, freq1[ , colnames(freq2)])
>   [1] TRUE
>
>   > all(freq1[ , setdiff(colnames(freq1), colnames(freq2))] == 0L)
>   [1] TRUE
>
> Added to my TODO list.
>
> Thanks,
>
> H.
>
>>
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

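The sliding-window idea discussed in the thread can also be sketched in base R, without Biostrings. This is only an illustration of the approach, not how oligonucleotideFrequency() is implemented: countObservedKmers() is a hypothetical helper name, it works on a plain character vector rather than a DNAStringSet, and it is not tuned for speed.

```r
# Hypothetical helper (not part of Biostrings): count only the k-mers
# that actually occur in a character vector of sequences.
countObservedKmers <- function(seqs, width)
{
    # All overlapping windows of length 'width' in a single sequence
    windows <- function(s) {
        n <- nchar(s) - width + 1L
        if (n < 1L)
            return(character(0))
        starts <- seq_len(n)
        substring(s, starts, starts + width - 1L)
    }
    per_seq <- lapply(seqs, windows)

    # The columns are limited to the k-mers seen somewhere in 'seqs'
    kmers <- sort(unique(unlist(per_seq)))
    ans <- matrix(0L, nrow=length(seqs), ncol=length(kmers),
                  dimnames=list(NULL, kmers))

    # Fill in one row of counts per sequence
    for (i in seq_along(seqs)) {
        w <- per_seq[[i]]
        if (length(w) == 0L)
            next
        counts <- table(w)
        ans[i, names(counts)] <- as.integer(counts)
    }
    ans
}

freq <- countObservedKmers(c("ACGTAC", "TTACGT"), 3)
dim(freq)   # 2 5, instead of the 4^3 = 64 columns of the full table
```

As in the 'drop' proposal above, every column of the result has at least one non-zero count. For large inputs, a compiled counter would be a better fit than this loop.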

