[Bioc-devel] oligonucleotideFrequency Performance Enhancement

Hervé Pagès hpages at fhcrc.org
Fri Feb 8 00:51:21 CET 2013


Hi Dario,

On 02/05/2013 05:00 PM, Dario Strbenac wrote:
> Hello,
>
> Would it be possible to include an option that first runs a sliding window along all of the strings to find the unique k-mers present in the dataset?

Finding the unique k-mers in the dataset can easily be done with:

   library(Biostrings)

   uniqueOligonucleotides <- function(x, width)
   {
     collapsed_freq <- oligonucleotideFrequency(x, width,
                                                simplify.as="collapsed")
     names(collapsed_freq)[which(collapsed_freq != 0L)]
   }
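
To make the sliding-window idea concrete without Biostrings, here is a minimal base R sketch. It is only an illustration on plain character strings, not how oligonucleotideFrequency() is implemented, and the name count_observed_kmers() is made up for this example.

```r
# Illustrative base R sketch (hypothetical helper, not part of Biostrings):
# count only the k-mers that actually occur in a character vector.
count_observed_kmers <- function(seqs, width)
{
  # Slide a window of size 'width' along each string.
  kmers_per_seq <- lapply(seqs, function(s) {
    n_windows <- nchar(s) - width + 1L
    if (n_windows < 1L)
      return(character(0))
    starts <- seq_len(n_windows)
    substring(s, starts, starts + width - 1L)
  })
  # The columns are the k-mers seen at least once, in sorted order.
  observed <- sort(unique(unlist(kmers_per_seq, use.names = FALSE)))
  counts <- matrix(0L, nrow = length(seqs), ncol = length(observed),
                   dimnames = list(NULL, observed))
  # Fill in one row of counts per input string.
  for (i in seq_along(seqs)) {
    tab <- table(kmers_per_seq[[i]])
    if (length(tab) > 0L)
      counts[i, names(tab)] <- as.integer(tab)
  }
  counts
}

m <- count_observed_kmers(c("ACGTAC", "AAAA"), 2)
dim(m)   # 2 rows and 5 columns, instead of the 4^2 = 16 possible 2-mers
```

The resulting matrix has one column per observed k-mer, so no column is all zeros.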

> This would avoid having a sparse matrix with many columns of all zero counts, when a larger value of width is specified.

Sounds like a useful addition. Maybe we could support this through
a 'drop' argument. When 'drop' is TRUE, it would do something like
this (building on top of uniqueOligonucleotides() and vcountPDict()):

   oligonucleotideFrequency2 <- function(x, width)
   {
     kmers <- uniqueOligonucleotides(x, width)
     pdict <- PDict(kmers)
     ans <- t(vcountPDict(pdict, x))
     colnames(ans) <- kmers
     ans
   }

Then:

   > library(hgu95av2probe)
   > probes <- DNAStringSet(hgu95av2probe)

   > dim(freq1 <- oligonucleotideFrequency(head(probes), 5))
   [1]    6 1024

   > dim(freq2 <- oligonucleotideFrequency2(head(probes), 5))
   [1]  6 99

   > identical(freq2, freq1[ , colnames(freq2)])
   [1] TRUE

   > all(freq1[ , setdiff(colnames(freq1), colnames(freq2))] == 0L)
   [1] TRUE
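
For illustration only, the two behaviours could be combined behind a 'drop' switch roughly as sketched below. This assumes Biostrings is installed; 'drop' is not an existing argument of oligonucleotideFrequency(), and the wrapper name oligonucleotideFrequencyWithDrop() is hypothetical.

```r
library(Biostrings)

# Hypothetical wrapper: 'drop' is NOT an argument of the real
# oligonucleotideFrequency(); this only sketches the proposed behaviour.
oligonucleotideFrequencyWithDrop <- function(x, width, drop = FALSE)
{
  # Default: the usual dense matrix with one column per possible k-mer.
  if (!drop)
    return(oligonucleotideFrequency(x, width))
  # Find the k-mers that occur at least once anywhere in 'x'.
  collapsed_freq <- oligonucleotideFrequency(x, width,
                                             simplify.as = "collapsed")
  kmers <- names(collapsed_freq)[collapsed_freq != 0L]
  # Count only those k-mers in each sequence. vcountPDict() returns one
  # row per pattern, hence the transpose.
  ans <- t(vcountPDict(PDict(kmers), x))
  colnames(ans) <- kmers
  ans
}

x <- DNAStringSet(c("ACGTACGT", "AAAAAAAA"))
freq <- oligonucleotideFrequencyWithDrop(x, 3, drop = TRUE)
```

Note that this sketch still computes the collapsed frequency vector, which has one element per possible k-mer, so it only avoids the wide per-sequence matrix, not every allocation that grows with 'width'.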

Added to my TODO list.

Thanks,
H.

>
> --------------------------------------
> Dario Strbenac
> PhD Student
> University of Sydney
> Camperdown NSW 2050
> Australia

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
