[Bioc-devel] Regarding the function: oligonucleotideFrequency for k-mers > 11 bps

Fri Nov 28 14:15:53 CET 2014

This question is better asked on the Bioconductor support site

   https://support.bioconductor.org

please re-post the question (and great answer from Ulrich!) there. I think 
there's a straight-forward Biostrings solution, too, and I'm sure it will be 
posted when the question appears in the appropriate forum.

Martin

On 11/28/2014 01:24 AM, Ulrich Bodenhofer wrote:
> Dear Rodrigo,
>
> We recently released our 'kebabs' package as part of the Bioconductor 3.0
> release. Though this package is aimed at providing sequence kernels for
> classification, regression, and other tasks such as similarity-based clustering,
> the package includes functionality that exactly matches your needs. See the
> following example:
>
>     s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
>     s1
>     s2 <- DNAString(paste0(rep(s, 200), collate=""))
>     s2
>
>     library(kebabs)
>     sk13 <- spectrumKernel(k=13, normalized=FALSE)
>     system.time(kmerFreq <- drop(getExRep(s1, sk13)))
>     kmerFreq
>     system.time(kmerFreq <- drop(getExRep(s2, sk13)))
>     kmerFreq
>
> So you see that the k-mer frequencies are obtained as the explicit feature
> vector of the standard (unnormalized) spectrum kernel with k=13. This function
> is implemented in highly efficient C++ code that builds up a prefix tree and
> only considers k-mers that actually occur in the sequence (as you requested).
> You see that even for k=13 and a sequence with tens of thousands of bases, the
> computations only take fractions of a second (19 msecs on our 5-year-old Dell
> server). The above function also works for DNAStringSets, but, in this case, you
> should remove the drop() to get a matrix of k-mer frequencies. The matrix is by
> default sparse (class 'dgRMatrix'), but you can also enforce the result to be in
> standard dense matrix format (however, still omitting k-mers that do not occur
> at all in any of the sequences):
>
>     sv <- c(DNAStringSet(s), DNAStringSet(s2))
>     system.time(kmerFreq <- getExRep(sv, sk13))
>     kmerFreq
>     system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
>     kmerFreq
>
> How long the k-mers may be, may depend on your system. On our system, the limit
> seems to be k=22 for DNA sequences. The same works for RNA and amino acid
> sequences. For the latter, however, the limits in terms of k are significantly
> lower, since the feature space is obviously much larger for the same k.
>
> I hope that helps. If you have any further questions, please let us know.
>
> Best regards,
> Ulrich
>
>
>
> Message-ID: <BAY179-W424D25C87634BDEBAD7393C8710 at phx.gbl>
> From: Rodrigo Bertollo de Alexandre <rodrigodealexandre at hotmail.com>
> To: "bioc-devel at r-project.org" <bioc-devel at r-project.org>
> Date: Thu, 27 Nov 2014 21:38:39 -0200
> Subject: [Bioc-devel] Regarding the function: oligonucleotideFrequency for
>      k-mers > 11 bps
>
> I've seen that it is almost impossible to work with k-mers as big as 13 with
> this function. This is mainly because this function doesn't create a list of
> k-mers from the sequence but from all possible combinations.
> This is basically a bug, since in a big sequence of 1000 bps the maximum number
> of 13-mers is L-k+1 = 988. While the number of possible 13-mers is 4^k =
> 28561.This means that the code is basically analyzing 27573 nonexistent k-mers.
> I'm wondering if there could have a modification in the package regarding this
> issue...
> I did my own function for this (which it runs ok). However, having all you need
> in a unique package would be even better...(I posted my code on the
> stackoverflow: http://stackoverflow.com/a/27178731/4004499)
> Sincerely,Rodrigo
>      [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793