[Bioc-devel] Regarding the function: oligonucleotideFrequency for k-mers > 11 bps

Ulrich Bodenhofer bodenhofer at bioinf.jku.at
Fri Nov 28 10:24:12 CET 2014

Dear Rodrigo,

We recently released our 'kebabs' package as part of the Bioconductor 
3.0 release. Though this package is aimed at providing sequence kernels 
for classification, regression, and other tasks such as similarity-based 
clustering, the package includes functionality that exactly matches your 
needs. See the following example:

    s2 <- DNAString(paste0(rep(s, 200), collate=""))

    sk13 <- spectrumKernel(k=13, normalized=FALSE)
    system.time(kmerFreq <- drop(getExRep(s1, sk13)))
    system.time(kmerFreq <- drop(getExRep(s2, sk13)))

So you see that the k-mer frequencies are obtained as the explicit 
feature vector of the standard (unnormalized) spectrum kernel with k=13. 
This function is implemented in highly efficient C++ code that builds up 
a prefix tree and only considers k-mers that actually occur in the 
sequence (as you requested). You see that even for k=13 and a sequence 
with tens of thousands of bases, the computations only take fractions of 
a second (19 msecs on our 5-year-old Dell server). The above function 
also works for DNAStringSets, but, in this case, you should remove the 
drop() to get a matrix of k-mer frequencies. The matrix is by default 
sparse (class 'dgRMatrix'), but you can also enforce the result to be in 
standard dense matrix format (however, still omitting k-mers that do not 
occur at all in any of the sequences):

    sv <- c(DNAStringSet(s), DNAStringSet(s2))
    system.time(kmerFreq <- getExRep(sv, sk13))
    system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))

How long the k-mers may be, may depend on your system. On our system, 
the limit seems to be k=22 for DNA sequences. The same works for RNA and 
amino acid sequences. For the latter, however, the limits in terms of k 
are significantly lower, since the feature space is obviously much 
larger for the same k.

I hope that helps. If you have any further questions, please let us know.

Best regards,

Message-ID: <BAY179-W424D25C87634BDEBAD7393C8710 at phx.gbl>
From: Rodrigo Bertollo de Alexandre <rodrigodealexandre at hotmail.com>
To: "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Date: Thu, 27 Nov 2014 21:38:39 -0200
Subject: [Bioc-devel] Regarding the function: oligonucleotideFrequency for
     k-mers > 11 bps

I've seen that it is almost impossible to work with k-mers as big as 13 
with this function. This is mainly because this function doesn't create 
a list of k-mers from the sequence but from all possible combinations.
This is basically a bug, since in a big sequence of 1000 bps the maximum 
number of 13-mers is L-k+1 = 988. While the number of possible 13-mers 
is 4^k = 28561.This means that the code is basically analyzing 27573 
nonexistent k-mers.
I'm wondering if there could have a modification in the package 
regarding this issue...
I did my own function for this (which it runs ok). However, having all 
you need in a unique package would be even better...(I posted my code on 
the stackoverflow: http://stackoverflow.com/a/27178731/4004499)
     [[alternative HTML version deleted]]

Bioc-devel at r-project.org mailing list

More information about the Bioc-devel mailing list