[Bioc-devel] Regarding the function: oligonucleotideFrequency for k-mers > 11 bps
Ulrich Bodenhofer
bodenhofer at bioinf.jku.at
Fri Nov 28 10:24:12 CET 2014
Dear Rodrigo,
We recently released our 'kebabs' package as part of the Bioconductor
3.0 release. Though this package is aimed at providing sequence kernels
for classification, regression, and other tasks such as similarity-based
clustering, the package includes functionality that exactly matches your
needs. See the following example:
s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
s2 <- DNAString(paste0(rep(s, 200), collate=""))
s2
library(kebabs)
sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq
So you see that the k-mer frequencies are obtained as the explicit
feature vector of the standard (unnormalized) spectrum kernel with k=13.
This function is implemented in highly efficient C++ code that builds up
a prefix tree and only considers k-mers that actually occur in the
sequence (as you requested). You see that even for k=13 and a sequence
with tens of thousands of bases, the computations only take fractions of
a second (19 msecs on our 5-year-old Dell server). The above function
also works for DNAStringSets, but, in this case, you should remove the
drop() to get a matrix of k-mer frequencies. The matrix is by default
sparse (class 'dgRMatrix'), but you can also enforce the result to be in
standard dense matrix format (however, still omitting k-mers that do not
occur at all in any of the sequences):
sv <- c(DNAStringSet(s), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
How long the k-mers may be, may depend on your system. On our system,
the limit seems to be k=22 for DNA sequences. The same works for RNA and
amino acid sequences. For the latter, however, the limits in terms of k
are significantly lower, since the feature space is obviously much
larger for the same k.
I hope that helps. If you have any further questions, please let us know.
Best regards,
Ulrich
Message-ID: <BAY179-W424D25C87634BDEBAD7393C8710 at phx.gbl>
From: Rodrigo Bertollo de Alexandre <rodrigodealexandre at hotmail.com>
To: "bioc-devel at r-project.org" <bioc-devel at r-project.org>
Date: Thu, 27 Nov 2014 21:38:39 -0200
Subject: [Bioc-devel] Regarding the function: oligonucleotideFrequency for
k-mers > 11 bps
I've seen that it is almost impossible to work with k-mers as big as 13
with this function. This is mainly because this function doesn't create
a list of k-mers from the sequence but from all possible combinations.
This is basically a bug, since in a big sequence of 1000 bps the maximum
number of 13-mers is L-k+1 = 988. While the number of possible 13-mers
is 4^k = 28561.This means that the code is basically analyzing 27573
nonexistent k-mers.
I'm wondering if there could have a modification in the package
regarding this issue...
I did my own function for this (which it runs ok). However, having all
you need in a unique package would be even better...(I posted my code on
the stackoverflow: http://stackoverflow.com/a/27178731/4004499)
Sincerely,Rodrigo
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list