[BioC] Count differences between sequences
Martin Morgan
mtmorgan at fhcrc.org
Sun Mar 28 23:23:19 CEST 2010
On 03/28/2010 01:17 PM, Martin Morgan wrote:
> Hi Erik et al.,
>
> Not offering anything practical, but...
>
> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
>> Hi Patrick, Michael, Hervé, and Martin,
> f0 <- function(dna)
> {
> l <- unique(width(dna))
> w <- length(dna)
>
> dna0 <- factor(unlist(strsplit(as.character(dna), "")))
> idx <- seq(1, l * w, l) # 'column' offsets
> i <- seq_len(w)
>
> d <- matrix(l, w, w)
> for (j in seq_len(l)-1L) {
> grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
> ## for (k in grp)
> ## d[k, k] = d[k, k] - 1L
> }
> d
> }
>
> Internally, 'split' does its work in 2N comparisons. This important part
> is that comparisons are accomplished in linear rather than polynomial time.
> same nucleotide (i.e., in k). This is unfortunately the slow part of the
> revised code, and it is polynomial time -- in the best case, for 4
> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
> and in the worst case (all nucleotides identical) we do N^2 updates. We
> pay a heavy price for doing this at the R level, and basically get
> swamped by the update rather than sort cost.
> But we shouldn't lose track of the fact that we can do the sorting much
> more efficiently. I've also looked a little at sort.list(,
> method="radix"), which orders a factor very quickly.
dampening my enthusiasm a bit here, as even implemented in C the update
operation still dominates, and the algorithm remains polynomial.
Martin
More information about the Bioconductor
mailing list