[BioC] Count differences between sequences
Erik Wright
eswright at wisc.edu
Mon Mar 29 14:21:37 CEST 2010
Hi Martin,
Thanks for sharing your method of calculating a distance matrix. I would never have thought of implementing the distance matrix in such a way. Correct me if I am wrong, but any algorithm for computing a distance matrix will take at least polynomial time because it inherently requires comparing N to (N-1) strings.
My problem has to do with the handling of gap characters in Biostrings methods. I would like to treat the gap character as a wildcard. So, in another sense, what I am looking for is a way to extend the IUPAC_CODE_MAP to include the gap character:
IUPAC_CODE_MAP <- c(IUPAC_CODE_MAP, "-"="ACGT")
That way neditAt("A","-") would be 0 and not 1, and neditAt("-","-") would remain zero. Perhaps the easiest way to do this would be to somehow replace all my gaps with the "N" character.
Thanks!,
Erik
On Mar 28, 2010, at 4:23 PM, Martin Morgan wrote:
> On 03/28/2010 01:17 PM, Martin Morgan wrote:
>> Hi Erik et al.,
>>
>> Not offering anything practical, but...
>>
>> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
>>> Hi Patrick, Michael, Hervé, and Martin,
>
>> f0 <- function(dna)
>> {
>> l <- unique(width(dna))
>> w <- length(dna)
>>
>> dna0 <- factor(unlist(strsplit(as.character(dna), "")))
>> idx <- seq(1, l * w, l) # 'column' offsets
>> i <- seq_len(w)
>>
>> d <- matrix(l, w, w)
>> for (j in seq_len(l)-1L) {
>> grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
>> ## for (k in grp)
>> ## d[k, k] = d[k, k] - 1L
>> }
>> d
>> }
>>
>> Internally, 'split' does its work in 2N comparisons. This important part
>> is that comparisons are accomplished in linear rather than polynomial time.
>
>> same nucleotide (i.e., in k). This is unfortunately the slow part of the
>> revised code, and it is polynomial time -- in the best case, for 4
>> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
>> and in the worst case (all nucleotides identical) we do N^2 updates. We
>> pay a heavy price for doing this at the R level, and basically get
>> swamped by the update rather than sort cost.
>
>> But we shouldn't lose track of the fact that we can do the sorting much
>> more efficiently. I've also looked a little at sort.list(,
>> method="radix"), which orders a factor very quickly.
>
> dampening my enthusiasm a bit here, as even implemented in C the update
> operation still dominates, and the algorithm remains polynomial.
>
> Martin
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list