[BioC] Count differences between sequences

Fri Mar 26 19:13:49 CET 2010

Hi Erik, Patrick,

Patrick Aboyoun wrote:
> Erik,
> Judging from your data, I would gather that you are not interested in 
> indels. Is that correct? You should look at the neditStartingAt 
> function. Something like the following may meet your needs:
> 
> N <- length(myStrings)
> myDists <- matrix(0, nrow = N, ncol = N)
> for (i in 1:(N-1))
>      for (j in (i+1):N)
>          myDists[i, j] <- myDists[j, i] <-
>              neditStartingAt(myStrings[[i]], myStrings[[j]])
> 

Note that you can take advantage of the fact that neditStartingAt() is
vectorized with respect to its second argument:

N <- length(myStrings)
myDists <- sapply(1:N,
     function(i) neditStartingAt(myStrings[[i]], myStrings))

That will make things a hundred times faster with a big "DNA rectangle"
like yours (500x8000):

 > system.time(
     myDists <- sapply(1:N,
         function(i) neditStartingAt(myStrings[[i]], myStrings)))
    user  system elapsed
  12.053   0.000  12.723

 > myDists[1:10, 1:10]
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
  [1,]    0 5888 6055 5947 6152 6248 6038 6175 6268  6047
  [2,] 5888    0 5849 6113 5926 6148 6117 5956 6167  6204
  [3,] 6055 5849    0 6053 6184 5959 6137 6077 5997  6041
  [4,] 5947 6113 6053    0 6111 6167 5910 6096 6121  5959
  [5,] 6152 5926 6184 6111    0 6038 6209 5906 6019  6194
  [6,] 6248 6148 5959 6167 6038    0 6085 6112 5924  6137
  [7,] 6038 6117 6137 5910 6209 6085    0 5961 6192  5947
  [8,] 6175 5956 6077 6096 5906 6112 5961    0 5899  6183
  [9,] 6268 6167 5997 6121 6019 5924 6192 5899    0  5984
[10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984     0
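
For anyone who wants to try the row-at-a-time idea without the real data,
here is a minimal base R sketch on a toy set of aligned, equal-width
sequences. It is only an illustration, not how neditStartingAt() is
implemented: the names seqs, chars and toyDists are made up for this
example, and only mismatches are counted (no indels).

```r
# Toy input: aligned sequences of equal width (hypothetical example data).
seqs <- c("ACAC", "ACAG", "TCAG")

# Split into an N x width character matrix, one row per sequence.
chars <- do.call(rbind, strsplit(seqs, ""))
N <- nrow(chars)

# For each sequence i, compare it against ALL sequences at once:
# sweep() tests every row of 'chars' against row i, column by column,
# and rowSums() counts the mismatching positions per row.
toyDists <- sapply(seq_len(N), function(i)
    rowSums(sweep(chars, 2, chars[i, ], "!=")))

print(toyDists)
```

As with the sapply() call above, the inner comparison handles one sequence
against the whole set, so there is only one R-level loop over N instead of
a double loop over all pairs.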

Cheers,
H.

> 
> Patrick
> 
> 
> On 3/25/10 2:57 PM, erikwright at comcast.net wrote:
>> I have 500 DNAStrings, all of length 8000.  I need the entire N x N 
>> distance matrix.
>>
>> Thanks,
>> Erik
>>
>>
>>
>> ----- Original Message -----
>> From: "Patrick Aboyoun" <paboyoun at fhcrc.org>
>> To: erikwright at comcast.net
>> Cc: bioconductor at stat.math.ethz.ch
>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central
>> Subject: Re: [BioC] Count differences between sequences
>>
>> Erik,
>> Could you provide more details on your data? How long is each of the 
>> strings and how many strings do you have? Also, do you need the entire 
>> N x N distance matrix for downstream analysis or are you just looking 
>> for closest relatives?
>>
>>
>> Patrick
>>
>>
>>
>> On 3/25/10 2:29 PM, erikwright at comcast.net wrote:
>>> Hello all,
>>>
>>>
>>> I have a large DNAStringSet and I am trying to calculate its 
>> distance matrix. My DNAStrings are equal width and they are already 
>> aligned.
>>>
>>> I have tried using the stringDist() function, but it is very slow 
>> for large DNAStringSets. Is there a way to quickly calculate the 
>> number of differences between two DNAString instances?
>>>
>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I 
>> would like to know if there is a function other than stringDist() that 
>> will tell me the distance between them is 1.
>>>
>>> Thanks in advance for any help.
>>>
>>>
>>> - Erik
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319