[BioC] Count differences between sequences
Hervé Pagès
hpages at fhcrc.org
Fri Mar 26 19:13:49 CET 2010
Hi Erik, Patrick,
Patrick Aboyoun wrote:
> Erik,
> Judging from your data, I would gather that you are not interested in
> indels. Is that correct? You should look at the neditStartingAt
> function. Something like the following may meet your needs:
>
> N <- length(myStrings)
> myDists <- matrix(0, nrow = N, ncol = N)
> for (i in 1:(N-1))
> for (j in (i+1):N)
> myDists[i, j] <- myDists[j, i] <-
> neditStartingAt(myStrings[[i]], myStrings[[j]])
>
Note that you can take advantage of the fact that neditStartingAt() is
vectorized with respect to its second argument:
N <- length(myStrings)
myDists <- sapply(1:N,
function(i) neditStartingAt(myStrings[[i]], myStrings)))
That will make things hundred times faster with a big "DNA rectangle"
like yours (500x8000):
> system.time(
myDists <- sapply(1:N,
function(i) neditStartingAt(myStrings[[i]], myStrings)))
user system elapsed
12.053 0.000 12.723
> myDists[1:10, 1:10]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047
[2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204
[3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041
[4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959
[5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194
[6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137
[7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947
[8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183
[9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984
[10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0
Cheers,
H.
>
> Patrick
>
>
> On 3/25/10 2:57 PM, erikwright at comcast.net wrote:
>> I have 500 DNAStrings, all of length 8000. I need the entire N x N
>> distance matrix.
>>
>> Thanks,
>> Erik
>>
>>
>>
>> ----- Original Message -----
>> From: "Patrick Aboyoun" <paboyoun at fhcrc.org>
>> To: erikwright at comcast.net
>> Cc: bioconductor at stat.math.ethz.ch
>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central
>> Subject: Re: [BioC] Count differences between sequences
>>
>> Erik,
>> Could you provide more details on your data? How long are each of the
>> strings and how many strings do you have? Also, do you need the entire
>> N x N distance matrix for downstream analysis or are you just looking
>> for closest relatives?
>>
>>
>> Patrick
>>
>>
>>
>> On 3/25/10 2:29 PM, erikwright at comcast.net wrote:
>>> Hello all,
>>>
>>>
>>> I have a large DNAStringSet and I am trying to calculate its
>> distance matrix. My DNAStrings are equal width and they are already
>> aligned.
>>>
>>> I have tried using the stringDist() function, but it is very slow
>> for large DNAStringSets. Is there a way to quickly calculate the
>> number of differences between two DNAString instances?
>>>
>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I
>> would like to know if their is a function other than stringDist() that
>> will tell me the distance between them is 1.
>>>
>>> Thanks in advance for any help.
>>>
>>>
>>> - Erik
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list