[BioC] Count differences between sequences

Tue Mar 30 00:39:22 CEST 2010

Hi Erik,

To address your "gap problem", one solution is to replace the "-" in
one of the 2 strings to compare with a letter that is guaranteed not
to be in the other string, e.g. the "+" letter (yes, this is part of
the DNA alphabet, and if it was not, you could always work with
BStringSet objects). This can be done with chartr():

   numF <- length(myDNAStringSet)

   distMatrix <- matrix(integer(numF*numF), nrow=numF)

   for (i in 1:(numF-1)) {
     pattern <- chartr("-", "+", myDNAStringSet[[i]])
     # use IUPAC_CODE_MAP with fixed=FALSE
     distMatrix[i, (i+1):numF] <- neditStartingAt(
                                    pattern,
                                    myDNAStringSet[(i+1):numF],
                                    fixed=FALSE)
     distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF]
   }

   diag(distMatrix) <- 0L  # I think you want 0 here, not 1

Takes about 15 sec. on my machine for your 500x8000 DNA rectangle.

Note that I'm using an integer matrix which is twice smaller than a
numeric/double matrix.

You could maybe get a small speed improvement by doing this replacement
once for all at the DNAStringSet level before you start the loop with:

   myDNAStringSet2 <- chartr("-", "+", myDNAStringSet)

and by adapting the code in the loop but I don't expect this to make a
big difference.

Cheers,
H.

erikwright at comcast.net wrote:
> Hi Patrick, Michael, HervÃ©, and Martin, 
> 
> 
> Wow! Thanks very much everyone for putting so much effort into answering my question. While we wait for Patrick's update of stringDist to become available, I will describe my solution and a related problem I have run into. 
> 
> 
> I am using this piece of code to do most of the work: 
> 
> 
> 
> 
> 
> 
> numF <- length ( myDNAStringSet ) 
> 
> for ( i in 1 :( numF - 1 )) { 
> 
> # use IUPAC_CODE_MAP with fixed=FALSE 
> 
> distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt ( 
> 
> myDNAStringSet [[ i ]], 
> 
> myDNAStringSet [( i + 1 ): numF ], 
> 
> fixed = FALSE ) 
> 
> distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF ] 
> 
> 
> } 
> 
> diag ( distMatrix ) <- 1 
> 
> 
> This will populate a matrix with the number of differences between strings. The code works reasonably fast - about 15 seconds for a DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with 8,000 sequences of length 8,000 is exponentially slower - about 45 minutes. Obviously, if Patrick's update of stringDist improves the speed it will greatly help. 
> 
> 
> In playing around with this I have run into a related problem that you all might be able to help me with. As I stated in my original email, I am trying to calculate a Distance Matrix with a large aligned DNAStringSet. My DNAStringSet does not contain ideal data, meaning there are many gap characters ("-"). If I find the edit difference between two strings that have gaps in the same places then the gaps are not counted towards the edit distance. This makes complete sense because "-" == "-". My problem is that I want to include any gap characters in my edit distance. For example: 
> 
> 
> 
>> x <- DNAString("--A") 
>> y <- DNAString("--A") 
>> neditStartingAt(x,y) 
> [1] 0 
> 
> 
> The distance between these two strings for my distance measurement should be 2. That is because the gap character ("-") gives me no data, so I cannot say that the distance between the two strings is 0. Can anyone think of a good method of counting common gap characters toward the edit distance? Unfortunately I cannot simply count up the number of gap characters in my pattern and add it to my edit distance. For example: 
> 
> 
> 
> 
>> x <- DNAString("AC--") 
>> y <- DNAString("ACC-") 
>> neditStartingAt(x,y) 
> [1] 1 
> 
> 
> The distance between these two strings for my distance measurement should be 2. If I were to add the number of gap characters in the pattern to the edit distance I would get 2 + 1 = 3. So basically I want to count any gap character in the pattern or subject toward the edit distance because it gives me no information. Does this make sense, and does anyone have an idea for how to do this? 
> 
> 
> Thanks again, 
> Erik 
> 
> 
> 
> ----- Original Message ----- 
> From: "Patrick Aboyoun" <paboyoun at fhcrc.org> 
> To: "Michael Lawrence" <lawrence.michael at gene.com> 
> Cc: "BioC list" <bioconductor at stat.math.ethz.ch> 
> Sent: Saturday, March 27, 2010 7:39:16 PM GMT -06:00 US/Canada Central 
> Subject: Re: [BioC] Count differences between sequences 
> 
> I just added a Hamming distance metric to the stringDist function in BioC 2.6, which should be available on bioconductor.org in 36 hours. This uses the code underlying the neditStartingAt functions and loops over the lower triangle of the distance matrix in C so it is faster than the sapply method Herve provided. To calculate this newly added distance measure, specify stringDist(x, method = "hamming") where x is an XStringSet object containing equal length strings. library(Biostrings) library(drosophila2probe) ch0 <- drosophila2probe[seq_len(500*8000/25), "sequence"] ch <- unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <- DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <- stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029 5.541 sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats gra
phics grDevices utils datasets methods base other attached packages: [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Hervï¿½ Pagï¿½s > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging from your data, I would gather that you are not interested in >>> indels. Is that correct? You should look at the neditStartingAt function. >>> Something like the following may meet your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of the fact that neditStartingAt() is >> vectorized with respect to its second argument: >> >> N<- length(myStrings) >> myD
ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That will make things hundred times faster with a big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >> > W
ow, nice. It sounds to me like this would be a common use case; how hard > would it be to vectorize both arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re: [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could you provide more details on your data? How long are each of the >>>> strings and how many strings do you have? Also, do you need the entire N x N >>>> distance matrix for downstream analysis or are you just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29 PM, eri
kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the stringDist() function, but it is very slow >>>>> >>>>> >>>> for large DNAStringSets. Is there a way to quickly calculate the number >>>> of differences between two DNAString instances? >>>> >>>> >>>>> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>> >>>>> >>>> would like to know if their is a function other than stringDist() that >>>> will tell me the distance between them is 1. >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> - Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Searc
h the archives: >>>>> >>>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> -- >> Hervï¿½ Pagï¿½s >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informati
cs.conductor >> >> > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] 
> _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 	[[alternative HTML version deleted]]
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319