[BioC] Count differences between sequences
Hervé Pagès
hpages at fhcrc.org
Tue Mar 30 02:23:52 CEST 2010
Erik,
This is much faster:
countLeadingLetter2 <- function(x, letter="-")
{
ans <- which(!isMatchingAt(letter, x, seq_len(length(x))))[1L]
if (is.na(ans))
return(length(x))
ans - 1L
}
countTrailingLetter2 <- function(x, letter="-")
countLeadingLetter2(reverse(x), letter=letter)
Cheers,
H.
Hervé Pagès wrote:
> Erik Wright wrote:
>> Hi everyone,
>>
>> Since I don't want to include terminal gaps into my distance
>> calculation, is there a function that will quickly return the number
>> of leading gaps and/or trailing gaps?
>> For example,
>> seq1 <- DNAString("--AT-CG-")
>> number of leading terminal gaps = 2
>> number of trailing terminal gaps = 1
>
> Maybe not the most efficient way to do this but the trimLRPatterns()
> function can be used for this:
>
> countLeadingLetter <- function(x, letter="-")
> {
> Lpattern <- DNAString(paste(rep.int("-", length(x)), collapse=""))
> rg <- trimLRPatterns(Lpattern=Lpattern , subject=x, ranges=TRUE)
> start(rg) - 1L
> }
>
> countTrailingLetter <- function(x, letter="-")
> {
> Rpattern <- DNAString(paste(rep.int("-", length(x)), collapse=""))
> rg <- trimLRPatterns(Rpattern=Rpattern , subject=x, ranges=TRUE)
> length(x) - end(rg)
> }
>
> Note that when using this in the context of your distance
> calculation you need to be careful that adding
> countLeadingLetter() + countTrailingLetter()
> won't be correct when the string contains only gaps (very
> unlikely to happen but still).
>
> Also in this context, since the left and right fake patterns
> are always going to be the same, it would be better to generate
> them once for all before starting the big "for" loop.
>
> I didn't do any timing so please let us know if this is too slow.
>
> Cheers,
> H.
>
>>
>> Thanks again!,
>> Erik
>>
>>
>> On Mar 29, 2010, at 5:39 PM, Hervé Pagès wrote:
>>
>>> Hi Erik,
>>>
>>> To address your "gap problem", one solution is to replace the "-" in
>>> one of the 2 strings to compare with a letter that is guaranteed not
>>> to be in the other string, e.g. the "+" letter (yes, this is part of
>>> the DNA alphabet, and if it was not, you could always work with
>>> BStringSet objects). This can be done with chartr():
>>>
>>> numF <- length(myDNAStringSet)
>>>
>>> distMatrix <- matrix(integer(numF*numF), nrow=numF)
>>>
>>> for (i in 1:(numF-1)) {
>>> pattern <- chartr("-", "+", myDNAStringSet[[i]])
>>> # use IUPAC_CODE_MAP with fixed=FALSE
>>> distMatrix[i, (i+1):numF] <- neditStartingAt(
>>> pattern,
>>> myDNAStringSet[(i+1):numF],
>>> fixed=FALSE)
>>> distMatrix[(i+1):numF, i] <- distMatrix[i, (i+1):numF]
>>> }
>>>
>>> diag(distMatrix) <- 0L # I think you want 0 here, not 1
>>>
>>> Takes about 15 sec. on my machine for your 500x8000 DNA rectangle.
>>>
>>> Note that I'm using an integer matrix which is twice smaller than a
>>> numeric/double matrix.
>>>
>>> You could maybe get a small speed improvement by doing this replacement
>>> once for all at the DNAStringSet level before you start the loop with:
>>>
>>> myDNAStringSet2 <- chartr("-", "+", myDNAStringSet)
>>>
>>> and by adapting the code in the loop but I don't expect this to make a
>>> big difference.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> erikwright at comcast.net wrote:
>>>> Hi Patrick, Michael, Hervé, and Martin, Wow! Thanks very much
>>>> everyone for putting so much effort into answering my question.
>>>> While we wait for Patrick's update of stringDist to become
>>>> available, I will describe my solution and a related problem I have
>>>> run into. I am using this piece of code to do most of the work: numF
>>>> <- length ( myDNAStringSet ) for ( i in 1 :( numF - 1 )) { # use
>>>> IUPAC_CODE_MAP with fixed=FALSE distMatrix [ i ,( i + 1 ): numF ] <-
>>>> neditStartingAt ( myDNAStringSet [[ i ]], myDNAStringSet [( i + 1 ):
>>>> numF ], fixed = FALSE ) distMatrix [( i + 1 ): numF , i ] <-
>>>> distMatrix [ i ,( i + 1 ): numF ] } diag ( distMatrix ) <- 1 This
>>>> will populate a matrix with the number of differences between
>>>> strings. The code works reasonably fast - about 15 seconds for a
>>>> DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with
>>>> 8,000 sequences of length 8,000 is exponentially slower - about 45
>>>> minutes. Obviously, if Patrick's update of stringDist improves the spe
> ed it will greatly help. In playing around with this I have run into a
> related problem that you all might be able to help me with. As I stated
> in my original email, I am trying to calculate a Distance Matrix with a
> large aligned DNAStringSet. My DNAStringSet does not contain ideal data,
> meaning there are many gap characters ("-"). If I find the edit
> difference between two strings that have gaps in the same places then
> the gaps are not counted towards the edit distance. This makes complete
> sense because "-" == "-". My problem is that I want to include any gap
> characters in my edit distance. For example:
>>>>> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y)
>>>> [1] 0 The distance between these two strings for my distance
>>>> measurement should be 2. That is because the gap character ("-")
>>>> gives me no data, so I cannot say that the distance between the two
>>>> strings is 0. Can anyone think of a good method of counting common
>>>> gap characters toward the edit distance? Unfortunately I cannot
>>>> simply count up the number of gap characters in my pattern and add
>>>> it to my edit distance. For example:
>>>>> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y)
>>>> [1] 1 The distance between these two strings for my distance
>>>> measurement should be 2. If I were to add the number of gap
>>>> characters in the pattern to the edit distance I would get 2 + 1 =
>>>> 3. So basically I want to count any gap character in the pattern or
>>>> subject toward the edit distance because it gives me no information.
>>>> Does this make sense, and does anyone have an idea for how to do
>>>> this? Thanks again, Erik ----- Original Message ----- From: "Patrick
>>>> Aboyoun" <paboyoun at fhcrc.org> To: "Michael Lawrence"
>>>> <lawrence.michael at gene.com> Cc: "BioC list"
>>>> <bioconductor at stat.math.ethz.ch> Sent: Saturday, March 27, 2010
>>>> 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count
>>>> differences between sequences I just added a Hamming distance metric
>>>> to the stringDist function in BioC 2.6, which should be available on
>>>> bioconductor.org in 36 hours. This uses the code underlying the
>>>> neditStartingAt functions and loops over the lower triangle of the
>>>> distance matrix in C so it is f
> aster than the sapply method Herve provided. To calculate this newly
> added distance measure, specify stringDist(x, method = "hamming") where
> x is an XStringSet object containing equal length strings.
> library(Biostrings) library(drosophila2probe) ch0 <-
> drosophila2probe[seq_len(500*8000/25), "sequence"] ch <-
> unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <-
> DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <-
> stringDist(dna, method = "hamming")) # user system elapsed # 5.459 0.029
> 5.541 sessionInfo() R version 2.11.0 Under development (unstable)
> (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1]
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base
> packages: [1] stats gra
>>> phics grDevices utils datasets methods base other attached packages:
>>> [1] drosophila2probe_2.5.0 AnnotationDbi_1.9.6 Biobase_2.7.5 [4]
>>> Biostrings_2.15.26 IRanges_1.5.72 loaded via a namespace (and not
>>> attached): [1] DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On
>>> 3/26/10 1:35 PM, Michael Lawrence wrote: > 2010/3/26 Herv� Pag�s
>>> > > >> Hi Erik, Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>>
>>> Erik, >>> Judging from your data, I would gather that you are not
>>> interested in >>> indels. Is that correct? You should look at the
>>> neditStartingAt function. >>> Something like the following may meet
>>> your needs: >>> >>> N<- length(myStrings) >>> myDists<- matrix(0,
>>> nrow = N, ncol = N) >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>>
>>> myDists[i, j]<- myDists[j, i]<- neditStartingAt(myStrings[[i]], >>>
>>> myStrings[[j]]) >>> >>> >>> >> Note that you can take advantage of
>>> the fact that neditStartingAt() is >> vectorized with respect to its
>>> second argument: >> >> N<- length(myStrings) >>
> myD
>>> ists<- sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]],
>>> myStrings))) >> >> That will make things hundred times faster with a
>>> big "DNA rectangle" >> like yours (500x8000): >> >> >>> system.time(
>>> >>> >> myDists<- sapply(1:N, >> function(i)
>>> neditStartingAt(myStrings[[i]], myStrings))) >> user system elapsed
>>> >> 12.053 0.000 12.723 >> >> >>> myDists[1:10, 1:10] >>> >> [,1] [,2]
>>> [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] >> [1,] 0 5888 6055 5947
>>> 6152 6248 6038 6175 6268 6047 >> [2,] 5888 0 5849 6113 5926 6148 6117
>>> 5956 6167 6204 >> [3,] 6055 5849 0 6053 6184 5959 6137 6077 5997 6041
>>> >> [4,] 5947 6113 6053 0 6111 6167 5910 6096 6121 5959 >> [5,] 6152
>>> 5926 6184 6111 0 6038 6209 5906 6019 6194 >> [6,] 6248 6148 5959 6167
>>> 6038 0 6085 6112 5924 6137 >> [7,] 6038 6117 6137 5910 6209 6085 0
>>> 5961 6192 5947 >> [8,] 6175 5956 6077 6096 5906 6112 5961 0 5899 6183
>>> >> [9,] 6268 6167 5997 6121 6019 5924 6192 5899 0 5984 >> [10,] 6047
>>> 6204 6041 5959 6194 6137 5947 6183 5984 0 >> >> >>
>> W
>>> ow, nice. It sounds to me like this would be a common use case; how
>>> hard > would it be to vectorize both arguments? > > Michael > > > >>
>>> Cheers, >> H. >> >> >> >> >>> Patrick >>> >>> >>> On 3/25/10 2:57 PM,
>>> erikwright at comcast.net wrote: >>> >>> >>>> I have 500 DNAStrings, all
>>> of length 8000. I need the entire N x N >>>> distance matrix. >>>>
>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>> >>>> ----- Original Message
>>> ----- >>>> From: "Patrick Aboyoun" >>>> To: erikwright at comcast.net
>>> >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Sent: Thursday, March
>>> 25, 2010 4:45:29 PM GMT -06:00 US/Canada Central >>>> Subject: Re:
>>> [BioC] Count differences between sequences >>>> >>>> Erik, >>>> Could
>>> you provide more details on your data? How long are each of the >>>>
>>> strings and how many strings do you have? Also, do you need the
>>> entire N x N >>>> distance matrix for downstream analysis or are you
>>> just looking for closest >>>> relatives? >>>> >>>> >>>> Patrick >>>>
>>> >>>> >>>> >>>> On 3/25/10 2:29 PM,
> eri
>>> kwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>> >>>>>
>>> >>>>> I have a large DNAStringSet and I am trying to calculate its
>>> >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width and
>>> they are already >>>> aligned. >>>> >>>> >>>>> I have tried using the
>>> stringDist() function, but it is very slow >>>>> >>>>> >>>> for large
>>> DNAStringSets. Is there a way to quickly calculate the number >>>> of
>>> differences between two DNAString instances? >>>> >>>> >>>>> For
>>> example, let's say I have two DNAStrings: "ACAC" and "ACAG". I >>>>>
>>> >>>>> >>>> would like to know if their is a function other than
>>> stringDist() that >>>> will tell me the distance between them is 1.
>>> >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> -
>>> Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>>
>>> _______________________________________________ >>>>> Bioconductor
>>> mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Se
> arc
>>> h the archives: >>>>> >>>>> >>>>
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>>
>>> _______________________________________________ >>> Bioconductor
>>> mailing list >>> Bioconductor at stat.math.ethz.ch >>>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the
>>> archives: >>>
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>
>>> >>> >> -- >> Herv� Pag�s >> >> Program in Computational Biology
>>> >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer
>>> Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >>
>>> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206)
>>> 667-5791 >> Fax: (206) 667-1319 >> >> >>
>>> _______________________________________________ >> Bioconductor
>>> mailing list >> Bioconductor at stat.math.ethz.ch >>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the
>>> archives: >> http://news.gmane.org/gmane.science.biology.inform
> ati
>>> cs.conductor >> >> > [[alternative HTML version deleted]] > > > > >
>>> _______________________________________________ > Bioconductor
>>> mailing list > Bioconductor at stat.math.ethz.ch >
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the
>>> archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> [[alternative HTML version deleted]]
>>>> _______________________________________________ Bioconductor mailing
>>>> list Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>>> archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> [[alternative HTML version deleted]]
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list