[BioC] Sequence Distance matrix with large sequences

Hervé Pagès hpages at fhcrc.org
Tue Oct 29 22:48:30 CET 2013


Hi Benjamin,

To compute the Hamming distance between 2 strings in Biostrings, you
can use neditAt():

   > library(BSgenome.Scerevisiae.UCSC.sacCer2)
   > library(BSgenome.Scerevisiae.UCSC.sacCer3)
   > BSgenome.Scerevisiae.UCSC.sacCer2::Scerevisiae$chrVIII
     562643-letter "DNAString" instance
   seq: 
CCCACACACACCACACCCACACACCACACCCACACT...GTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGG
   > BSgenome.Scerevisiae.UCSC.sacCer3::Scerevisiae$chrVIII
     562643-letter "DNAString" instance
   seq: 
CCCACACACACCACACCCACACACCACACCCACACT...GTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGG

Same length! Are the strings the same? We can answer this by computing
the number of mismatches:

   > neditAt(BSgenome.Scerevisiae.UCSC.sacCer2::Scerevisiae$chrVIII,
             BSgenome.Scerevisiae.UCSC.sacCer3::Scerevisiae$chrVIII)
   [1] 320140

Note that neditAt() has a 'with.indels' arg that is set to FALSE by
default, hence you get the Hamming distance. Be aware that using
'with.indels=TRUE' to compute the edit (aka Levenshtein) distance
would generally not work because neditAt() performs a global-local
alignment:

   > neditAt("ACGACG", "ACGTACGTTT", with.indels=TRUE)
   [1] 1

FWIW you could also have a look at stringDist() in Biostrings, or at the
stringdist package on CRAN for computing the (Hamming or Levenshtein)
distance matrix of a collection of strings, Don't know how those tools
will scale with sequences hundreds of thousands long though...

Cheers,
H.


On 10/25/2013 06:45 AM, Benjamin Ward (ENV) wrote:
> Hi,
>
> I've been using the DNAbin class and the dist.dna() function in a package I've been making to get a matrix of hamming distances between DNA sequences in a multiple sequence alignment. I've done this with sequences hundreds of thousands long but want to allow the capability to use sequences from genome data i.e. Mbp long. I know there is a Biostring package in the Bioconductor project that is supposed to store very big sequences effectively. Can I do an equivalent job with Bio-strings yielding me such distance information, and can I also identify all the SNPs in an alignment with these large sequences i.e. the segregating sites? If so how?
>
> Many Thanks,
> Ben.
>
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list