[BioC] Sequence Distance matrix with large sequences
Hervé Pagès
hpages at fhcrc.org
Tue Oct 29 22:48:30 CET 2013
Hi Benjamin,
To compute the Hamming distance between 2 strings in Biostrings, you
can use neditAt():
> library(BSgenome.Scerevisiae.UCSC.sacCer2)
> library(BSgenome.Scerevisiae.UCSC.sacCer3)
> BSgenome.Scerevisiae.UCSC.sacCer2::Scerevisiae$chrVIII
562643-letter "DNAString" instance
seq: CCCACACACACCACACCCACACACCACACCCACACT...GTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGG
> BSgenome.Scerevisiae.UCSC.sacCer3::Scerevisiae$chrVIII
562643-letter "DNAString" instance
seq: CCCACACACACCACACCCACACACCACACCCACACT...GTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGTGG
Same length! Are the strings the same? We can answer this by computing
the number of mismatches:
> neditAt(BSgenome.Scerevisiae.UCSC.sacCer2::Scerevisiae$chrVIII,
          BSgenome.Scerevisiae.UCSC.sacCer3::Scerevisiae$chrVIII)
[1] 320140
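To make explicit what that number counts, here is a minimal base R sketch of the Hamming distance on short, plain character strings. The hamming() helper is a made-up name for illustration only; it is not a Biostrings function and says nothing about how neditAt() is implemented.

```r
# Hypothetical helper: count the positions at which two
# equal-length strings differ (the Hamming distance).
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))  # only defined for equal lengths
  sum(charToRaw(a) != charToRaw(b))
}

hd <- hamming("ACGTACGT", "ACGAACGA")

# The same byte-wise comparison also gives the mismatch positions.
mm <- which(charToRaw("ACGTACGT") != charToRaw("ACGAACGA"))
```

On these toy strings hd is 2 and mm is 4 8.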
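Note that neditAt() has a 'with.indels' argument that is set to FALSE by
default, which is why you get the Hamming distance. Be aware that using
'with.indels=TRUE' will generally not give you the edit (aka Levenshtein)
distance, because neditAt() then performs a global-local alignment: the
whole first string must be aligned, but not the whole second one, so the
result can be much smaller than the true edit distance: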
> neditAt("ACGACG", "ACGTACGTTT", with.indels=TRUE)
[1] 1
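For comparison, base R's adist() (from the utils package, not Biostrings) computes the global Levenshtein distance by default. A small sketch on the same two strings, meant only to illustrate the difference on toy input:

```r
# Global Levenshtein distance with base R (utils::adist).
d <- adist("ACGACG", "ACGTACGTTT")

# adist() returns a matrix; the single entry is the distance.
lev <- d[1, 1]
```

Here lev is 4, since four letters have to be inserted to turn "ACGACG" into "ACGTACGTTT".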
FWIW you could also have a look at stringDist() in Biostrings, or at the
stringdist package on CRAN, for computing the (Hamming or Levenshtein)
distance matrix of a collection of strings. I don't know how those tools
will scale with sequences hundreds of thousands of letters long, though...
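For a handful of aligned sequences of equal length, the idea behind a Hamming distance matrix, and behind finding the segregating sites asked about in the original question, can be sketched in base R. This is only an illustration on made-up toy data, not how stringDist() or the stringdist package work internally, and it has not been tried on genome-sized input.

```r
# Toy aligned sequences (made-up data), all of the same length.
seqs <- c(s1 = "ACGTACGT", s2 = "ACGAACGT", s3 = "TCGAACGA")

# One row per sequence, one column per alignment position.
m <- do.call(rbind, lapply(seqs, utf8ToInt))

# Pairwise Hamming distances: count differing columns per pair of rows.
n <- nrow(m)
hd <- matrix(0L, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    hd[i, j] <- sum(m[i, ] != m[j, ])
  }
}

# Segregating sites: columns holding more than one distinct letter.
seg <- which(apply(m, 2, function(col) length(unique(col)) > 1))
```

With this toy data, hd["s1", "s3"] is 3 and seg is 1 4 8.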
Cheers,
H.
On 10/25/2013 06:45 AM, Benjamin Ward (ENV) wrote:
> Hi,
>
> I've been using the DNAbin class and the dist.dna() function in a package I've been making to get a matrix of Hamming distances between DNA sequences in a multiple sequence alignment. I've done this with sequences hundreds of thousands of letters long, but I want to allow the use of sequences from genome data, i.e. Mbp long. I know there is a Biostrings package in the Bioconductor project that is supposed to store very big sequences efficiently. Can I do an equivalent job with Biostrings to get such distance information, and can I also identify all the SNPs in an alignment of these large sequences, i.e. the segregating sites? If so, how?
>
> Many Thanks,
> Ben.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319