[R] how to group a large list of strings into categories based on string similarity?
Martin Morgan
mtmorgan at fhcrc.org
Thu Jun 24 04:46:30 CEST 2010
On 06/23/2010 06:55 PM, G FANG wrote:
> Hi,
>
> I want to group a large list (20 million) of strings into categories
> based on string similarity.
>
> The specific problem is: given a list of DNA sequences as below
>
> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
> ......
> .....
> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>
> 'N' is the missing letter
>
> It can be seen that some strings are the same except for those N's
> (i.e. N can match with any base)
>
> Given this list of strings, I want to have
>
> 1) a vector with one id per row (string), such that similar strings
> (those that differ only at N's) have the same id
> 2) a mapping list from unique strings ('unique' in terms of the
> similarity defined above) to the ids
>
> I am a MATLAB user shifting to R. Please advise on efficient ways to do this.

The Bioconductor Biostrings package has many tools for this sort of
operation. See http://bioconductor.org/packages/release/Software.html
Maybe a one-time install

  source('http://bioconductor.org/biocLite.R')
  biocLite('Biostrings')

then

  library(Biostrings)
  x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
  names(x) <- seq_along(x)
  dna <- DNAStringSet(x)
  while (!all(width(dna) ==
              width(dna <- trimLRPatterns("N", "N", dna)))) {}
  names(dna)[rank(dna)]

although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also,
your sequences likely come from a fasta file (Biostrings::readFASTA) or
a text file with a column of sequences (ShortRead::readXStringColumns)
or from alignment software (ShortRead::readAligned /
ShortRead::readFastq). If you go this route you'll want to address
questions to the Bioconductor mailing list
http://bioconductor.org/docs/mailList.html
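
For comparison, here is a minimal base-R sketch of the same trimming idea. It is not the Biostrings approach above: it assumes the N's occur only as leading or trailing runs, as in the example, and strips them with a regular expression in one pass. The names trimmed, key, id and mapping are made up for illustration. Like the loop above, it gives the same id only to strings that are identical after trimming; it does not merge a shorter fragment with a longer sequence that it would match position by position.

```r
# Example input copied from the Biostrings snippet above.
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
       "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
       "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")

# Assumption: N's appear only at the ends. Remove every leading and
# trailing run of N in a single vectorised pass.
trimmed <- gsub("^N+|N+$", "", x)

# 1) One integer id per input string; strings that are identical
#    after trimming share an id.
key <- unique(trimmed)
id <- match(trimmed, key)

# 2) Mapping from each unique trimmed string to its id.
mapping <- setNames(seq_along(key), key)

print(id)
print(mapping)
```

With 20 million strings, unique() and match() are hash-based and should scale roughly linearly, but memory use would be worth checking on a subset first.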
Martin
> Thanks!
>
> Gang
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793