[R] how to group a large list of strings into categories based on string similarity?

Thu Jun 24 04:55:44 CEST 2010

On 06/23/2010 07:46 PM, Martin Morgan wrote:
> On 06/23/2010 06:55 PM, G FANG wrote:
>> Hi,
>>
>> I want to group a large list (20 million) of strings into categories
>> based on string similarity.
>>
>> The specific problem is: given a list of DNA sequences as below
>>
>> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
>> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
>> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
>> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
>> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
>> ......
>> .....
>> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
>> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
>> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
>> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
>> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
>> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>>
>> 'N' is the missing letter
>>
>> It can be seen that some strings are the same except for those N's
>> (i.e. N can match with any base)
>>
>> Given this list of strings, I want to have
>>
>> 1) a vector with one id per row (string), such that similar strings
>>    (those that differ only at N's) have the same id
>> 2) a mapping list from unique strings ('unique' in terms of the
>>    similarity defined above) to the ids
>>
>> I am a MATLAB user shifting to R. Please advise on efficient ways to do this.
> 
> The Bioconductor Biostrings package has many tools for this sort of
> operation. See http://bioconductor.org/packages/release/Software.html
> 
> Maybe a one-time install
> 
>    source('http://bioconductor.org/biocLite.R')
>    biocLite('Biostrings')
> 
> then
> 
>   library(Biostrings)
>   x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
>         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
>         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
>         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
>         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
>         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
>         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
>   names(x) <- seq_along(x)
>   dna <- DNAStringSet(x)
>   while (!all(width(dna) ==
>               width(dna <- trimLRPatterns("N", "N", dna)))) {}
>   names(dna)[rank(dna)]

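oops, maybe closer to the following, which gives every sequence the name
of one representative of its group of identical trimmed sequences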

   names(dna)[order(dna)[rank(dna, ties.method="min")]]
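
For readers without Biostrings, the same trim-then-group idea can be
sketched in base R. This is only an illustration under assumptions: the
names trimmed, uniq, ids and mapping are chosen here and are not part of
any package, and, like the code above, it only groups strings that become
identical once leading and trailing N's are removed. It does not treat N
as a wildcard against a longer read.

```r
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
       "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
       "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")

# Strip every leading and trailing run of N's in one pass
trimmed <- gsub("^N+|N+$", "", x)

# 1) one integer id per input string; equal trimmed strings share an id
uniq <- unique(trimmed)
ids <- match(trimmed, uniq)

# 2) mapping from each unique trimmed string to its id
mapping <- setNames(seq_along(uniq), uniq)
```

Here strings 4, 5 and 7 all trim to the same sequence and so share one id,
while ids[i] and mapping[trimmed[i]] agree for every i.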

> although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also,
> your sequences likely come from a fasta file (Biostrings::readFASTA) or
> a text file with a column of sequences (ShortRead::readXStringColumns)
> or from alignment software (ShortRead::readAligned /
> ShortRead::readFastq). If you go this route you'll want to address
> questions to the Bioconductor mailing list
> 
>   http://bioconductor.org/docs/mailList.html
> 
> Martin
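
The "match 8, 4, 2, 1 N's" idea mentioned above can be sketched in base R
as follows. trim_n_chunked is a hypothetical helper written for this
illustration, not a Biostrings function; it removes flanking N's in
chunks of decreasing size rather than one at a time. Whether this is
actually faster than a single regular expression, or than the while loop
above, would have to be measured on real data.

```r
# Hypothetical helper: strip leading/trailing N's in chunks of 8, 4, 2, 1
trim_n_chunked <- function(x, sizes = c(8L, 4L, 2L, 1L)) {
  for (k in sizes) {
    # exactly k N's at the start, or exactly k N's at the end
    pattern <- sprintf("^N{%d}|N{%d}$", k, k)
    repeat {
      trimmed <- gsub(pattern, "", x)
      if (identical(trimmed, x)) break  # no chunk of this size is left
      x <- trimmed
    }
  }
  x
}

x <- c("ACGTNNNNNNNNNNNN", "NNNACGTN", "ACGT", "NNNNNNNNNNNNNNNNNNNNACGT")
res <- trim_n_chunked(x)
```

Only the largest chunk size can need more than one effective pass; each
smaller size removes at most one chunk per end.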
> 
>> Thanks!
>>
>> Gang
>>


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center