[R] how to group a large list of strings into categories based on string similarity?
Martin Morgan
mtmorgan at fhcrc.org
Thu Jun 24 04:55:44 CEST 2010
On 06/23/2010 07:46 PM, Martin Morgan wrote:
> On 06/23/2010 06:55 PM, G FANG wrote:
>> Hi,
>>
>> I want to group a large list (20 million) of strings into categories
>> based on string similarity.
>>
>> The specific problem is: given a list of DNA sequences as below
>>
>> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
>> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
>> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
>> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
>> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
>> ......
>> .....
>> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
>> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
>> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
>> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
>> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
>> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>>
>> 'N' is the missing letter
>>
>> It can be seen that some strings are the same except for those N's
>> (i.e. N can match with any base)
>>
>> Given this list of strings, I want to have
>>
>> 1) a vector with one id per row (string), such that similar strings
>> (those that differ only at N's) have the same id
>> 2) a mapping list from unique strings ('unique' in terms of the
>> similarity defined above) to the ids
>>
>> I am a matlab user shifting to R. Please advise on efficient ways to do this.
>
> The Bioconductor Biostrings package has many tools for this sort of
> operation. See http://bioconductor.org/packages/release/Software.html
>
> Maybe a one-time install
>
> source('http://bioconductor.org/biocLite.R')
> biocLite('Biostrings')
>
> then
>
> library(Biostrings)
> x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
>        "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
>        "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
>        "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
>        "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
>        "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
>        "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
> names(x) <- seq_along(x)
> dna <- DNAStringSet(x)
> while (!all(width(dna) ==
>             width(dna <- trimLRPatterns("N", "N", dna)))) {}
> names(dna)[rank(dna)]
oops, maybe closer to
names(dna)[order(dna)[rank(dna, ties.method="min")]]
> although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also,
> your sequences likely come from a fasta file (Biostrings::readFASTA) or
> a text file with a column of sequences (ShortRead::readXStringColumns)
> or from alignment software (ShortRead::readAligned /
> ShortRead::readFastq). If you go this route you'll want to address
> questions to the Bioconductor mailing list
>
> http://bioconductor.org/docs/mailList.html
>
> Martin
>
>> Thanks!
>>
>> Gang
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
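For readers without Biostrings installed, the trim-then-group idea quoted above can be roughly sketched in base R. This is only an illustration under assumptions, not the Biostrings approach itself: the variable names (trimmed, keys, id, mapping) are made up for the example, and a single regular expression stands in for the repeated trimLRPatterns() calls. Like the quoted code, it only groups reads that are identical once leading and trailing N's are removed; it does not treat a shorter trimmed read as matching a longer one.

```r
# Example reads copied from the quoted reply
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
       "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
       "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")

# Strip runs of N from both ends of every read in one pass
trimmed <- gsub("^N+|N+$", "", x)

# Distinct trimmed reads, in order of first appearance
keys <- unique(trimmed)

# 1) one integer id per input read; equal trimmed reads share an id
id <- match(trimmed, keys)

# 2) mapping from each distinct trimmed read to its id
mapping <- setNames(seq_along(keys), keys)
```

Here id has one entry per input string and mapping is a named integer vector keyed by the trimmed sequence. Whether this scales comfortably to 20 million reads has not been checked here.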
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793