[R] how to group a large list of strings into categories based on string similarity?
Martin Morgan
mtmorgan at fhcrc.org
Sun Jun 27 16:10:28 CEST 2010
On 06/26/2010 11:42 AM, G FANG wrote:
> Hi Martin,
>
> Thanks a lot for your advice.
>
> I tried the process you suggested as below, it worked, but in a
> different way that I planned.
>
> library(Biostrings)
> x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
> "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
> "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
> "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
> "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
> "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
> "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
> names(x) <- seq_along(x)
> dna <- DNAStringSet(x)
> while (!all(width(dna) == width(dna <- trimLRPatterns("N", "N", dna)))) {}
> names(dna)[order(dna)[rank(dna, ties.method="min")]]
>
> The output is,
> "1" "2" "3" "4" "4" "6" "4", this is the right answer after trimining
> N's, i.e. without considering N, which strings are the same.
>
> But actually, the match I planned is position-to-position match, i.e.
> 1st and 2nd strings are the same except for the N's
>
> So, the expected output is 1 1 2 2 3 2 4
>
> Please advice.
what would you expect of
ACTG
ACTC
ACTN
? I'd guess an elegant solution would require a fancy data structure /
algorithm (suffix arrays, I guess), and that you'll end up doing
something ad-hoc, along the lines of
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
"ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
"CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
"CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
"NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
"CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
"NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
names(x) = seq_along(x)
dna = DNAStringSet(x)
wd <- unique(width(dna))
stopifnot(1 == length(wd))
nm <- names(dna)[order(dna)[rank(dna)]]
for (i in rev(seq_len(wd)[-1])) {
seq = narrow(dna, 1, i-1)
n = narrow(dna, i, len)
allN = alphabetFrequency(n, baseOnly=TRUE)[,"other"] == width(n)
nm[allN] <- names(seq)[order(seq)[rank(seq)]][allN]
}
which gives
> as.integer(factor(nm))
[1] 1 1 2 2 3 2 4
(this doesn't deal with the leading N's, but the logic might be the same).
>>> If you go this route you'll want to address
>>> questions to the Bioconductor mailing list
>>>
>>> http://bioconductor.org/docs/mailList.html
Martin
>
> Thanks!
>
> --gang
>
> On Wed, Jun 23, 2010 at 7:55 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 06/23/2010 07:46 PM, Martin Morgan wrote:
>>> On 06/23/2010 06:55 PM, G FANG wrote:
>>>> Hi,
>>>>
>>>> I want to group a large list (20 million) of strings into categories
>>>> based on string similarity?
>>>>
>>>> The specific problem is: given a list of DNA sequence as below
>>>>
>>>> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
>>>> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
>>>> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
>>>> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
>>>> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
>>>> ......
>>>> .....
>>>> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
>>>> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
>>>> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
>>>> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
>>>> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
>>>> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>>>>
>>>> 'N' is the missing letter
>>>>
>>>> It can be seen that some strings are the same except for those N's
>>>> (i.e. N can match with any base)
>>>>
>>>> given this list of string, I want to have
>>>>
>>>> 1) a vector corresponding to each row (string), for each string assign
>>>> an id, such that similar strings (those only differ at N's) have the
>>>> same id
>>>> 2) also get a mapping list from unique strings ('unique' in term of
>>>> the same similarity defined above) to the ids
>>>>
>>>> I am a matlab user shifting to R. Please advice on efficient ways to do this.
>>>
>>> The Bioconductor Biostrings package has many tools for this sort of
>>> operation. See http://bioconductor.org/packages/release/Software.html
>>>
>>> Maybe a one-time install
>>>
>>> source('http://bioconductor.org/biocLite.R')
>>> biocLite('Biostrings')
>>>
>>> then
>>>
>>> library(Biostrings)
>>> x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
>>> "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
>>> "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
>>> "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
>>> "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
>>> "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
>>> "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
>>> names(x) <- seq_along(x)
>>> dna <- DNAStringSet(x)
>>> while (!all(width(dna) ==
>>> width(dna <- trimLRPatterns("N", "N", dna)))) {}
>>> names(dna)[rank(dna)]
>>
>> oops, maybe closer to
>>
>> names(dna)[order(dna)[rank(dna, ties.method="min")]]
>>
>>> although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also,
>>> your sequences likely come from a fasta file (Biostrings::readFASTA) or
>>> a text file with a column of sequences (ShortRead::readXStringColumns)
>>> or from alignment software (ShortRead::readAligned /
>>> ShortRead::readFastq). If you go this route you'll want to address
>>> questions to the Bioconductor mailing list
>>>
>>> http://bioconductor.org/docs/mailList.html
>>>
>>> Martin
>>>
>>>> Thanks!
>>>>
>>>> Gang
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the R-help
mailing list