[R] Comparing/diffing strings

Hadley Wickham hadley at rice.edu
Wed Aug 25 02:34:27 CEST 2010


On Tue, Aug 24, 2010 at 11:25 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 08/24/2010 07:27 AM, Doran, Harold wrote:
>> There is the stringMatch function in the MiscPsycho package.
>>
>>> stringMatch('Hadley', 'Hadley Wickham', normalize = 'no')
>> [1] 8
>>> stringMatch('Hadley', 'Hadley Wickham', normalize = 'yes')
>> [1] 0.4285714
>>
>> It uses the Levenshtein distance to measure how much the two strings differ, either normalized or not. In the first call above, the first string differs from the second by 8 insertions/deletions/substitutions. The second call normalizes the comparison so that 1 denotes perfect agreement and 0 denotes no agreement.
>>
>> Examples of an exact match are below.
>>
>>> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'yes')
>> [1] 1
>>> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'n')
>> [1] 0
>
> You're probably looking for something lighter weight, but Bioconductor
> Biostrings has pairwiseAlignment.
>
>> library(Biostrings)
>> pairwiseAlignment("Hadley Wickham", "Hadley Hamwick")
> Global PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] Hadley W---ick
> subject: [1] Hadley Hamwick
> score: 29.5102
>
>> pairwiseAlignment("Hadley Hamwick", "Hadley Wickham")
> Global PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] Hadley Hamwick
> subject: [1] Hadley W---ick
> score: 29.5102
>
>> aln <- pairwiseAlignment("Hadley Hamwick", "Haderley Hamwich")
>> consensusMatrix(aln)["-",]
>  [1] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
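
For anyone wanting to cross-check the Levenshtein numbers quoted above without installing a package, here is a minimal sketch using adist() from the utils package in current versions of R. The helpers lev() and lev_sim() are hypothetical names introduced here, and the normalization (1 minus distance over the length of the longer string) is an assumption chosen because it reproduces the values quoted above, not necessarily how stringMatch computes them.

```r
# Levenshtein (generalized edit) distance via utils::adist().
# adist() returns a matrix, so as.vector() reduces the 1 x 1 result to a number.
lev <- function(a, b) as.vector(adist(a, b))

# Hypothetical helper: similarity in [0, 1], where 1 means identical strings.
# Assumed normalization: 1 - distance / length of the longer string.
lev_sim <- function(a, b) 1 - lev(a, b) / max(nchar(a), nchar(b))

lev("Hadley", "Hadley Wickham")              # 8 insertions are needed
lev_sim("Hadley", "Hadley Wickham")          # 1 - 8/14
lev_sim("Hadley Wickham", "Hadley Wickham")  # identical strings give 1
```

Note that adist() is case-sensitive by default (ignore.case = FALSE), so results may differ from other implementations on mixed-case input.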

Thanks all for the suggestions!

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/


