[R] Comparing/diffing strings
Martin Morgan
mtmorgan at fhcrc.org
Tue Aug 24 18:25:01 CEST 2010
On 08/24/2010 07:27 AM, Doran, Harold wrote:
> There is the stringMatch function in the MiscPsycho package.
>
>> stringMatch('Hadley', 'Hadley Wickham', normalize = 'no')
> [1] 8
>> stringMatch('Hadley', 'Hadley Wickham', normalize = 'yes')
> [1] 0.4285714
>
> It uses Levenshtein distance to tell you how much the two strings differ, either normalized or not. The first result above says the first string differs from the second by 8 insertions/deletions/substitutions. The second normalizes the comparison so that 1 denotes perfect agreement and 0 denotes no agreement.
>
> Examples of an exact match are below.
>
>> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'yes')
> [1] 1
>> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'n')
> [1] 0
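
For readers without MiscPsycho installed, the two numbers quoted above can be reproduced in base R. This is only a sketch: adist() from the utils package computes a generalized Levenshtein distance, and the normalization used here (1 minus the distance divided by the length of the longer string) is an assumption that happens to match the quoted 0.4285714, not a statement of how stringMatch is implemented.

```r
# Base R sketch, not the MiscPsycho implementation.
a <- "Hadley"
b <- "Hadley Wickham"

# adist() returns a 1 x 1 matrix here; drop() reduces it to a plain number.
d <- drop(adist(a, b))

# Assumed normalization: 1 - distance / length of the longer string.
sim <- 1 - d / max(nchar(a), nchar(b))

print(d)    # [1] 8
print(sim)  # [1] 0.4285714
```

The 8 edits are the eight characters of " Wickham" that have to be inserted, and 1 - 8/14 gives the normalized value.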
You're probably looking for something lighter weight, but the Bioconductor
package Biostrings has pairwiseAlignment.
> library(Biostrings)
> pairwiseAlignment("Hadley Wickham", "Hadley Hamwick")
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] Hadley W---ick
subject: [1] Hadley Hamwick
score: 29.5102
> pairwiseAlignment("Hadley Hamwick", "Hadley Wickham")
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] Hadley Hamwick
subject: [1] Hadley W---ick
score: 29.5102
> aln <- pairwiseAlignment("Hadley Hamwick", "Haderley Hamwich")
> consensusMatrix(aln)["-",]
[1] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
Martin
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Hadley Wickham
> Sent: Tuesday, August 24, 2010 10:17 AM
> To: R-help
> Subject: [R] Comparing/diffing strings
>
> Hi all,
>
> all.equal is generally very useful when you want to find the
> differences between two objects. It breaks down however, when you
> have two long strings to compare:
>
>> all.equal(a, b)
> [1] "1 string mismatch"
>
> Does anyone know of any good text diffing tools implemented in R?
>
> Thanks,
>
> Hadley
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793