比较多个单词名称与 Levenshtein 距离

Mos*_*she 5 string algorithm objective-c levenshtein-distance

我正在将校园内的建筑物名称与来自各种数据库的输入进行比较。人们输入这些名称,每个人都使用自己的缩写方案。我试图从用户输入到名称的规范形式中找到最佳匹配。

我已经实现了一个递归的 Levenshtein 距离方法,但是我正在尝试解决一些边缘情况。我的实现在 GitHub 上

一些建筑物的名称是一个词,而另一些则是两个词。一个词上的一个词会产生相当准确的结果,但我需要记住两件事。

  1. 缩写:假设输入是名称的缩写版本,有时我可以在输入和任意名称之间获得相同的 Levenshtein 距离,以及正确的名称。例如,如果我的输入是“ Ing”并且建筑物名称1. are ["Boylan", "Ingersoll", "Whitman", "Whitehead", "Roosevelt", and "Library"],我最终得到 6 的 LDBoylanIngersoll。想要的结果就在这里Ingersoll

  2. 多字字符串:第二个边缘情况是输入和/或输出是两个单词,用空格分隔。例如,New Ing是 的缩写New Ingersoll。在这种情况下,New Ingersoll 和 Boylan 的 Levenshtein 距离得分均为 6。如果我要拆分字符串,则完美New匹配New,然后我只需要参考我之前的边缘情况的解决方案。

处理这两种边缘情况的最佳方法是什么?

1. 对于好奇的人来说,这些是纽约市布鲁克林学院的建筑。

Jos*_*ell 6

我认为你应该使用最长公共子序列的长度而不是编辑距离。对于您的情况来说,这似乎是一个更好的指标。本质上,正如我在评论中所建议的,它优先考虑插入和删除而不是替换。

\n\n

它清楚地区分了“Ing”->“Ingersoll”和“Ing”->“Boylan”(得分为 3 和 1),处理空格没有问题(“New Ing”->“New Ingersoll”得分 7,其中“New Ing” ” -> “Boylan” 再次得分 1),并且如果您遇到像“Ingsl”这样的缩写,也能很好地工作。

\n\n

该算法很简单。如果两个字符串的长度为mn,则按字符比较字符串的连续前缀(从空前缀开始),将分数保留在大小为m+1, n+1的矩阵中。如果特定对匹配,则将前两个前缀的分数加一(矩阵中向上一行,留下一列);否则保留这些前缀的两个分数中的最高分数(比较紧邻上面的条目和紧邻左边的条目并取最好的)。当您检查完两个字符串后,分数矩阵中的最后一项就是 LCS 的长度。

\n\n

“Ingsll”和“Ingersoll”的分数矩阵示例:

\n\n
\n 0 1 2 3 4 5 6\n ngsll\n --------------\n0 | 0 0 0 0 0 0 0\n1 我 | 0 1 1 1 1 1 1\n2 n | 0 1 1 1 1 1 1\n2 0 1 2 2 2 2 2\n3 克 | 0 1 2 3 3 3 3\n4 e | 0 1 2 3 3 3 3\n5 r | 0 1 2 3 3 3 3\n6 秒 | 0 1 2 3 4 4 4\n7 o | 0 1 2 3 4 4 4\n8 l | 0 1 2 3 4 5 5\n9 l | 0 1 2 3 4 5 6\n
\n\n

这是长度的 ObjC 实现。这里的大部分复杂性只是由于想要@"o\xcc\xb6"正确处理组合字符序列(例如)。

\n\n
#import <Foundation/Foundation.h>\n\n@interface NSString (WSSComposedLength)\n\n- (NSUInteger)WSSComposedLength;\n\n@end\n\n@implementation NSString (WSSComposedLength)\n\n- (NSUInteger)WSSComposedLength\n{\n    __block NSUInteger length = 0;\n    [self enumerateSubstringsInRange:(NSRange){0, [self length]}\n                             options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired\n                          usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {\n                              length++;\n                          }];\n\n    return length;\n}\n\n@end\n\n\n@interface NSString (WSSLongestCommonSubsequence)\n\n- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;\n- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;\n\n@end\n\n@implementation NSString (WSSLongestCommonSubsequence)\n\n- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target\n{\n    NSUInteger * const * scores;\n    scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];\n\n    return scores[[target WSSComposedLength]][[self WSSComposedLength]];\n}\n\n- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target\n{\n    NSUInteger * const * scores;\n    scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];\n\n    //FIXME: Implement this.\n\n    return nil;\n}\n\n- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{\n\n    NSUInteger selfLength = [self WSSComposedLength];\n    NSUInteger targetLength = [target WSSComposedLength];\n    NSMutableData * scoresData = [NSMutableData dataWithLength:(targetLength + 1) * sizeof(NSUInteger *)];\n    NSUInteger ** scores = [scoresData mutableBytes];\n\n    for( NSUInteger i = 0; i <= targetLength; i++ ){\n        scores[i] = [[NSMutableData dataWithLength:(selfLength + 1) * sizeof(NSUInteger)] mutableBytes];\n    }\n\n    /* Ranges in the enumeration Block are the same measure as\n     * -[NSString length] -- i.e., 16-bit code units -- as opposed to\n     * _composed_ length, which counts code points. Thus:\n     *\n     * Enumeration will miss the last character if composed length is used\n     * as the range and there\'s a substring range with length greater than one.\n     */\n    NSRange selfFullRange = (NSRange){0, [self length]};\n    NSRange targetFullRange = (NSRange){0, [target length]};\n    /* Have to keep track of these indexes by hand, rather than using the\n     * Block\'s substringRange.location because, e.g., @"o\xcc\xb6", will have\n     * range {x, 2}, and the next substring will be {x+2, l}.\n     */\n    __block NSUInteger col = 0;\n    __block NSUInteger row = 0;\n    [target enumerateSubstringsInRange:targetFullRange\n                             options:NSStringEnumerationByComposedCharacterSequences\n                          usingBlock:^(NSString * targetSubstring,\n                                       NSRange targetSubstringRange,\n                                       NSRange _, BOOL * _0)\n        {\n            row++;\n            col = 0;\n\n            [self enumerateSubstringsInRange:selfFullRange\n                                     options:NSStringEnumerationByComposedCharacterSequences\n                                  usingBlock:^(NSString * selfSubstring,\n                                               NSRange selfSubstringRange,\n                                               NSRange _, BOOL * _0)\n                {\n                    col++;\n                    NSUInteger newScore;\n                    if( [selfSubstring isEqualToString:targetSubstring] ){\n\n                        newScore = 1 + scores[row - 1][col - 1];\n                    }\n                    else {\n                        NSUInteger upperScore = scores[row - 1][col];\n                        NSUInteger leftScore = scores[row][col - 1];\n                        newScore = MAX(upperScore, leftScore);\n                    }\n\n                    scores[row][col] = newScore;\n                }];\n        }];\n\n    return scoresData;\n}\n\n@end\n\nint main(int argc, const char * argv[])\n{\n\n    @autoreleasepool {\n\n        NSArray * testItems = @[@{@"source" : @"Ingso\xcc\xb6ll",\n                                  @"targets": @[\n                                    @{@"string"   : @"Ingersoll",\n                                      @"score"    : @6,\n                                      @"sequence" : @"Ingsll"},\n                                    @{@"string"   : @"Boylan",\n                                      @"score"    : @1,\n                                      @"sequence" : @"n"},\n                                    @{@"string"   : @"New Ingersoll",\n                                      @"score"    : @6,\n                                      @"sequence" : @"Ingsll"}]},\n                                @{@"source" : @"Ing",\n                                  @"targets": @[\n                                         @{@"string"   : @"Ingersoll",\n                                           @"score"    : @3,\n                                           @"sequence" : @"Ing"},\n                                         @{@"string"   : @"Boylan",\n                                           @"score"    : @1,\n                                           @"sequence" : @"n"},\n                                         @{@"string"   : @"New Ingersoll",\n                                           @"score"    : @3,\n                                           @"sequence" : @"Ing"}]},\n                                @{@"source" : @"New Ing",\n                                  @"targets": @[\n                                         @{@"string"   : @"Ingersoll",\n                                           @"score"    : @3,\n                                           @"sequence" : @"Ing"},\n                                         @{@"string"   : @"Boylan",\n                                           @"score"    : @1,\n                                           @"sequence" : @"n"},\n                                         @{@"string"   : @"New Ingersoll",\n                                           @"score"    : @7,\n                                           @"sequence" : @"New Ing"}]}];\n\n        for( NSDictionary * item in testItems ){\n            NSString * source = item[@"source"];\n            for( NSDictionary * target in item[@"targets"] ){\n                NSString * targetString = target[@"string"];\n                NSCAssert([target[@"score"] integerValue] ==\n                           [source WSSLengthOfLongestCommonSubsequenceWithString:targetString],\n                          @"");\n//                NSCAssert([target[@"sequence"] isEqualToString:\n//                           [source longestCommonSubsequenceWithString:targetString]],\n//                          @"");\n            }\n        }\n\n\n    }\n    return 0;\n}\n
Run Code Online (Sandbox Code Playgroud)\n