Mos*_*she 5 string algorithm objective-c levenshtein-distance
我正在将校园内的建筑物名称与来自各种数据库的输入进行比较。人们输入这些名称,每个人都使用自己的缩写方案。我试图从用户输入到名称的规范形式中找到最佳匹配。
我已经实现了一个递归的 Levenshtein 距离方法,但是我正在尝试解决一些边缘情况。我的实现在 GitHub 上。
一些建筑物的名称是一个词,而另一些则是两个词。一个词上的一个词会产生相当准确的结果,但我需要记住两件事。
缩写:假设输入是名称的缩写版本,有时我可以在输入和任意名称之间获得相同的 Levenshtein 距离,以及正确的名称。例如,如果我的输入是“ Ing
”并且建筑物名称1. are ["Boylan", "Ingersoll", "Whitman", "Whitehead", "Roosevelt", and "Library"]
,我最终得到 6 的 LDBoylan
和Ingersoll
。想要的结果就在这里Ingersoll
。
多字字符串:第二个边缘情况是输入和/或输出是两个单词,用空格分隔。例如,New Ing
是 的缩写New Ingersoll
。在这种情况下,New Ingersoll 和 Boylan 的 Levenshtein 距离得分均为 6。如果我要拆分字符串,则完美New
匹配New
,然后我只需要参考我之前的边缘情况的解决方案。
处理这两种边缘情况的最佳方法是什么?
1. 对于好奇的人来说,这些是纽约市布鲁克林学院的建筑。
我认为你应该使用最长公共子序列的长度而不是编辑距离。对于您的情况来说,这似乎是一个更好的指标。本质上,正如我在评论中所建议的,它优先考虑插入和删除而不是替换。
\n\n它清楚地区分了“Ing”->“Ingersoll”和“Ing”->“Boylan”(得分为 3 和 1),处理空格没有问题(“New Ing”->“New Ingersoll”得分 7,其中“New Ing” ” -> “Boylan” 再次得分 1),并且如果您遇到像“Ingsl”这样的缩写,也能很好地工作。
\n\n该算法很简单。如果两个字符串的长度为m和n,则按字符比较字符串的连续前缀(从空前缀开始),将分数保留在大小为m+1, n+1的矩阵中。如果特定对匹配,则将前两个前缀的分数加一(矩阵中向上一行,留下一列);否则保留这些前缀的两个分数中的最高分数(比较紧邻上面的条目和紧邻左边的条目并取最好的)。当您检查完两个字符串后,分数矩阵中的最后一项就是 LCS 的长度。
\n\n“Ingsll”和“Ingersoll”的分数矩阵示例:
\n\n\n 0 1 2 3 4 5 6\n ngsll\n --------------\n0 | 0 0 0 0 0 0 0\n1 我 | 0 1 1 1 1 1 1\n2 n | 0 1 1 1 1 1 1\n2 0 1 2 2 2 2 2\n3 克 | 0 1 2 3 3 3 3\n4 e | 0 1 2 3 3 3 3\n5 r | 0 1 2 3 3 3 3\n6 秒 | 0 1 2 3 4 4 4\n7 o | 0 1 2 3 4 4 4\n8 l | 0 1 2 3 4 5 5\n9 l | 0 1 2 3 4 5 6\n\n\n
这是长度的 ObjC 实现。这里的大部分复杂性只是由于想要@"o\xcc\xb6"
正确处理组合字符序列(例如)。
#import <Foundation/Foundation.h>\n\n@interface NSString (WSSComposedLength)\n\n- (NSUInteger)WSSComposedLength;\n\n@end\n\n@implementation NSString (WSSComposedLength)\n\n- (NSUInteger)WSSComposedLength\n{\n __block NSUInteger length = 0;\n [self enumerateSubstringsInRange:(NSRange){0, [self length]}\n options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired\n usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {\n length++;\n }];\n\n return length;\n}\n\n@end\n\n\n@interface NSString (WSSLongestCommonSubsequence)\n\n- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;\n- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;\n\n@end\n\n@implementation NSString (WSSLongestCommonSubsequence)\n\n- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target\n{\n NSUInteger * const * scores;\n scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];\n\n return scores[[target WSSComposedLength]][[self WSSComposedLength]];\n}\n\n- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target\n{\n NSUInteger * const * scores;\n scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];\n\n //FIXME: Implement this.\n\n return nil;\n}\n\n- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{\n\n NSUInteger selfLength = [self WSSComposedLength];\n NSUInteger targetLength = [target WSSComposedLength];\n NSMutableData * scoresData = [NSMutableData dataWithLength:(targetLength + 1) * sizeof(NSUInteger *)];\n NSUInteger ** scores = [scoresData mutableBytes];\n\n for( NSUInteger i = 0; i <= targetLength; i++ ){\n scores[i] = [[NSMutableData dataWithLength:(selfLength + 1) * sizeof(NSUInteger)] mutableBytes];\n }\n\n /* Ranges in the enumeration Block are the same measure as\n * -[NSString length] -- i.e., 16-bit code units -- as opposed to\n * _composed_ length, which counts code points. Thus:\n *\n * Enumeration will miss the last character if composed length is used\n * as the range and there\'s a substring range with length greater than one.\n */\n NSRange selfFullRange = (NSRange){0, [self length]};\n NSRange targetFullRange = (NSRange){0, [target length]};\n /* Have to keep track of these indexes by hand, rather than using the\n * Block\'s substringRange.location because, e.g., @"o\xcc\xb6", will have\n * range {x, 2}, and the next substring will be {x+2, l}.\n */\n __block NSUInteger col = 0;\n __block NSUInteger row = 0;\n [target enumerateSubstringsInRange:targetFullRange\n options:NSStringEnumerationByComposedCharacterSequences\n usingBlock:^(NSString * targetSubstring,\n NSRange targetSubstringRange,\n NSRange _, BOOL * _0)\n {\n row++;\n col = 0;\n\n [self enumerateSubstringsInRange:selfFullRange\n options:NSStringEnumerationByComposedCharacterSequences\n usingBlock:^(NSString * selfSubstring,\n NSRange selfSubstringRange,\n NSRange _, BOOL * _0)\n {\n col++;\n NSUInteger newScore;\n if( [selfSubstring isEqualToString:targetSubstring] ){\n\n newScore = 1 + scores[row - 1][col - 1];\n }\n else {\n NSUInteger upperScore = scores[row - 1][col];\n NSUInteger leftScore = scores[row][col - 1];\n newScore = MAX(upperScore, leftScore);\n }\n\n scores[row][col] = newScore;\n }];\n }];\n\n return scoresData;\n}\n\n@end\n\nint main(int argc, const char * argv[])\n{\n\n @autoreleasepool {\n\n NSArray * testItems = @[@{@"source" : @"Ingso\xcc\xb6ll",\n @"targets": @[\n @{@"string" : @"Ingersoll",\n @"score" : @6,\n @"sequence" : @"Ingsll"},\n @{@"string" : @"Boylan",\n @"score" : @1,\n @"sequence" : @"n"},\n @{@"string" : @"New Ingersoll",\n @"score" : @6,\n @"sequence" : @"Ingsll"}]},\n @{@"source" : @"Ing",\n @"targets": @[\n @{@"string" : @"Ingersoll",\n @"score" : @3,\n @"sequence" : @"Ing"},\n @{@"string" : @"Boylan",\n @"score" : @1,\n @"sequence" : @"n"},\n @{@"string" : @"New Ingersoll",\n @"score" : @3,\n @"sequence" : @"Ing"}]},\n @{@"source" : @"New Ing",\n @"targets": @[\n @{@"string" : @"Ingersoll",\n @"score" : @3,\n @"sequence" : @"Ing"},\n @{@"string" : @"Boylan",\n @"score" : @1,\n @"sequence" : @"n"},\n @{@"string" : @"New Ingersoll",\n @"score" : @7,\n @"sequence" : @"New Ing"}]}];\n\n for( NSDictionary * item in testItems ){\n NSString * source = item[@"source"];\n for( NSDictionary * target in item[@"targets"] ){\n NSString * targetString = target[@"string"];\n NSCAssert([target[@"score"] integerValue] ==\n [source WSSLengthOfLongestCommonSubsequenceWithString:targetString],\n @"");\n// NSCAssert([target[@"sequence"] isEqualToString:\n// [source longestCommonSubsequenceWithString:targetString]],\n// @"");\n }\n }\n\n\n }\n return 0;\n}\n
Run Code Online (Sandbox Code Playgroud)\n