Log*_*nke 4 string character swift
I have a function in Swift that computes the Hamming distance of two strings and then puts them into a connected graph if the result is 1.
For example, "read" to "hear" returns a Hamming distance of 2 because read[0] != hear[0] and read[3] != hear[3].
At first, I thought my function was taking a long time because of the quantity of input (8,000+ word dictionary), but I knew that several minutes was too long. So, I rewrote my same algorithm in Java, and the computation took merely 0.3s.
I have tried writing this in Swift two different ways:
extension String {
    subscript (i: Int) -> String {
        return self[Range(i ..< i + 1)]
    }
}
private func getHammingDistance(w1: String, w2: String) -> Int {
    if w1.length != w2.length { return -1 }
    var counter = 0
    for i in 0 ..< w1.length {
        if w1[i] != w2[i] { counter += 1 }
    }
    return counter
}
Results: 434 seconds
private func getHammingDistance(w1: String, w2: String) -> Int {
    if w1.length != w2.length { return -1 }
    var counter = 0
    var c1 = w1, c2 = w2 // need to mutate
    let length = w1.length
    for i in 0 ..< length {
        if c1.removeFirst() != c2.removeFirst() { counter += 1 }
    }
    return counter
}
Results: 156 seconds
Results for the same algorithm in Java: 0.3 seconds
Where the function is called:
var graph: Graph
func connectData() {
    let verticies = graph.canvas // canvas is Array<Node>
                                 // Node has key that holds the String
    for vertex in 0 ..< verticies.count {
        for compare in vertex + 1 ..< verticies.count {
            if getHammingDistance(w1: verticies[vertex].key!, w2: verticies[compare].key!) == 1 {
                graph.addEdge(source: verticies[vertex], neighbor: verticies[compare])
            }
        }
    }
}
156 seconds is still far too slow for me. What is the most efficient way of comparing characters in Swift? Is there a workaround for computing the Hamming distance that does not involve comparing characters?
Edit 1: I am taking an entire dictionary of 4- and 5-letter words and creating a connected graph where the edges indicate a Hamming distance of 1. Therefore, I am comparing 8,000+ words to each other to generate edges.
Edit 2: Added method call.
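Unless you choose a fixed-length character model for your strings, methods and properties such as .count and .characters will have a complexity of O(n), or at best O(n/2), where n is the length of the string. Your function would perform better if you stored your data in arrays of characters (e.g. [Character]).
You can also combine the whole computation into a single pass using the zip() function: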
let hammingDistance = zip(word1.characters,word2.characters)
    .filter{$0 != $1}.count
But this still requires going through all the characters of every word pair.
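As a minimal sketch of the two suggestions above, the helper below converts each word to [Character] once and then counts mismatches with zip. The name hammingDistance(_:_:) and the sample words are assumptions made for illustration, and the code uses current Swift syntax (no .characters view).

```swift
// Hypothetical helper: Hamming distance over pre-converted character arrays.
// Returns nil when the lengths differ, since the distance is undefined there.
func hammingDistance(_ a: [Character], _ b: [Character]) -> Int? {
    guard a.count == b.count else { return nil }
    var distance = 0
    // zip pairs up the characters position by position.
    for (x, y) in zip(a, b) where x != y {
        distance += 1
    }
    return distance
}

// Convert each word once, outside the comparison loops.
let words = ["read", "hear", "head", "heard"].map { Array($0) }

let readHear = hammingDistance(words[0], words[1])   // "read" vs "hear"
let hearHead = hammingDistance(words[1], words[2])   // "hear" vs "head"
let hearHeard = hammingDistance(words[1], words[3])  // different lengths
```

Doing the conversion up front means the O(n) cost of walking a String is paid once per word rather than once per comparison.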
...
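Given that you are only looking for Hamming distances of 1, there is a faster way to obtain all the unique word pairs:
The strategy is to group words by the 4 (or 5) patterns that correspond to one "missing" letter. Each of these pattern groups defines a much smaller scope for word pairs, because two words that share no group cannot be at a distance of 1.
Each word will belong to as many groups as it has characters.
For example: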
"hear" will be part of the pattern groups:
"*ear", "h*ar", "he*r" and "hea*".
Any other word that falls into one of these 4 pattern groups is at a Hamming distance of 1 from "hear".
Here is how this can be implemented:
// Test data 8500 words of 4-5 characters ...
var seenWords = Set<String>()
var allWords = try! String(contentsOfFile: "/usr/share/dict/words")
    .lowercased()
    .components(separatedBy:"\n")
    .filter{$0.characters.count == 4 || $0.characters.count == 5}
    .filter{seenWords.insert($0).inserted}
    .enumerated().filter{$0.0 < 8500}.map{$1}

// Compute patterns for a Hamming distance of 1
// Replace each letter position with "*" to create patterns of
// one "non-matching" letter
public func wordH1Patterns(_ aWord:String) -> [String]
{
    var result : [String] = []
    let fullWord : [Character] = aWord.characters.map{$0}
    for index in 0..<fullWord.count
    {
        var pattern = fullWord
        pattern[index] = "*"
        result.append(String(pattern))
    }
    return result
}

// Group words around matching patterns
// and add unique pairs from each group
func addHamming1Edges()
{
    // Prepare pattern groups ...
    //
    var patternIndex:[String:Int] = [:]
    var hamming1Groups:[[String]] = []
    for word in allWords
    {
        for pattern in wordH1Patterns(word)
        {
            if let index = patternIndex[pattern]
            {
                hamming1Groups[index].append(word)
            }
            else
            {
                let index = hamming1Groups.count
                patternIndex[pattern] = index
                hamming1Groups.append([word])
            }
        }
    }

    // add edge nodes ...
    //
    for h1Group in hamming1Groups
    {
        for (index,sourceWord) in h1Group.dropLast(1).enumerated()
        {
            for targetIndex in index+1..<h1Group.count
            { addEdge(source:sourceWord, neighbour:h1Group[targetIndex]) }
        }
    }
}
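On my 2012 MacBook Pro, the 8500 words go through 22817 (unique) edge pairs in 0.12 seconds.
[EDIT] To illustrate my first point, I made a "brute force" algorithm using arrays of characters instead of Strings: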
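For readers who want to try the pattern-group idea on a handful of words, here is a compact, self-contained sketch in current Swift. It is not the code that was timed above: the function name hammingOnePairs(in:) and the sample word list are hypothetical, and it assumes the input words are unique and never contain "*".

```swift
// Hypothetical variant of the pattern-group approach.
// Returns every unordered pair of words at Hamming distance 1.
// Assumes the words are unique and contain no "*" character.
func hammingOnePairs(in words: [String]) -> [(String, String)] {
    // Map each pattern (e.g. "h*ar") to the words that match it.
    var groups: [String: [String]] = [:]
    for word in words {
        let letters = Array(word)
        for index in letters.indices {
            var pattern = letters
            pattern[index] = "*"
            groups[String(pattern), default: []].append(word)
        }
    }

    // Two distinct words share at most one pattern,
    // so each pair is produced exactly once.
    var pairs: [(String, String)] = []
    for group in groups.values where group.count > 1 {
        for i in 0..<(group.count - 1) {
            for j in (i + 1)..<group.count {
                pairs.append((group[i], group[j]))
            }
        }
    }
    // Dictionary order is unspecified, so sort for a stable result.
    return pairs.sorted { ($0.0, $0.1) < ($1.0, $1.1) }
}

let pairs = hammingOnePairs(in: ["head", "hear", "heal", "read", "heard", "beard"])
for (source, neighbour) in pairs {
    print(source, neighbour)
}
```

The dictionary with a default value replaces the separate pattern index and group array, at the cost of an unspecified iteration order, which is why the result is sorted before it is returned.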
let wordArrays = allWords.map{Array($0.unicodeScalars)}
for i in 0..<wordArrays.count-1
{
    let word1 = wordArrays[i]
    for j in i+1..<wordArrays.count
    {
        let word2 = wordArrays[j]
        if word1.count != word2.count { continue }

        var distance = 0
        for c in 0..<word1.count
        {
            if word1[c] == word2[c] { continue }
            distance += 1
            if distance > 1 { break }
        }
        if distance == 1
        { addEdge(source:allWords[i], neighbour:allWords[j]) }
    }
}
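This goes through the unique pairs in 0.27 seconds. The reason for the speed difference is the internal model of Swift Strings, which is not actually an array of equal-length elements (characters) but rather a chain of variable-length encoded characters (similar to the UTF model, where special bytes indicate that the following 2 or 3 bytes are part of a single character). Such a structure has no simple Base+Displacement indexing; it must always be iterated from the beginning to reach the Nth element.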
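The indexing cost can be seen in a small sketch like the following, written in current Swift. The sample strings are made up for illustration.

```swift
let word = "hear"

// String has no Int subscript. An index has to be derived by walking
// from a known position, which costs O(n) for an offset of n.
let third = word[word.index(word.startIndex, offsetBy: 2)]

// Converting to an array pays for that walk once;
// every lookup afterwards is constant time.
let letters = Array(word)
let thirdFromArray = letters[2]

// Characters do not all occupy the same number of bytes, which is
// why a byte offset cannot be computed from a character position.
let accented = "h\u{E9}llo"               // "héllo"
let characterCount = accented.count       // number of Characters
let utf8ByteCount = accented.utf8.count   // number of UTF-8 code units
```

Here the "é" takes two UTF-8 code units, so the string has five characters but six bytes in its utf8 view.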
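Note that I used unicodeScalars instead of Character because they are fixed-width representations of characters (each Unicode.Scalar wraps a single 32-bit value) that allow a direct binary comparison. The Character type is not as simple and takes longer to compare.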
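A short sketch of why the two comparisons differ, using made-up sample strings: a Character can be built from several scalars, and Character equality follows Unicode canonical equivalence, whereas scalar equality only compares code point values.

```swift
let precomposed = "\u{E9}"      // "é" as a single scalar
let decomposed = "e\u{301}"     // "e" followed by a combining acute accent

// Both strings contain exactly one Character, and the two Characters
// compare equal because they are canonically equivalent.
let sameCharacters = Array(precomposed) == Array(decomposed)

// The scalar views differ: one scalar versus two.
let precomposedScalars = Array(precomposed.unicodeScalars)
let decomposedScalars = Array(decomposed.unicodeScalars)
let sameScalars = precomposedScalars == decomposedScalars
```

For a dictionary of plain lowercase ASCII words the two views give the same answer, so comparing scalars is a safe shortcut there. For arbitrary text, scalar comparison may treat visually identical words as different.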