Hri*_*sto 37 java algorithm performance trie levenshtein-distance
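Done. Below is the final code, which passes all of my tests. Again, it is modeled after Murilo Vasconcelos's modified version of Steve Hanov's algorithm. Thanks to everyone who helped!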
/**
 * Computes the minimum Levenshtein Distance between the given word (represented as a list of Characters) and the
 * words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
 * distance using a Trie" and Murilo Vasconcelo's revised version in C++.
 *
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 *
 * @param word the characters of the input word
 * @return the minimum Levenshtein Distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {
    theTrie.minLevDist = Integer.MAX_VALUE;
    int iWordLength = word.size();

    // First row of the distance matrix: cost of building each prefix of word from the empty string.
    int[] currentRow = new int[iWordLength + 1];
    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }

    // Start one search per child of the root, passing that child's own letter.
    // Passing the root with each letter of the input word instead would score the
    // stored words as if that letter had been prepended to them.
    for (Map.Entry<Character, TrieNode> entry : theTrie.root.children.entrySet()) {
        traverseTrie(entry.getValue(), entry.getKey(), word, currentRow);
    }
    return theTrie.minLevDist;
}
/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
 *
 * @param node the current TrieNode
 * @param letter the character that the current TrieNode represents
 * @param word the characters of the input word
 * @param previousRow the row of the Levenshtein Distance matrix computed for the parent node
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    // Track the smallest value in this row so the subtree can be pruned below.
    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));

        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }

    // The last cell is the distance between the input word and the word ending at this node.
    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    // Descend only if some entry in this row can still beat the best distance found so far.
    if (minimumElement < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}
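Finally, I've managed to get this working for most of my test cases. My implementation is practically a direct translation of Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor and/or optimize this algorithm? Here is the code...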
public int search(String word) {
    theTrie.minLevDist = Integer.MAX_VALUE;
    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }

    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
    return theTrie.minLevDist;
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minElement(currentRow) < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);
        }
    }
}
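Thank you to everyone who contributed to this question. I tried getting Levenshtein Automata to work, but I couldn't make it happen.
So I'm looking for suggestions on refactoring and/or optimizing the code above. Let me know if anything is confusing. As always, I can provide the rest of the source code if needed.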
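One possible direction for such a refactoring, as a minimal sketch rather than a definitive implementation: the shared `minLevDist` field can be dropped by threading the best distance found so far through the recursion, and the search can start from every child of the root rather than only those matching a letter of the input. The names `TrieSearchSketch`, `Node` and `minDistance` are hypothetical and do not come from the code above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, self-contained variant of the trie search; all names are illustrative. */
final class TrieSearchSketch {

    /** Minimal trie node: children keyed by character plus an end-of-word flag. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node current = root;
        for (char letter : word.toCharArray()) {
            current = current.children.computeIfAbsent(letter, k -> new Node());
        }
        current.isWord = true;
    }

    /** Smallest Levenshtein distance from word to any stored word, or Integer.MAX_VALUE for an empty trie. */
    int minDistance(String word) {
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i; // cost of building word[0..i) from the empty string
        }
        // The root stands for the empty string, whose distance to word is word.length().
        int best = root.isWord ? word.length() : Integer.MAX_VALUE;
        for (Map.Entry<Character, Node> entry : root.children.entrySet()) {
            best = search(entry.getValue(), entry.getKey(), word, firstRow, best);
        }
        return best;
    }

    /** Computes the matrix row for this node and returns the (possibly improved) best distance. */
    private int search(Node node, char letter, String word, int[] previousRow, int best) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];

        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[i]);
        }

        if (node.isWord) {
            best = Math.min(best, currentRow[size - 1]);
        }
        // Row minima never decrease further down the trie, so this subtree can be
        // skipped once no entry in the current row beats the best distance so far.
        if (rowMin < best) {
            for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
                best = search(entry.getValue(), entry.getKey(), word, currentRow, best);
            }
        }
        return best;
    }
}
```

With "cat", "cart" and "dog" inserted, `minDistance("cut")` returns 1 and `minDistance("cat")` returns 0. Returning the best value instead of mutating a field also makes the search safe to call from several threads on a trie that is no longer being modified.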
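So I've implemented a simple Trie data structure, and I've been trying to follow Steve Hanov's Python tutorial to compute the Levenshtein distance. Actually, I'm interested in computing the minimum Levenshtein distance between a given word and the words in the Trie, so I've been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It's not working very well, but here is my Trie class: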
public class Trie {

    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {
        int length = word.length();
        TrieNode current = this.root;

        if (length == 0) {
            current.isWord = true;
        }
        for (int index = 0; index < length; index++) {
            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }
            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}
... and the TrieNode class:
import java.util.HashMap;
import java.util.Map;

public class TrieNode {

    public final int ALPHABET = 26;

    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {
        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter);
            }
        }
        return null;
    }
}
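Now, I've tried to implement the search the way Murilo Vasconcelos has it, but something is off and I need some help debugging it. Please give suggestions on how to refactor and/or point out where the bugs are. The very first thing I'd like to refactor is the "minCost" global variable, but that's the smallest of my problems. Anyway, here is the code...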
public void search(String word) {
    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int replace, insertCost, deleteCost;

    for (int i = 1; i < size; i++) {
        char c = word.charAt(i - 1);

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);

        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }

    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }
    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {
        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}
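I apologize for the lack of comments. So what am I doing wrong?
I've been reading an article, Fast and Easy Levenshtein distance using a Trie, in the hope of finding an efficient way to compute the Levenshtein distance between two Strings. My main goal is, given a large set of words, to find the minimum Levenshtein distance between an input word and that set of words.
In my trivial implementation, for each input word I compute the Levenshtein distance between the input word and every word in the set, and return the minimum. It works, but it is not efficient...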
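For reference, the trivial approach described above might look roughly like the following sketch. The class and method names (`BruteForceSketch`, `levenshtein`, `minDistance`) are assumptions made for illustration; the original trivial implementation is not shown in this post.

```java
import java.util.List;

/** Hypothetical brute-force baseline: one full distance computation per stored word. */
final class BruteForceSketch {

    /** Standard two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a to b[0..j)
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            previous = current;
        }
        return previous[b.length()];
    }

    /** Minimum distance from input to any word in the list, or Integer.MAX_VALUE if the list is empty. */
    static int minDistance(String input, List<String> words) {
        int best = Integer.MAX_VALUE;
        for (String word : words) {
            best = Math.min(best, levenshtein(input, word));
        }
        return best;
    }
}
```

Every query costs one full matrix per stored word, so words sharing a prefix repeat the same rows over and over. That repeated work is what a trie-based search avoids.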
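I've been looking for implementations of a Trie in Java, and I've come across two seemingly good sources:
However, these implementations seem too complicated for what I'm trying to do. As I've been reading through them to understand how they work and how Trie data structures work in general, I've only become more confused.
So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all of them. Is my intuition correct?
Once that is implemented, the next task is to compute the Levenshtein distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of heap memory once it reaches the recursive search. So how would I compute the Levenshtein distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie... it is inefficient.
It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me... I've never implemented a Trie... so I have plenty to learn from this experience.
Thanks.
p.s. I can provide any source code if need be. Also, I've already read through and tried using a BK-Tree as suggested in Nick Johnson's blog, but it's not as efficient as I think it can be... or maybe my implementation is wrong.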
Rob*_*ert 10
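From what I can tell, you don't need to improve the efficiency of the Levenshtein Distance computation itself; you need to store your strings in a structure that saves you from running the distance computation so many times, i.e. by pruning the search space.
Since Levenshtein distance is a metric, you can use any of the metric space indices that take advantage of the triangle inequality. You mentioned BK-Trees, but there are others, e.g. Vantage Point Trees, Fixed-Queries Trees, Bisector Trees and Spatial Approximation Trees. Here are their descriptions:
Burkhard-Keller Tree
Nodes are inserted into the tree as follows: for the root node, pick an arbitrary element from the space; add unique edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.
Fixed-Queries Tree
As with BKTs, except: elements are stored at the leaves; each leaf has multiple elements; for each level of the tree, the same pivot is used.
Bisector Tree
Each node contains two pivot elements with their covering radius (the maximum distance between the centre element and any of its subtree elements); filter into two sets those elements that are closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.
Spatial Approximation Tree
Initially all elements are in a bag; choose an arbitrary element to be the pivot; build a collection of nearest neighbours within range of the pivot; put each remaining element into the bag of the element nearest to it from the collection just built; recursively form a subtree from each element of this collection.
Vantage Point Tree
Choose a pivot from the set arbitrarily; calculate the median distance between this pivot and each element of the remaining set; filter the elements of the set into left and right recursive subtrees such that those with distances less than or equal to the median form the left and those with greater distances form the right.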
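To make the pruning idea concrete, here is a minimal sketch of the first structure in the list, a Burkhard-Keller tree, used to find the minimum distance from a query to a set of words. The names `BkTreeSketch`, `insert` and `minDistance` are illustrative assumptions, and the code is one straightforward reading of the description above rather than a reference implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical BK-tree over strings under Levenshtein distance. */
final class BkTreeSketch {

    private static final class Node {
        final String word;
        // Edge label k leads to the subtree of words at distance exactly k from this node's word.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    void insert(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node current = root;
        while (true) {
            int d = levenshtein(word, current.word);
            if (d == 0) {
                return; // word is already stored
            }
            Node child = current.children.get(d);
            if (child == null) {
                current.children.put(d, new Node(word));
                return;
            }
            current = child; // edge exists: the child becomes the next pivot
        }
    }

    /** Minimum distance from query to any stored word, or Integer.MAX_VALUE for an empty tree. */
    int minDistance(String query) {
        return root == null ? Integer.MAX_VALUE : search(root, query, Integer.MAX_VALUE);
    }

    private int search(Node node, String query, int best) {
        int d = levenshtein(query, node.word);
        best = Math.min(best, d);
        for (Map.Entry<Integer, Node> edge : node.children.entrySet()) {
            // Triangle inequality: every word x under edge k satisfies
            // distance(query, x) >= |d - k|, so the subtree is skipped
            // when that lower bound cannot beat the best distance so far.
            if (Math.abs(d - edge.getKey()) < best) {
                best = search(edge.getValue(), query, best);
            }
        }
        return best;
    }

    /** Standard two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(substitution, Math.min(previous[j] + 1, current[j - 1] + 1));
            }
            previous = current;
        }
        return previous[b.length()];
    }
}
```

For example, after inserting "book", "books", "cake", "boo", "cape", "cart", "boon" and "cook", `minDistance("caqe")` returns 1. How many distance computations the pruning saves depends on the word set and on how quickly a small `best` is found, so it is worth measuring against the brute-force scan on your own data.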