Sta*_*ses 7 java algorithm levenshtein-distance
I have two multi-line strings. I am using the following code to determine how similar the two are. It uses the Levenshtein distance algorithm.
public static double similarity(String s1, String s2) {
    // Normalize by the longer string so the result stays in [0, 1].
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) {
        longer = s2;
        shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) {
        return 1.0; // both strings are zero length
    }
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}

public static int editDistance(String s1, String s2) {
    // The comparison is case-insensitive.
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    // Single-row dynamic programming table, reused for every character of s1.
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else if (j > 0) {
                int newValue = costs[j - 1];
                if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                    newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
                }
                costs[j - 1] = lastValue;
                lastValue = newValue;
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}
But the code above does not work as expected.
For example, say we have the following two strings, s1 and s2:
S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?
S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?
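I then pass the strings above to the similarity method, but it does not find the exact percentage of difference. How can I optimize the algorithm?
Below is my main method.
Update: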
// Requires java.sql.Connection, PreparedStatement, ResultSet, SQLException and java.net.URISyntaxException.
public static boolean authQuestion(String question) throws SQLException {
    boolean isQuestionAvailable = false;
    Connection dbCon = null;
    try {
        dbCon = MyResource.getConnection();
        String query = "SELECT * FROM WORDBANK where WORD ~* ?;";
        PreparedStatement checkStmt = dbCon.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        while (rs.next()) {
            double re = similarity(rs.getString("question"), question);
            if (re > 0.6) {
                // Stop at the first match so a later, less similar row cannot reset the flag.
                isQuestionAvailable = true;
                break;
            }
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        // The finally block already closes the connection, so it is not closed here.
        e.printStackTrace();
    } finally {
        if (dbCon != null)
            dbCon.close();
    }
    return isQuestionAvailable;
}
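Let me suggest an approach...
You are using the edit distance, which gives you the number of characters in S1 that have to be changed, added, or removed to turn it into S2.
So, for example: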
S1 = "abc"
S2 = "cde"
The edit distance is 3, so the strings are 100% different (assuming you compare them character by character).
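To make that number concrete, here is a minimal sketch of a full-matrix Levenshtein computation. The class and method names (`LevenshteinSketch`, `distance`) are made up for illustration and are not part of the question's code; unlike the question's `editDistance`, this version is case-sensitive and keeps the whole table in memory for readability.

```java
final class LevenshteinSketch {

    // d[i][j] is the number of single-character edits needed to turn
    // the first i characters of a into the first j characters of b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = d[i - 1][j - 1] + cost;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("abc", "cde")); // prints 3
    }
}
```

For "abc" and "cde" the cheapest path is three substitutions, which is why the distance equals the string length.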
So you can get an approximate percentage if you do this:
S1 = "abc"
S2 = "cde"
edit = edit_distance(S1, S2)
percentage = min(edit/S1.length(), edit/S2.length())
The min is a workaround for cases where the strings are very different, for example:
S1 = "abc"
S2 = "defghijklmno"
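Here the edit distance is greater than the length of S1, so dividing by S1's length would give more than 100%; dividing by the length of the longer string is therefore probably the better choice.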
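The two normalizations discussed above can be compared side by side. This is only a sketch: the class and method names are hypothetical, and the edit distance is passed in as a plain number so the block stays independent of any particular Levenshtein implementation. The casts to `double` matter, because integer division would truncate the ratio.

```java
final class DifferenceRatioSketch {

    // The formula from the pseudocode: the smaller of edit/len1 and edit/len2.
    // Returns NaN or Infinity if either length is 0.
    static double minOfRatios(int edit, int len1, int len2) {
        return Math.min((double) edit / len1, (double) edit / len2);
    }

    // Alternative: divide once by the longer length.
    // Two empty strings are treated as 0% different.
    static double byLongerLength(int edit, int len1, int len2) {
        int longer = Math.max(len1, len2);
        return longer == 0 ? 0.0 : (double) edit / longer;
    }

    public static void main(String[] args) {
        // "abc" vs "cde": edit distance 3, lengths 3 and 3.
        System.out.println(minOfRatios(3, 3, 3));       // prints 1.0
        // "abc" vs "defghijklmno": edit distance 12, lengths 3 and 12.
        System.out.println(minOfRatios(12, 3, 12));     // prints 1.0
        System.out.println(byLongerLength(12, 3, 12));  // prints 1.0
    }
}
```

For non-empty strings the two functions return the same value, since taking the smaller ratio is the same as dividing by the larger length. The Levenshtein distance can never exceed the longer string's length, so the result stays between 0 and 1, and `1 - byLongerLength(...)` matches what the question's `similarity` method already computes.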
Hope that helps.